PythonTop40 and PythonTop40Server released

As mentioned in my previous post, I’ve started to develop a series of Python-based data sources that will hopefully promote the adoption of coding within UK schools.

I’m referring to the concept as Code-Further, and simply put, the idea is to allow students to access relevant and hopefully meaningful data over the Internet using Python.

The first release of Python code is PythonTop40, which provides API access to the UK Top 40 albums and singles data. The chart information is hosted by another Python project which can be found here.

PythonTop40 can be installed from PyPI using pip as follows:

pip install pythontop40

Accessing the chart information from Python is relatively simple – as shown in the Python 3 snippet below:

from pythontop40 import Top40

top40 = Top40()

for entry in top40.singles:
    print(entry.position, entry.title, 'BY', entry.artist)

and this will produce output something like this:

.
.
.
10 Real Love BY Clean Bandit & Jess Glynne
11 A Fairytale Of New York (feat. Kirsty MacColl) BY The Pogues
12 Steal My Girl BY One Direction
13 All About That Bass BY Meghan Trainor
14 Like I Can BY Sam Smith
15 Dangerous (feat. Sam Martin) BY David Guetta
16 All I Want For Christmas Is You BY Mariah Carey
.
.
.
40 All Of Me BY John Legend
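
Album data can be pulled in exactly the same way; assuming the albums attribute mirrors singles (the library exposes both charts), the equivalent loop is:

for entry in top40.albums:
    print(entry.position, entry.title, 'BY', entry.artist)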

The source code for PythonTop40 can be found on BitBucket, and I welcome contributions or reported issues. The PythonTop40 documentation can be found on ReadTheDocs.

Danny Goodall

PythonTop40 – Get the UK Top 40 Albums and Singles from Python

As part of my efforts to encourage kids to get into coding, I’m developing a series of APIs that provide meaningful and hopefully relevant information that can be accessed during coding lessons or after-school clubs.

The first of these efforts is the PythonTop40 API, and whilst it is still under development and hasn’t been released, I need a home page for it – hence this post.

I’ll add more details over time, but for the moment I’d like to acknowledge the work done by Ben Major on scraping the BBC’s web site to provide the raw data that I’m using for this API. Kudos Ben!

The API source code is hosted on BitBucket. I’ll edit this post with more details as and when I have them.

Danny Goodall

Heroku Versus AppEngine and Amazon EC2 – Where Does It Fit In?

I’ve just had a really pleasant experience looking at Heroku – the ‘cloud application platform’ from Salesforce.com – but it’s left me wondering where it fits in.

A mate of mine who works for Salesforce.com suggested I look at Heroku after I told him that I’d had some good and bad experiences with Google’s AppEngine and Amazon’s EC2. I’d been looking for somewhere to host some Python code that I’d written in my spare time, and I’d found pros and cons with both platforms.

As it turns out it was a good suggestion, because Heroku’s approach is very good for the spare-time developer like me. That’s not to say that it’s only an entry-level environment – I’m sure it will scale with my needs – but getting up and running with it is very easy.

Having had some experience of the various platforms, I’m wondering where Heroku fits in. My high-level thoughts…

Amazon’s EC2 – A Linux prompt in the sky

Starting with EC2, I found it the simplest concept to get to grips with but by far the most complex to configure. For the uninitiated, EC2 provides you with a machine instance in the cloud, which is a very simple concept to understand. Every time you start a machine instance you effectively get a Linux prompt, of varying degrees of power and capacity, in the sky. What this means is that you have to manually configure the OS, database, web infrastructure, caching, etc. This gives unrivalled flexibility, and after all, we’ve all had to configure our own development and test environments anyway, so we should understand the technology.

But imagine that you’ve architected your system to have multiple machines hosting the database, multiple machines processing logic and multiple web servers managing user load; you have to configure each of these instances yourself. This is non-trivial, and if you want to be able to flexibly scale each of the machine layers then you own that problem yourself (although there are aftermarket solutions to this too).

But what it does mean is that if you’re taking a system that is currently deployed on internal infrastructure and deploying it to the cloud, you can mimic the internal configuration in the cloud. This in turn means that the application itself does not necessarily need to be re-architected.

The sheer amount of additional infrastructure that Amazon makes available to cloud developers (queuing, cloud storage, MapReduce farms, caching, etc.), coupled with their experience of managing both the infrastructure and the associated business models, makes Amazon an easy choice for serious cloud deployments.

Google AppEngine – Sandbox deployment dumbed down to the point of being dumb?

So I’m a fan of Google, in the same way that I might say I’m a fan of oxygen. It’s omnipresent, and it turns out that it’s easier to use a Google service than not – for pretty much all of Google’s services. They really understand the “giving crack cocaine free to school kids” model of adoption. They also like Python (my drug of choice), so using AppEngine was a natural choice for me.

AppEngine presents you with an abstracted view of a machine instance that runs your code and supports Java, Python or Google’s new Go language. With such language restrictions it’s clear to see that, unlike EC2, Google is presenting developers with a cosseted, language-aware, sand-boxed environment in which to run code. The fact that Google tunes the virtual machines to host and scale code optimally is, depending on your mindset, either a very good thing or close to being the end of the world. For me – not wanting, not knowing how, and not needing to push the bounds of the language implementation – I found the AppEngine environment intuitive and easy. It’s Google, right?

But some of the Python restrictions, such as not being able to use modules that contain C code, are just too restrictive. Google also doesn’t present the developer with a standard SQL database interface, which adds another layer of complexity as you have to use Google’s high-replication datastore. Google would argue, with some justification I’m sure, that you can’t use a standard SQL database in an environment where the infrastructure that happens to be running your code at any given moment could be anywhere in Google’s data centres worldwide. But it meant that my code wouldn’t port without a little bit of attention.

The other issue I had with Google is that the pricing model is based on quotas for various internal resources. Understanding how your application is likely to use these resources, and therefore arriving at a projected cost, is pretty difficult. So whilst Google has made getting code into the cloud relatively easy, it has also put in place too many restrictions to make it of serious value.

Heroku – Goldilocks’ porridge: too hot, too cold or just right?

It would be tempting, and not a little symmetrical, to place Heroku squarely between the two other environments above. And whilst that is sort of where it fits in my mind, it would also be too simplistic. Heroku does avoid the outright complexity of EC2 and seems to also avoid some of the terminal restrictions (although it’s early days) of AppEngine. But the key difference from EC2 lies in how Heroku manages Dynos (Heroku’s name for an executing instance). To handle scale and to maximise use of its own resources, Heroku runs your code only for as long as it is actually being executed. After that, the code, the machine instance and any data it contained are forgotten. This means that things like a persistent file system, or having a piece of your code always running, cannot be relied upon.

These problems are pretty easily surmountable. Amazon’s S3 can be used as a persistent file store, and Heroku apps can also launch worker processes, which can be relied upon not to be restarted in the way the web Dynos are.
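
By way of illustration, pushing a generated file out to S3 with the boto library takes only a few lines – a minimal sketch, with made-up bucket and key names, and AWS credentials assumed to be set in the environment:

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection()                              # reads AWS credentials from the environment
bucket = conn.get_bucket('my-app-output')          # hypothetical bucket name
key = Key(bucket, 'reports/latest.csv')
key.set_contents_from_filename('/tmp/latest.csv')  # survives the Dyno being recycled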

Scale is managed intelligently by Heroku in that you simply increase the number of web and worker processes that your application has access to – obviously this also has an impact on the cost. Finally, there is an apparently thriving add-on community that provides (at additional monthly cost) access to caching, queuing and in fact any type of additional service that you might otherwise have installed for free on your Amazon EC2 instance.

Conclusion

I guess the main conclusion of this simple comparison is that whilst Heroku does make deploying web apps simple, you can’t simply take code already deployed on internal servers and git push it to Heroku. Heroku forces you to think about the interactions your application will have with its new deployment environment, because if it didn’t, your app wouldn’t scale. This is also true of Google’s AppEngine, but the restrictions that AppEngine places on the type of code you can run make it of limited value to my mind. These restrictions do not appear to be there with Amazon EC2: you can simply take an internally hosted system and build a deployment environment in the cloud that mimics the current environment. But at some point down the line you’re going to have to think about making the code a better cloud citizen; with EC2 you’re simply able to defer the point of re-architecture. And the task of administering EC2 is a full-time job in itself and should not be underestimated. Heroku is amazingly simple by comparison.

Anyway, those are my top-of-mind thoughts on the relative strengths and weaknesses of the different cloud hosting solutions I’ve personally looked at. Right now I have to say that Heroku really does strike an excellent balance between ease and capability. Worth a look.

Danny Goodall

Inserting Google Chart Tools Visualizations into WordPress

EDIT: When I wrote this post no plugin existed to create and embed Google Charts within a WordPress blog. I’ve recently been made aware of the ChartBoot for WordPress plugin, which seems to do exactly what I needed – although I haven’t tried it myself yet. It might be worth taking a look.

I needed to insert charts from Google’s Chart Tools into my other WordPress blog but found that it wasn’t straightforward. There were a few WordPress plugins that claimed to be able to do it but none seemed to do exactly what I needed so I came up with the approach below.

The finished product can be found in this open source versus closed source ESBs extract from my blog. I should first say, by way of explanation, that I’m not a proper programmer and am certainly not well versed in PHP, so I’m sure others could improve on this approach. But it certainly does what I need.

I should also say that the approach I have taken relies on already having produced the JavaScript code that produces the chart. I simply needed a way to inject that code into the WordPress blog and to have it render the chart where I wanted it. So if you’re looking for an approach that automates the chart production you won’t find it here.

First a recap of how Google’s Chart Tools work.

  1. First you load Google’s JavaScript chart library. 
  2. You then specify a function to be called when the page loads. 
  3. This function needs to contain the code to create the chart, pass the data and tell Google to render it.
  4. This function must also be passed a page element (usually a <div> tag) which controls where in your web page Google’s code will render the chart.

So to accomplish this I decided that I needed to do two things.

  • Firstly, I needed to modify my WordPress theme’s header.php code to ensure that I could load Google’s JavaScript routines and build a mechanism to insert the chart code. 
  • Secondly I had to create some additional fields on WordPress’ post entry page that allowed me to:
    • Specify whether the Google chart API should be loaded for this Post (#1 in the list above) – i.e. we don’t want to load the Google Chart Tools code unless this post actually has a chart to be rendered
    • Specify the code that should be called to render each chart (#3 on the list above)

Creating a Placeholder in the HTML

The first thing to do is to create placeholder(s) for the chart(s) in the HTML of the WordPress post. To do this switch to HTML view in the WordPress editor, locate the position where you want to insert the chart and add a <div> tag. Specify an ID that is unique to this chart. So for example, if I want to insert two charts into my post I would insert the following HTML.

This is some content that goes above the chart.
<div id="medchart1">This text is replaced by the chart but WordPress seems to need some text in the DIV or it removes it when you switch back to Visual mode.</div>
And this is some content that goes below the first chart and above the second chart
<div id="medchart2">This text is replaced by the chart</div>
And this is text that goes below the second chart.

Creating the Custom Fields

OK, so that takes care of telling Google where to render the chart. Now I need a way to create custom fields in WordPress that I can also access from the PHP code in WordPress’ header.php. For this I found the truly excellent Advanced Custom Fields plugin. This plugin has two components – a UI that allows you to create the field groups and field codes, and the logic to substitute those fields using PHP when WordPress creates the page.

So I created a number of fields as shown below:

You can see the first field lgGV is the flag to say whether the Google Visualisation Chart Tools API should be loaded for this page. I’ve then shown three other fields named szGVDrawChartFunction1..3. These fields will contain the actual JavaScript code that, when executed, will draw the chart.

Expanding the first field shows more detail.

Specifying Chart Details in the WordPress Post

So, now when I edit a WordPress post I can enter values into the fields above to reflect the chart settings I want for that particular post, as shown below:

So for the post above, I’ve effectively created a number of PHP variables that will be available from WordPress’ PHP code. These are lgGV and szGVDrawChartFunction1..3. In my example above only szGVDrawChartFunction1 and 2 have values; the third is left blank.

I’ve also modified the code that produces the charts to reference the <div> IDs we created above: the first code section references medchart1, and the second references medchart2. This tells the Google Chart code to render each chart inside the corresponding <div> block.

Modifying header.php to Access the Advanced Custom Fields

The Advanced Custom Fields plugin provides a number of PHP functions that you can use to access these post variables from PHP. These include get_field(), which returns a field’s value, and the_field(), which echoes it.

Actually, the the_repeater_field() function is only available in their paid plugin. I haven’t used that version but, as you can see from the code below, it would make my code much more streamlined.

So, for example, to access my lgGV logical field within my WordPress theme’s header.php code, I can simply call get_field('lgGV') and test the result.

The author of the plugin has done a great job of making it so simple to access field codes associated with specific posts from within the WordPress PHP subsystem.

Modifying the header.php code

So now I need to modify my theme’s header.php code to examine and use these post-level fields.

Here it is.

I inserted this code immediately under the wp_head() function call in my existing header.php file. This ensures that this code block is run every time my blog creates and serves a page. I don’t know enough about WordPress internals or theme development to know if this is the correct place for every theme. But it works for me.

On to the code. I’m sure if you understand PHP better than I do it will be self-explanatory, but just in case:

Line 1 checks to see if the custom post field lgGV has been set to true and, if it has, loads the Google libraries (lines 2-5). If not, the entire code block is skipped. For my blog I will only occasionally insert charts, so I don’t want the overhead of making my visitors’ browsers load libraries they aren’t going to need.

Line 6 onwards is the declaration of the function that will be called once the page has finished loading.

Line 9 calls the first draw chart function drawChart1(). Here I assume that some code has been entered into the szGVDrawChartFunction1 custom field for this WordPress post.

Lines 10-12 check to see if anything was entered into the 2nd draw-chart field and, if it was, drawChart2() is called. This is repeated for the 3rd chart as well. Checking for the field being null stops me having to define and call empty functions.

Lines 17, 18 and 19 define the drawChart1..3 functions. You can see that all I do is fill the function braces {} with the code that was entered into the relevant custom post fields in WordPress – szGVDrawChartFunction1..3.

Important

A couple of important things to note here.

Firstly, the function definitions in lines 17-19 include the braces {}. So the code that is pasted inside them from the szGVDrawChartFunction1..3 custom post fields should be ONLY the code INSIDE the braces – not the actual braces themselves.

Secondly, the code that is pasted into these fields cannot currently contain new lines. I’m not sure why this is, but I assume it is something to do with the way the PHP is rendered. As a result I have to ensure that my chart definition code has all of the new lines removed – in effect each function appears on one line.

Things to Improve in Future Versions

I recognise that there is a lot of redundant code here and that it would be better to use a loop to cycle through the various options, but the Advanced Custom Fields plugin doesn’t support repeating fields in the free version. I did try to buy the commercial version but there seemed to be a problem with the author’s site at the time. I’ve just checked again and the store is now back online, so I will get the commercial version and make my code tighter.

I also realise that inserting code into a template like this could be a security hole if someone could change the contents of my database, so I will have to build in some sort of protection too.

If you found this useful and understand PHP and/or WordPress I’d really welcome suggestions or improvements. Alternatively if you’ve found an existing plugin that can do the same I’d love to hear from you.

Dan.

Counting Syllables Accurately in Python on Google App Engine

I wanted to be able to count syllables accurately in Python and looked around for existing code that I could re-use. I found one or two routines written in PHP that looked promising so I ported them to Python but was pretty disappointed with the accuracy.

I also found a Python routine that is part of the contributed code for NLTK that was not bad, but again struggled with some words. You see, I had naively thought this would be a simple exercise. I hadn’t realised that syllable counting in the English language is pretty difficult stuff, with so many exceptions that it makes the most elegant algorithm convoluted and clumsy.

I then stumbled across this snippet of code by Jordan Boyd-Graber, via the excellent Running with Data site, and it seemed so elegant that I thought it must be too simplistic. But far from it: it is very accurate for the words it knows.

The code is shown here.
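
In essence it amounts to this – a minimal version of the snippet, assuming NLTK and its cmudict corpus are installed:

from nltk.corpus import cmudict

d = cmudict.dict()

def nsyl(word):
    # Each vowel phoneme ends in a stress digit (0, 1 or 2), so counting
    # those gives the syllable count; one count per known pronunciation.
    return [len([ph for ph in pron if ph[-1].isdigit()]) for pron in d[word.lower()]]

nsyl('syllable')   # [3]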

It works by looking up the pronunciation of the word in the Carnegie Mellon University’s pronunciation dictionary that is part of the Python-based Natural Language Toolkit (NLTK). This returns one or more pronunciations for the word. Then the clever bit is that the routine counts the vowel phonemes in the pronunciation, each of which carries a stress marker. The raw entry from the cmudict file for the word SYLLABLE is shown below.

SYLLABLE 1 S IH1 L AH0 B AH0 L

The vowel sounds are denoted by the phonemes ending in a number; the digit records the stress level of that vowel (0 for no stress, 1 for primary stress and 2 for secondary stress), so counting them gives one per syllable. Anyway, for the words that the dictionary knows about (120,000+ I believe), this represents a very accurate method for obtaining the syllable count.

However, there is a problem. As my target environment is Google App Engine, that little line at the top of the code that says…

import nltk

…ruins your entire afternoon.

You see, NLTK and Google App Engine don’t work well together due to NLTK’s recursive imports. I spent some time trying to unwind the recursive imports around cmudict so that it would run on Google App Engine, but to no avail.

So then I thought laterally and decided to build my own structure from the cmudict file (the raw 3.6MB text file that NLTK loads and wraps an object around). My plan was as follows:

  1. Parse the raw cmudict file
  2. For every word in the file call the above syllable count routine
  3. Store the resultant syllable count in a word -> syllable lookup structure (a Python Dictionary)
  4. Pickle the resultant dictionary
  5. Un-pickle it where it is needed

And this seems to have worked quite well.

The code below builds the pickle file.
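
A sketch of that build step, assuming the raw cmudict file uses the word / pronunciation-index / phonemes layout shown above (file names here are illustrative):

import pickle

counts = {}
with open('cmudict.txt') as source:                 # the raw cmudict text file
    for line in source:
        parts = line.split()
        word, phonemes = parts[0].lower(), parts[2:]   # parts[1] is the pronunciation index
        # Same trick as before: one stress digit per vowel sound
        counts.setdefault(word, []).append(len([ph for ph in phonemes if ph[-1].isdigit()]))

with open('syllables.pickle', 'wb') as target:
    pickle.dump(counts, target)

Un-pickling it where it’s needed is then just pickle.load(open('syllables.pickle', 'rb')).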

This results in a dictionary lookup that gives an accurate syllable count (or counts, because some words have multiple pronunciations and therefore multiple syllable counts) for the words in its dictionary.

Words not in the Dictionary

But what about words that the dictionary doesn’t know about? The way I handled that is to build a fallback routine into the code. The best (most accurate) mechanical routine I found was PHP-based and is part of Russell McVeigh’s site:

http://www.russellmcveigh.info/content/html/syllablecounter.php

I ported Russell’s code to Python and added a couple of other exceptions that I found. Most of the mechanical syllable calculation routines I found work on the following basic rules:

  1. Count the number of vowels in the word
  2. Subtract one for any silent vowels such as the e at the end of a word
  3. Subtract any additional vowels in vowel pairs/triplets (ee, ei, eau, etc.) i.e. each group of multiple vowels scores only one vowel

The number you have left is the number of syllables. However, there then follows a series of adjustments: if certain patterns are recognised in the word, syllables are added or taken away, and then finally you end up with the correct syllable count. But even with all this adjustment it’s never entirely accurate – though perhaps good enough for those words not in the cmudict.
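
A bare-bones sketch of those three rules (the pattern-based adjustments, which are where most of the effort goes, are omitted here):

import re

def guess_syllables(word):
    word = word.lower()
    # Rules 1 and 3: each run of consecutive vowels scores one syllable
    count = len(re.findall(r'[aeiouy]+', word))
    # Rule 2: a silent trailing 'e' (but not a final '-le') doesn't count
    if word.endswith('e') and not word.endswith('le') and count > 1:
        count -= 1
    return max(count, 1)

def count_syllables(word, lookup):
    # Use the pickled cmudict lookup where possible, guess otherwise
    return lookup.get(word.lower(), [guess_syllables(word)])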

So the code I’ve developed is really simple. It looks up syllable counts in the cmudict and returns the result if found; if not, it has a guess at the syllable count instead. I’d really like to share the full code with you, but something in my WordPress theme or the syntax highlighter that I use objects to something in the code. Perhaps, as I’m not a proper programmer, it doesn’t like my esoteric, bastardised Hungarian-notation variable names?

So I can’t post it here at the moment but will try to get that fixed. If you’re interested contact me and I’ll happily share it.

Danny Goodall

Edit – It looks like I *might* have solved that problem by using a different syntax highlighter.

FileZilla, SFTP and Amazon EC2

I’ve just made a little discovery, so thought I would note it in these pages because I’m sure I’ll need it again.

I’m investigating Amazon’s EC2 at the moment and am trying to put some code up there, and struggling to use FTP securely to do it. I use FileZilla on Ubuntu, and it seems that FileZilla’s Site Manager wants me to enter a username and password combination to log in to the EC2 instance. However, in accordance with Amazon’s recommendations, I’m running without user passwords and am instead using public key authentication. But there appears to be nowhere to specify the local private key file location in FileZilla’s Site Manager dialogue.

The answer is that hidden in FileZilla’s settings (Edit->Settings), under the Connection->SFTP section, is a dialogue that allows you to enter the location of the local keypair file. So I added my local keypair, at which point FileZilla warned me that it needed to convert my .pem format to a .ppk format. I let it do this and specified the location and name of the converted file. Then, going back to the Site Manager, I set my Amazon host’s Login Type to Interactive and tried again, and I was straight in. Interestingly, I didn’t need to tie the Site Manager entry for my EC2 host to the keypair – just adding the keypair to the general settings as described above did the trick. No messy passwords and no compromised security.

Danny Goodall

arcanicity.appspot.com – How much jargon does your text contain?

My first Google App Engine project went live yesterday. This one deals with estimating the readability of a text when jargon such as acronyms and abbreviations is taken into account.

<marketing-bit>As I’ve mentioned before, I’m developing a Natural Language Processing system called ScrewTinny (scrutiny) that analyses the language that high-tech vendors use to take their products to market. Knowing how much jargon text contains allows me to infer which audience the text is aimed at (IT Technical, IT Business, Business). And that’s important to me.</marketing-bit>

Anyway, readability indexes are not new (Flesch-Kincaid, Coleman-Liau, Gunning Fog, SMOG, etc.) and so I looked for an existing index that took jargon into account that I could use. I did a great deal of searching and even asked a number of people who have an interest in this area, but I couldn’t find one. So I developed my own – and the Goodall Arcanicity Index was born. It’s got a long way to go until it is truly accurate but I’ve now coded it in Python and decided to put it up on Google’s appspot cloud. So it’s live at:

http://arcanicity.appspot.com

It’s very simple. You enter some text, it processes it and gives you a rating for the amount of arcane content (Arcanicity) the text contains. A by-product of my text-processing routines is a mountain of related text statistics, so I decided to add those to the arcanicity.appspot.com site too.
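
For comparison, the classic Flesch Reading Ease score – one of the indexes mentioned above – is a simple formula over exactly these statistics (higher means easier to read):

def flesch_reading_ease(words, sentences, syllables):
    # Standard Flesch weights; the arguments are totals for the whole text
    return 206.835 - 1.015 * (words / float(sentences)) - 84.6 * (syllables / float(words))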

As I discussed here, I also discovered the JavaScript-based Google Visualisation libraries, which I will use as part of the ScrewTinny project. I wanted to get some experience with the Google routines, so for good measure I created visualisations to go along with the text statistics.

Google App Engine and NLTK

One of the interesting technical challenges involved getting the Python-based Natural Language Toolkit (NLTK) routines to work in Google’s App Engine. I had seen that it is notoriously difficult to get NLTK working with Google App Engine due to the way it recursively imports modules. But following some tips from the poster oakmad on this entry, I managed to get a small sub-section of the code working.

This discussion actually merits a separate blog entry where I can document the exact process I went through, and perhaps I will do that when I get time. But for the time being I’ll talk about the general approach. The way that I got the Punkt Sentence Tokenizer working was as follows.

I created a clean local Google App Engine instance and then copied in the pickled ‘english.pickle’ tokenizer object from the NLTK distribution. I un-pickled it and tried to use the resultant object’s tokenize method. This gave an error involving some supporting imports that hadn’t happened. I then fixed the import and tried again, repeating until I got no further errors. ‘Fixing the import’ involved copying the module folder tree structure that was being complained about (one folder at a time) from a pristine NLTK installation to the local Google App Engine instance. As oakmad says, creating empty __init__.py files was important so that the module didn’t go off and grab more than was needed. As I said, I should document this properly, and if anyone is interested let me know and I will.
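
The loading step itself is tiny – the hard part is making the copied module tree importable so the pickle can be reconstituted. A sketch, with an illustrative path:

import pickle

with open('english.pickle', 'rb') as f:   # copied out of the NLTK data distribution
    tokenizer = pickle.load(f)            # a PunktSentenceTokenizer instance

print(tokenizer.tokenize('First sentence. Second sentence.'))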

It has to be said, however, that I tried to use a similar technique so that I could use NLTK’s CMU pronunciation dictionary (CMUDICT). But it became very complex, very quickly, and as I’m not a real programmer I gave up. But I did get to use the cmudict routines on Google App Engine by building a separate data structure. I wanted to use the cmudict routines to allow me to count syllables accurately and, if I say so myself, my solution was quite ‘lateral’. That definitely does need a separate post and so I will do that when I get time.

Danny Goodall

Nothing’s ever easy – Google App Engine…

Well, at least nothing appears easy to me when selecting the deployment configuration for ScrewTinny – my Python-based competitive marketing intelligence app.

I had planned to deploy ScrewTinny to Google App Engine, and I’ve actually been very happy with how easy it is to get up and running with it on my Arcanicity Index project (more of that later). However, I found the process of generating HTML in a Python app akin to pulling my own teeth out. So I looked at templating techniques, where embedded code is replaced at run-time, which looked like it might be a bit more bearable. I read that Django templates are supported by Google App Engine, so I decided to take a look at Django.

Django features an object-relational mapper that I really like the idea of. It sits on top of a MySQL database and allows me to programmatically deal with objects while it handles the persistence and retrieval to and from the underlying SQL database. But then I read that whilst Django can be deployed to Google App Engine (and its views are supported natively), it appears that Google’s database strategy doesn’t allow the object mapper to work.
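
For context, this is the kind of thing the ORM gives you (a made-up model, purely for illustration):

from django.db import models

class Vendor(models.Model):
    # Hypothetical model: one row per high-tech vendor
    name = models.CharField(max_length=100)

vendor = Vendor(name='Acme')
vendor.save()                         # the ORM issues the INSERT behind the scenes
Vendor.objects.filter(name='Acme')    # becomes SELECT ... WHERE name = 'Acme'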

I can understand how something like Google App Engine isn’t going to provide a generic SQL database as it wouldn’t get near the scale that was required. But it’s frustrating to have to look elsewhere if I want to use Django’s ORM.

I was urged to look at a branch of Django called Django-nonrel that seems to run over non-SQL databases – including Google App Engine’s Bigtable – but I can’t take the risk that this would end up as a dead-end project (even though the people responsible suggest that Django-nonrel is going to make its way back into the main Django source code trunk).

So then I wondered whether I could run Django outside of the Google App Engine infrastructure, and asked my current host, the (so far excellent) ICDSoft. They told me that it was possible, but that they didn’t support the WSGI gateways that are needed. Neither, they tell me, do they support web2py (another alternative I thought about trying), as it needs to run a background process and my shared hosting plan does not allow this. I believe that it’s possible to run web2py on Google App Engine, but it only appears to have a Database Abstraction Layer, which leaves me mapping my objects to SQL tables and back again.

So where next?

Well I’m going to take a look at Amazon EC2. I’ve signed up for an account and in theory it looks like I can start a machine image, run Django – or anything I choose, and interact with it via Amazon SQS. So when I get a bit of time I’ll dive into that.
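
The SQS part at least looks straightforward with the boto library – a sketch, with a hypothetical queue name and AWS credentials assumed to be in the environment:

from boto.sqs.connection import SQSConnection
from boto.sqs.message import Message

conn = SQSConnection()                         # reads AWS credentials from the environment
queue = conn.create_queue('screwtinny-jobs')   # hypothetical queue name
queue.write(Message(body='analyse this press release'))

msg = queue.read()                             # on the EC2 side
if msg is not None:
    print(msg.get_body())
    queue.delete_message(msg)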

I wish I were a proper programmer. I’ll keep you updated.

Danny Goodall

Google Chart Tools, Hmm…

So my plans for how best to visualise the output of ScrewTinny have been changing recently.

I’ve looked at using Excel to create charts manually. I’ve looked at Python chart libraries. I’ve looked at Google Docs. But now I think I might have found a winner – Google Chart Tools.

I’ve been agonising about how to deliver the results of my NLP text analysis. At the moment I’ve put a bunch of effort into creating charts in .Net so that I can inject them automatically into PowerPoint using VSTO. It’s working fine, but it limits me from being able to use the generated charts easily for web consumption from within Python – my development language of choice.

It also limits me from easily getting the charts on the web, and this is perhaps the most restricting element of my design decision. I had initially thought that I would use MS Office-based documents (Word and PowerPoint) to publish my research. However, I now feel that with the richness of HTML 5 and technologies such as Google Docs, I should think again.

I’ve just watched this video which shows how easy it is to embed data visualisation into a web page.

Google Chart Tools seems to be a great way to embed charts into an HTML 5 page. It also raises questions about where to store the data (a Google Docs spreadsheet?) and other logistical issues, but that’s for another day.

I’m going to be putting my Arcanicity Project code up onto Google App Engine and rather than chuck out a vanilla HTML page with the results as I had planned, I think I will use some tables and charts from the Chart Tools library.

You can have a little interactive play with the Google Visualization library here in the Google Code Playground.

I’ll keep you posted.

Danny Goodall

Google App Engine and the Arcanicity Index

I’ve decided to investigate Google App Engine (GAE) in my spare time and I need a project to test it with.

So I’m going to try to produce an on-line version of my Arcanicity Index. It will be a very simple system, and because I don’t have a fag packet on which to sketch it out, I’ll list the specification below.

The user will be asked to enter some text and then click a button marked Process. DeepThought will then sit and ponder for seven and a half million years and respond with “42”. Either that, or it will provide the visitor with some text statistics and an Arcanicity Index estimate for the text they entered.

That should do it for a test system. I did produce a “hello world” app using GAE a long time ago so I know the principles but there are a few areas that I’m not sure about:

  • Security? Does Google protect the web server, the app, the code, etc?
  • Embedded? How can I link or embed the application in my own web page or will I have to send users to Google and hope they come back?
  • Cost? How much power will deep thought, sorry Google, provide me with free of charge? And what would be the cost of hosting it should it become moderately popular?
  • NLTK? I’m using the Natural Language Toolkit’s Punkt Tokenizer to separate the text into sentences and words but I know GAE doesn’t support NLTK out of the box.

Of those issues it’s NLTK that gives me the most concern. NLTK provides a much better (although not perfect) mechanism for detecting the end of sentences. Most other methods I’ve seen treat…

Mr. T. Brown said come A.S.A.P.

…as either 3, 4 or 5 sentences. So it’s important that I can use the NLTK tokenizer. I have read of some tricks to manually install NLTK, so I will probably start there. But I’m not really a proper programmer, so I might have to agree with the rest of the world that there are 5 sentences in the text above.
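
If the manual install works, using the tokenizer should be as simple as this (assuming the Punkt model data ends up where nltk.data can find it):

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Punkt knows about abbreviations, so this should come back as a single sentence
print(tokenizer.tokenize('Mr. T. Brown said come A.S.A.P.'))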

I’ll post about my exploits as I go.

Danny Goodall