Fetching Android Market Stats with Python, MozRepl, and BeautifulSoup

A few weeks ago I was quite keen on the idea of gathering stats and creating charts to track the popularity of my Android apps. Alas, despite digging around in various packages and experimenting with cURL, I could never manage to log in programmatically to the Android Market Developer Console. So I gave up and went back to working on my next app. Now I’ve come up with another reason to do some screen-scraping, so I thought I’d give it another try.

Half the magic here belongs to a very cool Firefox plugin called MozRepl which lets you open a telnet connection to Firefox and interact with it via Javascript. Awesome, no?

All you have to do is ask MozRepl to go to the Developer Console, download the HTML, and run it through BeautifulSoup (the rest of the magic) to extract the data.
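
If you haven’t played with MozRepl before, a quick manual session shows the idea. Something like this (the output here is illustrative and abridged, not a real capture):

```
$ telnet localhost 4242
Connected to localhost.
repl> content.document.title
"Android Market"
repl> content.document.body.innerHTML
"<div>...lots of HTML...</div>"
```

Anything you can do to the page from Javascript, you can do over that socket.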

It turns out to be just slightly trickier than that, because MozRepl needs to talk to Python via Telnet. I suppose this script could be set up in cron to grab stats a couple of times each day; I think I’m just gonna run it manually every once in a while.
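
If you do go the cron route, an entry along these lines would do it. The schedule and paths here are placeholders, and DISPLAY has to point at a running X session because Firefox needs somewhere to open its window:

```
# fetch stats at 9am and 9pm; adjust paths to taste
0 9,21 * * * DISPLAY=:0 python /home/you/market_stats.py >> /home/you/market_stats.log 2>&1
```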

import BeautifulSoup, re, time
import os, telnetlib
# Install MozRepl Plugin
# http://wiki.github.com/bard/mozrepl
# Setup MozRepl to start automatically with FF, check that port number is 4242
# Login to Developer Console once manually so login credentials get saved
 
# Create a new profile and set this accordingly
# http://support.mozilla.com/en-US/kb/Managing+profiles
profile = 'my_firefox_profile'
 
# go to Developer Console using new profile
url = 'http://market.android.com/publish/Home'
os.system("firefox -no-remote -P %s %s &" % (profile, url))
time.sleep(5) # give Firefox a few seconds to start and load the page
 
#connect to MozRepl and fetch HTML
t = telnetlib.Telnet("localhost", 4242)
t.read_until("repl>")
t.write("content.document.body.innerHTML\n") # trailing newline makes the REPL evaluate
body = t.read_until("repl>")
t.close()
 
#is there a better way to do this?
os.system("killall -9 firefox")
 
#yank stats out of HTML
now = time.strftime("%Y-%m-%d %H:%M:%S")
soup = BeautifulSoup.BeautifulSoup(body)
table = soup.find("div", { "class" : "listingTable" })
for row in table.findAll('div', {'class':'listingRow'}):
  app = row.find("div", { "class" : "listingApp" })
  rating = row.find("div", { "class" : "listingRating" })
  stats = row.find("div", { "class" : "listingStats" })
  if app and rating and stats:
    name = app.next.next.string
    total = stats.next.string.split()[0]
    active = stats.next.nextSibling.string.split()[0]
    nratings = rating.next.string[1:-1]
    stars = len(rating.findAll(attrs={'style':re.compile("scroll -78px")}))
    print now, name, total, "total", active, "active", nratings, "ratings", stars, "stars"
#that's it, now maybe save these to a CSV or a log file..
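
To follow through on that last comment: appending each run to a CSV takes only a few extra lines. This is my own sketch, not part of the script above, and the file name and column order are arbitrary choices:

```python
import csv
import time

def append_stats(path, rows):
    """Append one timestamped line per app to a CSV log.

    rows is a list of (name, total, active, nratings, stars) tuples,
    i.e. the values printed by the scraper loop above.
    """
    now = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as f:
        writer = csv.writer(f)
        for name, total, active, nratings, stars in rows:
            writer.writerow([now, name, total, active, nratings, stars])
```

In the loop you’d collect the tuples into a list instead of printing, then call append_stats('market_stats.csv', rows) once at the end.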

I debated whether to show my actual numbers. Here you go, enjoy:

2009-04-03 17:45:15 Measure Stuff 4 total 1 active 2 ratings 1 stars
2009-04-03 17:45:15 Measure Stuff Lite 3006 total 995 active 28 ratings 2 stars
2009-04-03 17:45:15 RGB Probe 4 total 2 active 2 ratings 1 stars
2009-04-03 17:45:15 Thumb Maze 112 total 39 active 8 ratings 3 stars
2009-04-03 17:45:15 Thumb Maze Lite 16313 total 8813 active 172 ratings 3 stars

Uh oh, those numbers are not very good at all! So far my plan to live off Android looks doomed, but maybe things will pick up in the future. Two of the apps appear twice because there is a paid version and a free one. Can you tell which is which? =). Also, I think there is something wrong with RGB Probe. I’ve gotten a couple of e-mails saying the download failed.

So I hope folks will find this script useful. Obviously, use of this code is completely at your own risk. Screen scrapers are an arguably questionable enterprise, so don’t blame me if you hose your Firefox profile or Google gets mad at you.

Also, if anyone knows the cURL incantation that will do the same thing sans Firefox, I’d love to hear it. I kept getting a 302 response and never quite figured it out. I’ve tried several suggestions based on other Google services that ‘should work’, but for some reason don’t.
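
For the record, the generic curl pattern for a cookie-based login looks like the sketch below. The login URL and the Email/Passwd field names are lifted from the mechanize snippet in the comments; the real form action would need to be scraped from the login page first, so treat this as a shape, not a working recipe:

```
# placeholders throughout; Google may well still answer with a 302
curl -L -c cookies.txt \
  -d "Email=YOUR_EMAIL" -d "Passwd=YOUR_PASSWORD" \
  "https://www.google.com/accounts/ServiceLogin?service=androiddeveloper"
curl -L -b cookies.txt "http://market.android.com/publish/Home" > console.html
```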

There are certainly pros and cons to screen scraping through the browser; I’ll only point out two advantages. First, you get ‘real’ Javascript executed right in Firefox. With many of the big data sites being Ajax-heavy, simply fetching the HTML without executing the JS only gets you halfway there. Second, a browser-based scraper is harder to detect and block. A server could look for unusual or suspicious request patterns: a simple fetch via wget looks different to a server than a fetch with Firefox, and the differences go beyond the User-Agent. The CSS, images, Javascript, and so on will also be fetched in a particular order, and a server could flag anything odd in the order or timing with which resources are requested. Sound crazy? You’re right, it probably is, and I’m not sure anybody actually does this. In fact, it very possibly wouldn’t work well in practice; for one thing, it could trip up text-only browsers. But I think it is still within the realm of possibility.

Now for balance, two downsides. First, the browser needs a window to run in, which means it is kinda slow, hijacks your computer for a few seconds, and doesn’t really lend itself to parallelization. Second, tools like cURL and wget and the many language-specific HTTP libraries are practically standard, while this approach needs Firefox plus a plugin configured just so.


5 Responses to “Fetching Android Market Stats with Python, MozRepl, and BeautifulSoup”

  1. I have recently been attempting this using a Python module called mechanize. So far I have managed to authenticate with the marketplace server and grab the marketplace webpage contents; however, the actual stats are generated via a JSON POST request which I am having difficulty emulating. Feel free to try and get it working:

    import cookielib, urllib2
    from mechanize import Browser

    def gatherMarketplace():
        print "gathering marketplace stats for " + product.product_name
        ## Initialise cookies
        cookiejar = cookielib.LWPCookieJar()
        cookiejar = urllib2.HTTPCookieProcessor(cookiejar)
        opener = urllib2.build_opener(cookiejar)
        urllib2.install_opener(opener)

        # Create browser object
        br = Browser()
        br.set_handle_robots(False)
        response = br.open("https://www.google.com/accounts/ServiceLogin?service=androiddeveloper&passive=true&nui=1&continue=http%3A%2F%2Fmarket.android.com%3A80%2Fpublish&followup=http%3A%2F%2Fmarket.android.com%3A80%2Fpublish")
        responseBody = response.read()
        form = br.select_form(nr=0)
        br["Email"] = "gbjgandroid"
        br["Passwd"] = "********"
        response2 = br.submit()
        # At this point response2.read() will be your Android Market home page

    This is the post request that I am trying to emulate:

    5|0|4|http://market.android.com/publish/gwt/|20D26AAE72C63DE79F3EE065E818E545|com.google.wireless.android
    .vending.developer.shared.AppEditorService|getFullAssetInfosForUser|1|2|3|4|0|

  2. Mike says:

    nice, but wouldn’t that be easier with the iMacros Firefox addon?

  3. Will says:

    The free version of the iMacros Firefox addon doesn’t allow you to scrape webpages, so it isn’t an option for everyone.

    MozRepl works or at least it did for an AJAX-heavy website that seemed otherwise impossible to scrape with mechanize.

    There doesn’t seem to be a convenient way to check the ‘readyState’ of an AJAX request, so the automation can be flaky – I had several time.sleep’s interspersed in the code.
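
    A rough polling helper can cut down on the blind sleeps, at least for the initial page load. This is a hypothetical sketch, not from the script above; query() stands in for whatever sends "content.document.readyState" over the MozRepl connection and returns the reply:

```python
import time

def wait_until_ready(query, timeout=15, interval=0.5):
    """Poll query() until its reply contains 'complete' or we time out.

    query is any zero-argument callable returning the current
    document.readyState string. Returns True if the page finished
    loading within the timeout, False otherwise.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if "complete" in query():
            return True
        time.sleep(interval)
    return False
```

    Content filled in later by XMLHttpRequest still needs its own check, e.g. polling until the element you want actually shows up in the soup.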

  4. Henrik says:

    Hi there,

    You’ll have to excuse me, but I get this:

    os.system("firefox -no-remote -P %s &" % (profile, url))
    NameError: name 'profile' is not defined

    I don’t think I understand how to handle that profile?

    Henrik

  5. craiget says:

    Oops.. I may have mangled the script when copying it between blogs. The 2nd profile thing is not strictly necessary. I have two Google accounts, so I am using two profiles here so they don’t stomp on each other. You can probably drop the “-P profile” flag altogether without any problem. There’s more info about setting up a second profile here: http://support.mozilla.com/en-US/kb/Managing+profiles.

    There was another typo on that line too – both should be fixed now.
