Mon, 15 Sep 2003
Introspecting pickles

A few days ago JWZ did a "fortune" program for xscreensaver that churns out the latest from the LiveJournal RSS feeds. Brilliant! I say, I've got to make my own too. Being lazy, and not wanting to do all the work, I take advantage of rawdog to do the dirty work of RSS aggregation for me, while I feed off its data like a vampire.

Thanks to Python and the way pickles work, I managed to get everything I needed without even prying open rawdog's source code (not that prying is needed, since rawdog is open source). Here's how it's done:

Rawdog keeps its data in $HOME/.rawdog/state, if you're running on Unix. The file gets big if you let rawdog run for a couple of days. I wasn't sure what the data format was, but I figured it might be a pickle, so here goes:

bash-2.05$ python 
Python 2.2.2 (#1, Dec 29 2002, 22:20:22) 
[GCC 2.96 20000731 (Linux-Mandrake 8.0 2.96-0.48mdk)] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import pickle 
>>> from pprint import pprint 
>>> data = pickle.load(file('state')) 
>>> data 
<rawdoglib.rawdog.Rawdog instance at 0x819a6c4> 
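
The guess-and-try step above can be wrapped in a small helper. This is only a sketch for a current Python 3, not something from rawdog: try_unpickle is a name made up for illustration, and the demo pickles a plain dict as a stand-in for the state file. Two things worth keeping in mind: unpickling an instance only works when its class can be imported (which is why rawdoglib had to be installed for the session above), and pickle.load can run arbitrary code, so only point it at files you trust.

```python
import os
import pickle
import tempfile


def try_unpickle(path):
    """Return the unpickled object, or None if the file cannot be loaded as a pickle."""
    try:
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    except Exception:
        # pickle raises many different exception types on bad input
        # (and open() fails on a missing file), so this probe treats
        # any failure as "not a usable pickle".
        return None


if __name__ == '__main__':
    tmpdir = tempfile.mkdtemp()
    good = os.path.join(tmpdir, 'state')      # hypothetical stand-in file
    bad = os.path.join(tmpdir, 'notes.txt')   # plain text, not a pickle
    with open(good, 'wb') as fp:
        pickle.dump({'feeds': {}, 'articles': {}}, fp)
    with open(bad, 'wb') as fp:
        fp.write(b'just some text\n')
    print(try_unpickle(good))
    print(try_unpickle(bad))
```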

Good, pickle managed to load the file without errors, and data now holds a rawdoglib.rawdog.Rawdog object. Using Python introspection, we can at least discover what its methods and attributes are:

>>> dir(data) 
['__doc__', '__init__', '__module__', '_modified', 
'articles', 'feeds', 'is_modified', 'list', 'modified', 
'update', 'write'] 

Interesting indeed. Looking at these, the items of interest would be feeds and articles.

>>> pprint(data.feeds) 
{'http://roughingit.wari.org/comments.xml': 
   <rawdoglib.rawdog.Feed instance at 0x82cf9f4>, 
 'http://roughingit.wari.org/index.xml': 
   <rawdoglib.rawdog.Feed instance at 0x82cf0d4>, 
 'http://www.advogato.org/rss/articles.xml': 
   <rawdoglib.rawdog.Feed instance at 0x82cd1ec>} 
 
>>> dir(data.feeds['http://roughingit.wari.org/index.xml']) 
['__doc__', '__init__', '__module__', 'etag', 
'get_html_link', 'get_html_name', 'last_update', 'link', 
'modified', 'period', 'title', 'update', 'url'] 
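
dir() lumps methods and data attributes into one list. A small helper can split the two, which makes an unfamiliar object quicker to read. This is a sketch for a current Python 3: describe and the stand-in Feed class below are invented for illustration and only borrow a few of the names seen above; they are not rawdog's real classes.

```python
class Feed:
    """Stand-in object (hypothetical, not rawdog's Feed class)."""
    def __init__(self, url):
        self.url = url
        self.title = None
        self.etag = None

    def update(self):
        pass


def describe(obj):
    """Split an object's non-dunder names into (data attributes, methods)."""
    data, methods = [], []
    for name in dir(obj):
        if name.startswith('__'):
            continue  # skip __doc__, __init__ and friends
        if callable(getattr(obj, name)):
            methods.append(name)
        else:
            data.append(name)
    return data, methods


data_attrs, method_names = describe(Feed('http://example.org/index.xml'))
print(data_attrs)     # ['etag', 'title', 'url']
print(method_names)   # ['update']
```

For instance data alone, vars(obj) gives the same information as a dict of names and values.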

From here, we know that feeds is a dict mapping each feed URL to a Feed instance, which holds useful data that we might need later. Let's look at data.articles now:

>>> len(data.articles) 
405 
>>> pprint(data.articles) 
{'febebd8faf7f483e2b86ad50fcf0a5f045920e7e': 
        <rawdoglib.rawdog.Article instance at 0x8280d94>, 
 'ff137f7f0d69c1251657c62a463fb04a2f04057e': 
        <rawdoglib.rawdog.Article instance at 0x825f71c>, 
 'ff9befad55e3651a33e7a3029d728caa0a4a8f68': 
        <rawdoglib.rawdog.Article instance at 0x82519a4>} 

There are over 400 of them, and I'm just showing the last three. So data.articles is a dict of Article instances, keyed by each article's hash. Let's look at the last one:

>>> art = data.articles['ff9befad55e3651a33e7a3029d728caa0a4a8f68'] 
>>> dir(art) 
['__doc__', '__init__', '__module__', 'added', 'can_expire', 
'description', 'feed', 'hash', 'last_seen', 'link', 'title'] 
 
>>> art.title 
'Feedback:"Welcome to my blog" by tjk on 1060785639.53' 
 
>>> art.description 
'Just started using pyblosxom and it fits my needs well. 
However, one think I cannot figure out is how you are 
creating your "read more" links in your blog files in such a 
way that in the overview page you only get the blurb before 
read-more while the actual permalink has all the text w/o 
the read-more link. Is there some documentation I am 
missing?' 
 
>>> art.added 
1063413726.152036 
 
>>> art.feed 
'http://roughingit.wari.org/comments.xml' 
 
>>> data.feeds[art.feed].title 
'RoughingIT recent comments' 

Nice, that's just about all we need for our screensaver. Our program is going to be very short indeed. What we still need are an HTML stripper and a text wrapper.

As for rawdog, I'm going to pull articles from it at random. I don't care whether the items are new or old; it's just to fill up the screensaver. random.choice gives me what I need to select the articles:

>>> from random import choice 
>>> choice(data.articles.items())[1].title 
'Goodbye Euro-centric, hello America-centric' 
>>> choice(data.articles.items())[1].title 
"S'pore delegation in Brunei for visit" 
>>> choice(data.articles.items())[1].title 
'3-D with animated gifs, who knew?' 
>>> choice(data.articles.items())[1].title 
'Their 22nd birthday was surely one to remember' 
>>> choice(data.articles.items())[1].title 
'Fame vs Fortune: Micropayments and Free Content' 
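
A note for anyone trying this on a current Python 3: there, items() and values() return views that random.choice cannot index, so they need a list() around them first. A minimal sketch, with made-up data standing in for data.articles and pick as a hypothetical helper name:

```python
import random

# Hypothetical stand-in for data.articles: hash -> title
articles = {
    'aaa': 'First title',
    'bbb': 'Second title',
    'ccc': 'Third title',
}


def pick(mapping):
    """Return one randomly chosen value from a dict."""
    return random.choice(list(mapping.values()))


print(pick(articles))
```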

The HTML stripper I'm using was found on comp.lang.python. It's a short piece of code that seems to do some magic, so it would take a while to explain, I guess. But it is important to strip out the HTML because the screensaver program does not read HTML. As for text wrapping, the textwrap module is a Python 2.3 thing only, but a backport for older Pythons is available.

So, in about 10-20 minutes, I was able to whip up a program that outputs this:
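
The stripper used here is built on sgmllib, which no longer exists in Python 3. Roughly the same strip-then-wrap idea can be sketched with html.parser from the standard library. This is an assumption-laden port for a current Python 3, not the comp.lang.python code itself; the sample text and the wrap width of 30 are made up for the demo.

```python
from html.parser import HTMLParser
from textwrap import wrap


class Stripper(HTMLParser):
    """Collect only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.chunks.append(' ')   # treat every tag as a word break

    def handle_endtag(self, tag):
        self.chunks.append(' ')

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Join the pieces, then collapse runs of whitespace
        return ' '.join(''.join(self.chunks).split())


def striphtml(html):
    """Return the plain text of an HTML fragment."""
    s = Stripper()
    s.feed(html)
    s.close()   # flush any text still buffered by the parser
    return s.text()


sample = '<p>Just started using <b>pyblosxom</b> and it fits my needs well.</p>'
for line in wrap(striphtml(sample), 30):
    print(line)
```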

bash-2.05$ ./get_story.py 
14 hours ago: 
From: STI Singapore 
Title: Should cops have done more to warn of teen rapist? 
Link: http://straitstimes.asia1.com.sg/storyprintfriendly/0,1887,209707,00.html? 
 
A TEENAGE rapist and molester was on the loose in 
the neighbourhood. He preyed on 11 girls, from 
eight years old to 12, over six weeks before he 
was caught. 

According to tripps, it's like Pointcast, only more pointless :)

The source can be downloaded @ http://roughingit.wari.org/sourcecode/get_story.py or you can view it here:

import random
import cPickle
import sgmllib
from textwrap import wrap
from time import time

# Change this to your actual rawdog statefile
statefile = '/home/wari/.rawdog/state'
# Depending on your screensaver, you might want to change this.
# 50 is good for a scale of 3 in the phosphor screensaver
wrapping_at = 50

class Stripper(sgmllib.SGMLParser):
    """
    Strips HTML

    An SGMLParser subclass that keeps the text and drops the tags
    """
    def __init__(self):
        self.data = []
        sgmllib.SGMLParser.__init__(self)
    # Every tag becomes a space so words on either side don't run together
    def unknown_starttag(self, tag, attrs): self.data.append(" ")
    def unknown_endtag(self, tag): self.data.append(" ")
    def handle_data(self, data): self.data.append(data)
    def gettext(self):
        text = "".join(self.data)
        return " ".join(text.split()) # normalize whitespace

def striphtml(text):
    """
    Uses the stripper to weed out html
    """
    s = Stripper()
    s.feed(text)
    return s.gettext()

if __name__ == '__main__':
    now = int(time())
    # Read rawdog's statefile
    fp = open(statefile, 'rb')
    data = cPickle.load(fp)
    fp.close()

    # Pick a random article hash, then look up the article
    item = data.articles[random.choice(list(data.articles))]

    delta = now - item.added
    unit = 'secs'

    # Damn, I'm lazy :) My sucky version of a fuzzy clock.
    # The checks are nested so a unit is only converted after the
    # previous conversion happened; otherwise 30 secs would pass
    # the "> 24" test and get reported as days.
    if delta > 60:
        delta = delta / 60
        unit = 'mins'
        if delta > 60:
            delta = delta / 60
            unit = 'hours'
            if delta > 24:
                delta = delta / 24
                unit = 'days'
    if int(delta) == 1:
        unit = unit[:-1] # Removes the (s) from the units

    print '%d %s ago:' % (delta, unit)
    print 'From:', data.feeds[item.feed].title
    print 'Title:', item.title
    print 'Link:', item.link
    if item.description:
        print
        print '\n'.join(wrap(striphtml(item.description), wrapping_at))
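
The fuzzy clock in the listing can also be written table-driven, so each unit is only tried after the previous conversion has happened. A sketch for a current Python 3; fuzzy_age is a name made up here, and unlike the listing it switches units as soon as a value reaches the step size rather than strictly above it.

```python
def fuzzy_age(seconds):
    """Turn an age in seconds into a rough string such as '14 hours'."""
    value, unit = int(seconds), 'sec'
    # Each step: once the value reaches `size`, divide by it and move on to `name`.
    for size, name in ((60, 'min'), (60, 'hour'), (24, 'day')):
        if value < size:
            break
        value, unit = value // size, name
    if value != 1:
        unit += 's'   # pluralize everything except exactly 1
    return '%d %s' % (value, unit)


for age in (30, 90, 3600, 50400, 200000):
    print(fuzzy_age(age), 'ago')
```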

#Introspecting pickles
Posted by MALIK ABDUL WAHAB at Tue Oct 21 23:15:22 2003

SIR I NEED A JOB
IM A POOR MAN
I HAVE ONE MOTHER AND FATHER
AND TWO SONS AND FOUR DAUGHTERS
#Introspecting pickles
Posted by amnesia at Tue Nov 4 20:13:15 2003

haha. what the... you have a blog spam?
#Introspecting pickles
Posted by Chris Davies at Thu Mar 11 02:09:39 2004

Since the phosphor screensaver is now a terminal emulator to all intents and purposes, it might be easier to just run links displaying your rawdog in phosphor :)


