Too many CDs! Going Digital with a Classical Collection 

Over the years, I've acquired too many CDs to store sensibly. The time has come to digitise the lot onto a hard drive, with the aim of ending up with uncluttered shelves and a small neat box containing my entire collection. However, to do this properly there are some challenges to be faced:

Classical Music Challenges

The digital world is not well set up for classical music. Two problems in particular need solving. One is the classification system of album/artist/song propagated by the pop music hegemony, which doesn't map onto classical content; another is the inability of many digital audio players to play tracks "gaplessly".

(As an aside, the huge impedance mismatch between classical content and a world set up for digital pop music is illustrated by Universal Classics' own audio shop. If you purchase an opera there, you need to buy and - with many clicks - acquire a licence for every single one of the "tracks" in that opera. Since operatic acts are usually split into dozens of tracks this is just intolerable.)

Ars Longa

At a rate of one disc a day it's going to take over 3 years to rip this collection from CD. It's a salutary thought that it'll take the same amount of time to listen to it as well!

Audio Fidelity

MP3 is audibly compromised (at least at higher compression settings), adding strange twangling and burbling noises to transients (plucked strings suffer in particular). Even if less-compressed MP3 sounds better, I'm uneasy about losing any information on these discs and so will prefer a lossless format. FLAC works well for me, and has the interesting feature of supporting the embedding of arbitrary metadata (of which, more later ...)


Having all this content on one hard drive makes it awfully vulnerable. How to back up? Will burning DVDs be safe, or will these suffer from deterioration in the same way that some of my CDs have? Will I need to box all the CDs and store them in the attic? Or will the 'net have evolved to the point where I'll be able to drag and drop files for a speedy transfer to some off-site location?

ISO SC34 meetings, Oslo 

To the beautiful city of Oslo to attend the first SC34 meeting of the year, and in particular to progress DSDL. SC34 suddenly has a lot of new P member countries ('participating member' countries) sending representatives, and I am interested to note the majority of their representatives are, as individuals, also Microsoft employees. However, thoughts of Office Open XML (OOXML) are far from my mind, and far from the agenda for this meeting, a point strongly made by our secretary who emphasised that OOXML was outside the bounds of what we were going to be discussing.

For the breakout sessions, in the (unusually busy) WG1 meeting it was a pleasure to have new member Mohamed Zergaoui in attendance, whose thoughtful and expert comments were extremely helpful in identifying and sorting out some of the gnarlier issues in DTLL, as it enters the home straight and its somewhat delayed FCD stage.

Unfortunately pressures of work (the same pressures which have held up my editing of DTLL) mean I return to the UK tomorrow - happy, however, in the knowledge that the DSDL technical work is progressing and gaining more traction. The interest from implementors and commentators in NVDL, for example, has been such that a number of esoteric (and some less esoteric) problems with the standard have come to light, necessitating production of a technical corrigendum.

Bulk Programmatic Conversion of Photos from flickr to SmugMug 

Having moved from flickr to SmugMug I was faced with the prospect of moving all my flickr photos across. A quick look at flickr reveals there is no easy way to get your photos back (not even a "Zip & Download" for galleries), so faced with the prospect of a weekend of picture-by-picture clicking I decided to get my stuff moved across in bulk programmatically. Here's how I did it ...

General Approach

I wanted to preserve the titles of each of the pictures, but I was happy to dump everything from flickr into a single 'import' gallery on SmugMug and then use SmugMug's civilised organising features to get everything into the galleries I wanted.

Looking at flickr's Terms of Use (which also points out that the terms change when moving to a Yahoo! Id) I notice that users "must not modify, adapt or hack" -- so thoughts of a grand adapting web app for flickr (with a 'copy to SmugMug' button) went by the wayside, and instead I opted for a command-line application.

As a programming language I chose Groovy, partly as learning a new programming language on the fly makes it interesting :-)


Groovy runs on the Java platform, but a couple of third-party libraries are helpful to keep things nice: the Jakarta Commons HttpClient provides an excellent higher-level abstraction for dealing with the network interaction, and Elliotte Rusty Harold's XOM is a lovely library for dealing with XML without too much syntactic cruft.

Logging In

First up, we need to get a couple of sessions going: one with flickr and one with SmugMug. Here's some code (I'm not posting the whole thing -- but am happy to share if anyone's interested. Oh, and I hope the variable names make it 'self documenting' ;-))

First, to log into flickr:
// Login to flickr by POSTing to the "old skool" form target
def post = new PostMethod( flickrLoginUrl )
post.addParameter( "email", flickrUsername )
post.addParameter( "password", flickrPassword )
def status = flickrClient.executeMethod( post )
println "Flickr login, status=$status"

And then, to SmugMug ...
// Login to SmugMug over SSL using their REST API
// (each + sits at the end of its line, as Groovy would otherwise
// treat a line starting with + as a new statement)
def smLoginUrl = this.smugMugApiUrlStub +
    "method=smugmug.login.withPassword" +
    "&APIKey=" + smugMugApiKey +
    "&EmailAddress=" + smugMugUsername +
    "&Password=" + smugMugPassword
def get = new GetMethod( smLoginUrl )
smClient.executeMethod( get )

It's interesting to note the difference of approach here. When using flickr, it appears the username and password are sent as plain text - yikes! SmugMug's REST API, OTOH, is exposed through an SSL layer.
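One thing the concatenated login URL glosses over is escaping: a password containing & or = would corrupt the query string. Here's a minimal sketch of percent-encoding the values with the JDK's URLEncoder. The buildQuery helper and the credentials are made up for illustration and are not part of my script.

```groovy
import java.net.URLEncoder

// Hypothetical helper (not in the original script): builds a query
// string from a map, percent-encoding each value as UTF-8.
def buildQuery( Map params ) {
    params.collect { key, value ->
        key + "=" + URLEncoder.encode( value.toString(), "UTF-8" )
    }.join( "&" )
}

// Made-up credentials, purely for illustration
println buildQuery( [
    method      : "smugmug.login.withPassword",
    EmailAddress: "someone@example.com",
    Password    : "p&ss=word"
] )
```

The resulting string could then be appended to the API URL stub in place of the hand-built parameters.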

Also note that, in best REST fashion, the login GET to SmugMug responds with an XML document. We can then use XOM to extract the SessionID from this response -- we'll need it later to interact with SmugMug.

// Build a XOM XML document from the returned bytes
def doc = new Builder( false ).
build( new ByteArrayInputStream( get.responseBody ) )

// Use XPath to get the SM SessionID
sessionId = doc.query( "/rsp/SessionID" ).get( 0 ).value
println "SmugMug session ID is " + sessionId
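For anyone without XOM to hand, roughly the same extraction can be sketched with the XML and XPath classes that ship with the JDK. This swaps in a different library from the one I used; the extractSessionId helper and the sample response are made up, and only the /rsp/SessionID path comes from the code above.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

// Hypothetical helper: pulls the session ID out of a login response
String extractSessionId( String xml ) {
    def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    def doc = builder.parse( new ByteArrayInputStream( xml.getBytes( "UTF-8" ) ) )
    def xpath = XPathFactory.newInstance().newXPath()
    // evaluate( String, Object ) returns the string value of the first match
    xpath.evaluate( "/rsp/SessionID", doc )
}

// Made-up response body; the real one carries more than this
println extractSessionId( "<rsp><SessionID>abc123</SessionID></rsp>" )
```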

Enumerating the flickr photos

The approach here is to enumerate the sets, and then for each set to enumerate the photos. From the "Your sets" page, I was hoping to get hold of the XML and pull out the URLs for each of the sets' own pages.

It was here I got a nasty shock - the flickr pages aren't XHTML. They're not even (according to the W3C Markup Validation Service) valid HTML!

So, time to observe, and scrape with a regexp:
/**
 * Given the bytes of a flickr "sets" page, this
 * iterates over each set ...
 */
static void processAllSets( byte[] page ) {
    String s = new String( page )

    def pat = /a class="Seta" href="([^"]+)" title="([^"]+)"/
    def matcher = ( s =~ pat )
    println "Found $matcher.count sets"

    // ..< excludes the upper bound, so a page with no sets is skipped cleanly
    for( index in 0 ..< matcher.count ) {
        def setUrl = matcher[ index ][ 1 ]
        def setTitle = matcher[ index ][ 2 ]
        println "Got set at $setUrl, entitled '$setTitle'. Processing ..."
        processSet( setUrl, setTitle )
    }
}

Note Groovy's nice syntax for handling regexps.
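To see that syntax in isolation, here's a small self-contained sketch run against a made-up fragment of a "sets" page (the real flickr markup may well differ). The slashy string saves escaping the quotes, =~ produces a java.util.regex.Matcher, and each hands the whole match plus the captured groups to the closure.

```groovy
// Made-up fragment of a "sets" page, for illustration only
def html = '''
<a class="Seta" href="/photos/someone/sets/1/" title="Holiday">
<a class="Seta" href="/photos/someone/sets/2/" title="Garden">
'''

def pat = /a class="Seta" href="([^"]+)" title="([^"]+)"/

// Collect one map per match: the closure receives the whole match,
// then each captured group in order
def sets = []
( html =~ pat ).each { whole, url, title ->
    sets << [ url: url, title: title ]
}

sets.each { println "Set '${it.title}' lives at ${it.url}" }
```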

Now we've got a URL for each set we can (using a regexp again) get the URLs of all the thumbnails on that set's page:
/**
 * Given a flickr set, this iterates over each of the photos in it
 */
static void processSet( String setUrl, String setTitle ) {
    def get = new GetMethod( "" + setUrl )
    def status = flickrClient.executeMethod( get )

    println "Got $setUrl, status $status"
    String s = new String( get.responseBody )

    // get the photo titles and thumbnail URLs
    def pat = /title="([^"]+)" class="thumb_link"[^>]+><img src="([^"]+)/
    def matcher = ( s =~ pat )
    println "Found $matcher.count photos in set"

    // ..< excludes the upper bound, so an empty set is skipped cleanly
    for( i in 0 ..< matcher.count ) {
        def photoTitle = matcher[ i ][ 1 ]
        def photoThumbUrl = matcher[ i ][ 2 ]
        println "Got thumbnail at $photoThumbUrl, for photo entitled '$photoTitle'. Processing ..."
        processPhoto( photoThumbUrl, photoTitle )
    }
}

With this we'll end up with the URL of each of our photo thumbnails, and its caption. We don't want to upload thumbnails to SmugMug, though, but our full-size original pictures. Luckily, flickr appears to follow a naming convention so that by changing the "_s.jpg" to "_o.jpg" in our URLs, we can synthesise the URL of the original photo.
/**
 * Given a flickr photo, this copies it to SmugMug
 */
static void processPhoto( String photoThumbUrl, String photoTitle ) {
    // swap the thumbnail suffix for the original-size suffix
    def photoUrl = photoThumbUrl.replace( "_s.jpg", "_o.jpg" )
    def get = new GetMethod( photoUrl )
    def status = this.flickrClient.executeMethod( get )
    byte[] raw = get.responseBody
    // ... the method body continues in the snippets that follow

This gives us (in the byte array called raw) our original image data. Next we have to generate an MD5 checksum, as this is required by the SmugMug upload mechanism. It's here that having Java on tap comes in very handy ...
    // needs: import java.security.MessageDigest
    def md = MessageDigest.getInstance( "MD5" )
    md.update( raw )
    def digestBytes = md.digest()
    // hex-encode: adding 0x100 forces three hex digits, substring( 1 ) keeps the last two
    def checksum = ""
    for( index in 0 ..< digestBytes.length ) {
        checksum += Integer.
            toString( ( digestBytes[ index ] & 0xff ) + 0x100, 16 ).
            substring( 1 )
    }
    println "Content length is: "+ raw.length
    println "MD5 is: " + checksum
    println "Starting POST ..."
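The hex-encoding loop can be squeezed down a little. Here's a hedged alternative sketch that leans on BigInteger instead; md5Hex is a name I've made up for illustration, and the sample input is just the standard "abc" test string rather than image data.

```groovy
import java.security.MessageDigest

// Hypothetical helper: MD5 of a byte array as 32 lowercase hex characters
String md5Hex( byte[] data ) {
    def digest = MessageDigest.getInstance( "MD5" ).digest( data )
    // signum 1 treats the bytes as an unsigned number;
    // padLeft restores any leading zeros that toString( 16 ) drops
    new BigInteger( 1, digest ).toString( 16 ).padLeft( 32, "0" )
}

println md5Hex( "abc".getBytes( "UTF-8" ) )
```

The padLeft call matters: without it a digest starting with a zero nibble would come out shorter than 32 characters.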

Finally, we're ready to do the upload. Again, the REST way allows us to do this by POSTing our raw image data to SmugMug with the headers correctly set:
    def put = new PostMethod( smugMugUploadUrl )
    put.addRequestHeader( "Content-Length", "" + raw.length )
    put.addRequestHeader( "Content-MD5", checksum )
    put.addRequestHeader( "X-Smug-SessionID", this.sessionId )
    put.addRequestHeader( "X-Smug-Version", "1.1.1" )
    put.addRequestHeader( "X-Smug-ResponseType", "REST" )
    put.addRequestHeader( "X-Smug-AlbumID", this.smugMugUploadGalleryId )
    put.addRequestHeader( "X-Smug-Caption", photoTitle )
    put.setRequestBody( new ByteArrayInputStream( raw ) )

    smClient.executeMethod( put )
    println "POST complete"
} // end of processPhoto

Et voilà! Automatic transfer of images from flickr to SmugMug.

The transfer process takes a while, as every image needs to get downloaded to, and then uploaded from, the client machine. It would be nice if SmugMug allowed pictures to be uploaded by URL (thereby bypassing the need to route the data through client machines with their measly domestic bandwidth) -- but maybe this opens up too much opportunity for abuse.


This does the trick, but for flickr users with larger collections (multi-page sets, etc), some more code will be required. It might well be worth investigating flickr's own API for a more robust approach ...

In general though, I wish flickr had provided a better way for getting photos back in bulk - it would have made life a lot easier.

I don't want no stinkin' Yahoo! Id 

flickr has told all its users that they will shortly need a Yahoo! Id in order to access the service.

I've had Yahoo! Ids in my time, and my memory of them is not good. What's more, having forked out for a flickr Pro account a while back, this unilateral change of flickr's terms of service leaves a bad taste.

So, over to SmugMug. That's better...

The challenge now is whether it's possible to automate the transfer of my galleries from flickr to SmugMug. SmugMug exposes a REST API for uploading, but the flickr side exposes nothing official for downloading. Hmmm - with a bit of screen scraping and jiggery-pokery, I wonder whether this is possible ...

Away from the screen, I took my daughter to a party at Wicken Fen and the light was just amazing, instantly making me regret that I had not brought my camera. So, remembering Ken Rockwell's advice that the camera doesn't matter, I got some snaps with a mobile phone. Here's one hosted (as it happens) on SmugMug. As to the equipment not mattering, hmmm -

ISO Meetings, Montréal

My arrival in Montreal was somewhat marred by the fact that British Airways managed to lose my luggage. Which on a direct flight from Heathrow takes some doing :-(

Fulminating at the BA desk elicited a £35 ex-gratia payment, but with no fresh clothes, or sponge bag content, the beginning of this trip is somewhat grungy.

Despite this, excellent progress was made at today's WG1 meeting, where Jeni Tennison was able to attend to go through the latest draft of DTLL. With all substantive points now settled, all that remains is for me to prepare a revised document and we are on track for a Final Committee Draft (FCD) text.

There are two significant changes to DTLL as compared to the last draft.

The first is that the 1:1 relationship between a DTLL document and a Namespace (for its declared datatypes) has been relaxed and brought into line with RELAX NG's more liberal approach. DTLL instances will now be able to declare a bunch of datatypes from different Namespaces.

The second is that when parsing values using regular expressions, DTLL processors no longer build a mini XML document behind the scenes, but instead merely a set of bound variables. This should make implementation somewhat simpler (though, having already done the work on this I felt - perhaps rather unreasonably - that this was a feature worth preserving).

During the lunch break I made a quick visit to a department store for fresh sets of clothes and toiletries, since the online baggage tracker revealed that my suitcase was still 'being traced'.

After lunch we discussed DSDL Parts 8, 7 and 9 -- and our view is that now all of these texts will be nearing their final form in or before September 2006. So it looks likely a January WG1 meeting will be necessary to resolve ballot comments received and move them towards the final stages of their standards status.