Data quality: or why I hate Discogs
At least, that's the theory. In practice it turns out that Cory Doctorow was right: the data quality for the releases in the catalogue varies a lot. For some releases every tiny little detail has been described and there are clear pictures; for other releases you get a catalogue number and possibly the right country (if you're lucky!). Some of this bad data has been in there for many years and no one is fixing it, even though for some releases over 100 people have indicated that they have a copy. This is what I really hate about Discogs.
Discogs marketplace
This wouldn't be such a big problem (just annoying) if it weren't for the Discogs marketplace. Discogs isn't just a catalogue: it also has an associated marketplace where people buy and sell music items (and which apparently has taken away a lot of the market from eBay when it comes to vinyl records). Sellers are supposed to browse the catalogue for the item they want to sell and then indicate that they have a copy of that item for sale. If it isn't in the catalogue, they should add it first and then offer it for sale. Unfortunately, this is not what happens. Too often I see that a seller has listed a certain item for sale, but then the description reveals that it is a different item. For example, for a Peruvian item that I saw offered for sale, the seller said:
"my copy is from Argentina"
but the catalogue did not contain information about the pressing from Argentina. This is against the Discogs terms of service, which say:
"Discogs allows for the submission of all unique versions of a release to the database. This means that all items listed for sale in the Marketplace must correspond with the correct release in the database. If the correct release does not yet exist on Discogs it must be submitted to the database before it can be sold. Commenting on the differences between the item being sold and the one detailed on the Discogs release page is not permitted."
There are literally thousands and thousands of items for sale that do not correspond with the correct release in the database.
Discogs actually offers a way to flag this so they can take action, but punishing sellers is not really in their own interest: Discogs takes a cut of every sale that is made through Discogs. In fact, that is their business model! Every sale means money for Discogs, so I understand them. But for people who expect to buy a certain item and then don't get the item as advertised, or who miss out on an item because it was never added to the database, it can be very sour.
I don't blame the sellers: at the moment there are still lots of releases missing from Discogs, and adding a release is a lot of work if you want to do it right. If you have thousands of items for sale, it is very time-consuming to check the releases and add the ones that are missing.
Using data: or why I love Discogs
As I said, I really hate the data quality issues in Discogs, but at the same time the site has a lot of potential. There are a few ways to react to this:
- I could get angry
- I could ignore it and hope someone will fix the problems with the data
- I could fix the data
Maybe foolishly, I have decided to take that last option and to fix the data wherever I can. Luckily, and this is why I love them, Discogs makes this quite easy: every month the data of the website (apart from the pictures, user data and sales data) is made available as a set of XML files under the CC0 license, so you can basically do whatever you want with it.
Using some scripts and domain-specific knowledge it is quite trivial to flag where entries have problems. This is what this blog will be about. The data in the Discogs database has enormous potential; we just need to unlock it.
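To give a rough idea of what such a script could look like, here is a minimal sketch in Python that streams through a releases file and flags every release without a country. The element and attribute names (`release`, `country`, `id`) and the tiny sample document are assumptions made for illustration; the real dump files may be laid out differently, so treat this as a starting point rather than a finished tool.

```python
import io
import xml.etree.ElementTree as ET


def flag_releases_without_country(xml_source):
    """Yield the id of every <release> without a non-empty <country> child.

    The tag and attribute names are assumptions about the dump layout.
    """
    # iterparse reads the file incrementally, so a large dump does not
    # have to be parsed into one big tree before the checks can start.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "release":
            continue
        country = elem.findtext("country")
        if country is None or not country.strip():
            yield elem.get("id")
        # Drop the children of the release we just checked to limit memory use.
        elem.clear()


# Hypothetical sample data, only here to make the sketch runnable.
SAMPLE = b"""<releases>
  <release id="1"><title>A</title><country>Peru</country></release>
  <release id="2"><title>B</title></release>
  <release id="3"><title>C</title><country> </country></release>
</releases>"""

flagged = list(flag_releases_without_country(io.BytesIO(SAMPLE)))
print(flagged)  # ['2', '3']
```

The function accepts any file object, so if a dump is compressed it should be possible to pass in a handle from `gzip.open` instead of the in-memory sample. A missing country is only one simple check; the same loop could carry other domain-specific rules.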
The next few posts in this blog will be about the Discogs XML data and what can be done to discover errors in the data so they can be fixed.