PubChemRDF is Launched

Introducing PubChemRDF!

The PubChemRDF project encodes PubChem information using the Resource Description Framework (RDF). One aim of the project is to help researchers work with PubChem data on local computing resources using semantic web technologies. Another is to harness ontological frameworks to facilitate PubChem data sharing, analysis, and integration with resources external to the National Center for Biotechnology Information (NCBI) and across scientific domains.

What is RDF?

RDF stands for Resource Description Framework, a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks knowledge down into discrete, machine-readable pieces called “triples.” Each triple has the form “subject-predicate-object.” For example, in the phrase “atorvastatin may treat hypercholesterolemia,” the subject is “atorvastatin,” the predicate is “may treat,” and the object is “hypercholesterolemia.” RDF uses a Uniform Resource Identifier (URI) to name each part of the triple (the object may also be a literal value, such as a number or a string). A URI looks just like a typical web URL.
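To make the triple idea concrete, here is a minimal Python sketch of the atorvastatin statement as a subject-predicate-object record, written out as one line of N-Triples (a simple W3C text format with one triple per line). The `Triple` class and the `http://example.org/` URIs are made up for illustration; they are not the URIs PubChemRDF actually uses.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, and object, each named by a URI."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        """Serialize as one N-Triples line: three <URI> terms and a closing dot."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


# Hypothetical namespace, for illustration only (not real PubChemRDF URIs).
EX = "http://example.org/"

statement = Triple(
    subject=EX + "drug/atorvastatin",
    predicate=EX + "vocab/may_treat",
    obj=EX + "disease/hypercholesterolemia",
)

print(statement.to_ntriples())
```

Because every part is a URI, two datasets that use the same URI for atorvastatin are talking about the same thing, which is what makes merging RDF data from different sources straightforward.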

RDF is a core part of the semantic web standards. As an extension of the existing World Wide Web, the semantic web attempts to make it easier for users to find, share, and combine information. The semantic web builds on the following technologies: Extensible Markup Language (XML), which provides a syntax for RDF; Web Ontology Language (OWL), which extends the ability of RDF to encode information; Resource Description Framework (RDF), which expresses knowledge; and SPARQL, the RDF query language, which enables query and manipulation of RDF content.

How can PubChemRDF help your research?

PubChem users have frequently expressed interest in having a downloadable, schema-less database. PubChemRDF enables NoSQL-style access to, and querying of, PubChem data. Using PubChemRDF, one can download the desired RDF-formatted data files from the PubChem FTP site, import them into a triplestore, and query them through a SPARQL query interface. A number of open-source and commercial triplestores are available, such as Apache Jena TDB and OpenLink Virtuoso (a list of triplestores can be found here: http://en.wikipedia.org/wiki/Triplestore). Beyond triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, where graph traversal algorithms can be used to query the RDF graphs. Last but not least, the ontological representation of the PubChem knowledge base allows logical inference, such as forward or backward chaining.
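The following toy Python sketch illustrates, under heavy simplification, the two ideas in the paragraph above: querying triples by pattern (roughly what a single SPARQL triple pattern does inside a triplestore) and forward chaining (applying rules repeatedly until no new facts appear). It is not a triplestore and does not use SPARQL syntax. The `ex:` identifiers, the sample data, and the `match` and `forward_chain` helpers are hypothetical and are not taken from PubChemRDF.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data, for illustration only (not real PubChemRDF content).
TRIPLES: Set[Triple] = {
    ("ex:atorvastatin", "rdf:type", "ex:Statin"),
    ("ex:simvastatin", "rdf:type", "ex:Statin"),
    ("ex:Statin", "rdfs:subClassOf", "ex:LipidLoweringAgent"),
    ("ex:LipidLoweringAgent", "rdfs:subClassOf", "ex:Drug"),
}


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard (a variable)."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def forward_chain(triples: Iterable[Triple]) -> Set[Triple]:
    """Apply two RDFS-style rules until nothing new can be derived:
    1. A subClassOf B, B subClassOf C  =>  A subClassOf C
    2. x type A,       A subClassOf B  =>  x type B
    """
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for a, pred, b in inferred:
            if pred not in ("rdfs:subClassOf", "rdf:type"):
                continue
            # Follow every subClassOf edge leaving b and lift the statement to c.
            for _, _, c in match(inferred, s=b, p="rdfs:subClassOf"):
                new.add((a, pred, c))
        if new <= inferred:  # fixed point reached: no new facts
            return inferred
        inferred |= new


# Pattern query: "which things are typed as ex:Statin?"
statins = sorted(t[0] for t in match(TRIPLES, p="rdf:type", o="ex:Statin"))

# Nothing is explicitly typed as ex:Drug until inference is applied.
drugs_before = sorted(t[0] for t in match(TRIPLES, p="rdf:type", o="ex:Drug"))
closure = forward_chain(TRIPLES)
drugs_after = sorted(t[0] for t in match(closure, p="rdf:type", o="ex:Drug"))

print(statins)
print(drugs_before, drugs_after)
```

A real triplestore does the same kind of matching with indexes over very large numbers of triples, and an inference engine applies a much richer rule set, but the principle of deriving facts that were never stated explicitly is the same.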

The RDF data on the PubChem FTP site is arranged so that you only need to download the type of information in which you are interested, and can avoid downloading parts of PubChem data you will not use. For example, if you are interested only in computed chemical properties, you only need to download the PubChemRDF data in the compound descriptor subdomain. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.

Where can you learn more about this?

To get an overview of the PubChemRDF project, please view this presentation.  To learn more about detailed aspects of PubChemRDF and how to use it, please view this presentation. The PubChemRDF Release Notes provide additional technical information about the project.

Additional blog posts will follow on PubChemRDF project topics, including the FTP site layout, the RESTful interface, and ways to use PubChemRDF for research purposes, such as running SPARQL queries.