The Global Hunger Index (GHI) is a tool adapted and further developed by the International Food Policy Research Institute (IFPRI) to comprehensively measure and track global hunger. It incorporates three interlinked hunger-related indicators - the proportion of undernourished people in the population, the prevalence of underweight in children, and the child mortality rate - and provides a single GHI value for most countries.
Update - October 2011: Data for the 2010 and 2011 Global Hunger Index publications is now available to browse as Linked Data, and through Linked Data APIs on the Kasabi Platform.
The data dump for the 2010 GHI Publication linked above, and on the Kasabi linked data platform, has been updated to adopt a new structure, and to fix some errors of country mapping in the earlier version published on this site.
A pilot project took place in 2010 to explore options for publishing the Global Hunger Index as Open Linked Data, using RDF standards. Publishing the GHI data as linked data allows it to become part of the growing web of linked data.
18th January 2011 - Initial RDF Models of GHI created for comment.
11th October 2011 - Revised models published on Kasabi Platform. SCOVO model and multi-measure data cube model deprecated.
Whereas in the Excel spreadsheet version of GHI values the semantics (meaning) of the data is contained within the layout, and in annotations presented visually (e.g. a symbol alongside a column to indicate that a value is an estimate; column headings spanning two columns), RDF representations of data should be 'machine-readable' and 'self-describing'. As machines cannot make sense of visual context, the context of the information must be encoded in the data itself. This involves choosing vocabularies and models to use in representing data. In this pilot project we have explored two possible models for representing statistical data.
In choosing how to model data a range of considerations are relevant, including:
In this pilot project two models were explored, based on two established vocabularies.
- The Statistical COre VOcabulary - http://purl.org/NET/scovo#
SCOVO is a simple model for recording statistics. The diagram below shows the basic SCOVO model, in which there are 'items' belonging to 'datasets' and which have 'dimensions' (e.g. Year; Country) etc.
Statistics are regarded conceptually as 'information about events in the world'. Values are recorded using the standard rdf:value property.
Dimensions must be resources (rather than strings of text).
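As a rough illustration of this model (the ex: namespace, the URIs and the value below are invented for the example, not taken from the published data; values are given as plain strings, as discussed under missing values), a single SCOVO item might be written in Turtle as:

```turtle
@prefix scovo: <http://purl.org/NET/scovo#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:    <http://example.org/ghi/> .   # hypothetical namespace

ex:ghi-2010 a scovo:Dataset ;
    rdfs:label "Global Hunger Index 2010" .

# Dimension values are resources, typed by sub-classes of scovo:Dimension
ex:Year a rdfs:Class ;
    rdfs:subClassOf scovo:Dimension .

ex:year-2010 a ex:Year .

ex:item-example a scovo:Item ;
    scovo:dataset   ex:ghi-2010 ;
    scovo:dimension ex:year-2010 ;        # the year dimension
    scovo:dimension ex:country-example ;  # the country dimension
    rdf:value "19.8" .                    # hypothetical value, given as a string
```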
The RDF Data Cube - http://purl.org/linked-data/cube#
The RDF Data Cube vocabulary emerged out of efforts to express statistics in a way that is compatible with the widely used Statistical Data and Metadata eXchange (SDMX) standard used by large international institutions including the World Bank, FAO, ILO and UN. Converters between SDMX and RDF Data Cubes are not yet available, but can be anticipated.
The diagram below shows an overview of the RDF Data Cube model.
The model requires data to have a 'Data Structure Definition' which describes the data in a dataset. Each item of data in a Cube may have one or more dimensions, and one or more measures (values). The ability to have multiple measures allows, in the GHI case, supporting values and the GHI to be kept together, rather than expressed as separate statistics.
Further detail on each of the models is accessible from the tabs above. However, converting the GHI into RDF raised a number of issues shared between the two models (although solved differently in some cases).
GHI values refer to countries. A range of providers offer country identifiers that can be referred to in a linked data dataset. Geonames.org is one of the best established, having been an early mover in providing an RDF interface for its data, and having an extensive database of places. However, Geonames.org identifies places in its linked data server using its own internal ID numbers, which have to be discovered before linkages can be made. The data it returns for a query is also limited unless the client accessing the data is aware of how to further query Geonames.
By contrast, the lesser known http://ontologi.es service provides URIs which return data based on ISO Country codes, themselves a widely adopted standard.
In both models, we have used Geonames as the primary URI to identify countries, but have included an rdfs:seeAlso statement to indicate the linkage between the Geonames identifier, and the ontologi.es identifier.
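For example, such a linkage might be expressed as follows (using Chad; the numeric Geonames ID shown is illustrative):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Geonames URI as the primary identifier, linked to the
# ISO-code-based ontologi.es URI for the same country
<http://sws.geonames.org/2434508/> rdfs:seeAlso <http://ontologi.es/place/TD> .
```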
An interesting test-case is likely to emerge in the coming year, in observing how different third-party URIs for countries handle identifiers for Southern Sudan - as ensuring the persistence of an identifier over time could be important for longer-term publishing of GHI and other IFPRI data.
The 2010 GHI is based on data collected between 2003 and 2008. The revised 1990 GHI is based on data from 1988 - 1992. We have already noted that
SCOVO treats statistics as 'events' - with the implicit assumption that a 2010 statistic is something 'true of the world in 2010'.
Insofar as the GHI is a conceptual construct, we can accept that the 2010 GHI is true of 2010, and indicate it as such. However, as most date/time standards need at least a month and day as well as a year, we have modelled the GHI as true of 1st October in each respective year (otherwise software which requires a month and day to visualise it is likely to assume, in the absence of these values, 1st January).
The supporting data is more complicated - as different country's data may be based on different years, but this information is not contained in the GHI spreadsheet. In SCOVO, this has been resolved by creating a dimension resource for each type of supporting data, and each date range, and then using the time ontology (http://www.w3.org/2006/time#) to specify the start and end of the date range. In the Data Cube, as we have included supporting data as additional measures to the GHI (rather than expressing them as distinct statistics) their temporal relevance is expressed only in textual comments.
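A SCOVO dimension resource of this kind, combining one type of supporting data with one date range, might look like the following sketch (the ex: namespace, class name and label are assumptions, not the published data):

```turtle
@prefix time: <http://www.w3.org/2006/time#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ghi/> .   # hypothetical namespace

# ex:ChildUnderweight is assumed to be a sub-class of scovo:Dimension
ex:underweight-1988-1992 a ex:ChildUnderweight ;
    rdfs:label "Prevalence of underweight in children, 1988-1992" ;
    time:hasBeginning [ a time:Instant ;
                        time:inXSDDateTime "1988-01-01T00:00:00"^^xsd:dateTime ] ;
    time:hasEnd       [ a time:Instant ;
                        time:inXSDDateTime "1992-12-31T00:00:00"^^xsd:dateTime ] .
```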
Missing values -
There are two types of missing value in the GHI dataset: values which are strictly missing (could not be calculated), and GHI values of less than 5, which are shown in the spreadsheet as "<5".
Systems expecting a number for each statistical value may struggle to read a file with the string "<5" where they expect a numerical value.
SCOVO has no established way of handling missing values, so they have not been addressed in the SCOVO model. The 'value' of each statistic has been given as a string, rather than a numerical data type, to prevent the creation of data which would fail validation checks.
As the RDF Data Cube inherits many of the features of SDMX it does have an established code for indicating missing data. To distinguish where data is missing because it could not be calculated, and where data is missing because the value is "<5" we created our own missing value codes, related back to the relevant SDMX code (using rdfs:subPropertyOf). In this case we were then able to omit values of "<5" and represent all GHI values as numerical data types.
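Following the rdfs:subPropertyOf approach described above, the custom codes might be declared along these lines (this is a sketch: the ghiqb namespace and code names are assumptions, as is the exact URI of the SDMX-RDF 'missing' status code):

```turtle
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-code: <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix ghiqb:     <http://example.org/ghiqb#> .   # hypothetical namespace

# Two custom codes, both related back to the generic SDMX 'missing' status
ghiqb:not-calculated rdfs:subPropertyOf sdmx-code:obsStatus-M ;
    rdfs:comment "Value could not be calculated." .

ghiqb:less-than-five rdfs:subPropertyOf sdmx-code:obsStatus-M ;
    rdfs:comment "GHI value below 5, shown as '<5' in the source spreadsheet." .
```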
The GHI spreadsheet indicates clearly which supporting data values are based on estimated figures, and it was important to carry this information
across into the RDF data.
As each statistic is represented separately in SCOVO we can simply add a dimension indicating whether a value is an estimate or not.
As we are using multi-measure RDF Data Cube observations, we cannot ascribe properties to individual measures within individual observations. Instead, we have used the 'slice' feature of the RDF Data Cube vocabulary, which allows annotations to be applied to an arbitrary set of observations. These slices contain textual annotations explaining the nature of the estimates, taking text from the GHI report.
Google Refine works with two-dimensional spreadsheets of data, and includes features for reshaping the data using an internal scripting language, and for fetching additional data from web services that return JSON.
A set of actions to reshape and structure data, and define a mapping between the data and RDF can be captured and exported to be run again - but the conversion of the data is still a manual process. Alternative processes involve writing scripts which take an authoritative data source and convert it 'on the fly' or periodically.
The SCOVO Vocabulary emerged from work to model time-series data from European, US and UN statistical datasets (Hausenblas et al. 2009). It provides a very simple structure for recording 'items' each of which has a single value, and a number of dimensions.
In standard SCOVO, dimensions are indicated by a scovo:dimension property, which must point to a resource. The rdf:type of that resource is given as a further resource which describes the dimension and is a sub-class of scovo:Dimension. The SCOVOLink vocabulary (Vrandecic et al. 2010) proposes replacing the standard scovo:dimension property with custom properties, but this potentially increases the complexity of the dataset, so it was not used here, in order to keep the rendering of GHI data simple.
Our SCOVO model contains four dimensions:
The value of each statistic is given by 'rdf:value'. For reasons outlined under 'General Issues', values are not typed.
Whilst the data is not currently available in a SPARQL endpoint, the query below gives an example of the sort of query that could be used to return a flat structure of GHI statistics.
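A query of this sort, sketched against the SCOVO modelling described above (the dimension class URIs are illustrative, not the published ones), might look like:

```sparql
PREFIX scovo: <http://purl.org/NET/scovo#>
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>

# Return each item with its country, year and value as one flat row
SELECT ?country ?year ?value WHERE {
  ?item a scovo:Item ;
        rdf:value ?value ;
        scovo:dimension ?c ;
        scovo:dimension ?y .
  ?c a <http://example.org/ghi/Country> ; rdfs:label ?country .
  ?y a <http://example.org/ghi/Year> ;    rdfs:label ?year .
}
```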
The following brief notes outline how Google Refine was used to generate SCOVO data from the GHI. More detailed notes on some of the processes used here are available alongside details of the creation of RDF Data Cube data.
As each SCOVO item will only contain one value, the first step is to flatten out the GHI Spreadsheet, so it contains only one value per row. This can be done by manually copying and pasting values to repeat country names, and adding a column to indicate which year is being referred to and which variable is being provided in that row. The first stage of this (at least ensuring there are separate rows for the 1990 and 2010 statistics for any given country) needs to be done before the data is imported into Google Refine. The rest can be completed in Refine.
Using the formula detailed for RDF DataCubes, we can fetch Geonames IDs and Country Codes for each country.
As year ranges are given in the dataset in abbreviated forms such as "1990-92" we need to expand out a start year and end year column that will make it easier to create date ranges.
We also need to generate columns with clear text values for whether or not a value is an estimate, and with convenient shortnames for each statistic variable.
Using the 'Edit RDF Skeleton' option we can then generate our data model, consisting of an scovo:Item for each row. The full schema Skeleton is shown below.
The final SCOVO file can be exported from Google Refine from the 'Export Menu'. An N3 version was generated using the cwm command line tool.
The Google Refine project file for conversion of GHI data to SCOVO is available here (tar.gz) and the exported list of instructions used to convert the data is available here (json). The flattened data spreadsheet is available here.
The RDF Data Cube vocabulary emerged out of efforts to express more complex statistics in RDF, notably statistical data currently stored and exchanged using the XML based 'Statistical Data and Metadata eXchange' standard SDMX.
The RDF Data Cube model introduces:
[Note: (Vrandecic et al. 2010) describe a way of using SCOVOLink and OpenMath to express, for each value in a dataset, how it was generated. For example, showing for each observation that GHI = (PUN + CUW + CM)/3, as explained in the GHI report. This method is transferable to the Data Cube model, but has not been employed as no direct use-case was envisaged at this stage.]
1) Data Model
We first need to sketch out a data model. The diagram below is a non-exhaustive sketch of how observations, datasets, data structure definitions and slices are used to model each country and year's observations, as well as to capture annotations. It also highlights some of the areas where questions about the best modeling remain.
Whilst we could build all of our model using the Google Refine RDF add-on, this is likely to get complicated, so instead the data structure definition and other general information can be written by hand in N3 (or using any other RDF authoring tool). The resulting ghiqb.rdf file can be fed to Google Refine as the URI of a new prefix, allowing us to use all the properties specified in this file.
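As a sketch, such a hand-written data structure definition might look like the following (the ghiqb namespace shown, and the property names refArea, refPeriod and ghi, are assumptions for illustration):

```turtle
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ghiqb: <http://example.org/ghiqb#> .   # hypothetical namespace

# A data structure definition declaring the dimensions and measures of the dataset
ghiqb:GHI-Report-dsd a qb:DataStructureDefinition ;
    rdfs:label "Data structure definition for the Global Hunger Index" ;
    qb:component [ qb:dimension ghiqb:refArea ] ,
                 [ qb:dimension ghiqb:refPeriod ] ,
                 [ qb:measure   ghiqb:ghi ] .
```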
(Useful tools: The prefix.cc website provides useful shortcuts for writing RDF by hand by giving access to a list of common prefixes in common formats. Visiting http://prefix.cc/rdfs,rdf,qb,foaf,dcterms.n3 provides N3 prefix statements for all the RDF vocabularies in the URL. Changing .n3 to .sparql, or adding and removing RDF vocabulary prefixes, shows how it gives quick access to commonly needed information.)
The extract below shows some of the ghiqb.rdf file.
As this file is currently in n3, and Google Refine prefers TTL or RDF/XML, we use the command line CWM tool to convert the file (and validate it in the process).
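A conversion command along these lines might be used (the exact invocation depends on the cwm version installed; treat this as a sketch):

```
cwm ghiqb.n3 --rdf > ghiqb.rdf
```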
Ideally this file would then be published at its final URL to allow us to provide Google Refine with a new and correct prefix and have it fetch the properties in our file.
[Workaround: As during development we do not have access to the IFPRI server we have hosted the ghiqb.rdf file on a temporary server and we will have to run a search and replace on the generated data later to ensure our prefixes point to the correct URIs where data will in fact be hosted.]
2) Preparing the data
We are going to generate our RDF using Google Refine again. Instead of working from a hand-flattened spreadsheet, we will work from the original ghi2010-annexe-table.xls file available from http://www.ifpri.org/publication/2010-global-hunger-index
Because we are working with multi-measure observations we only need to carry out basic refactoring of the ghi2010-annexe-table spreadsheet in Google Refine, renaming each column for simplicity of future access, and removing the first row (which contains additional descriptive information).
Our new list of column names is:
The ghi2010-annexe-table does not include Country Codes (in the SCOVO example we used The Guardian's spreadsheet which had these added), so we need to fetch both ISO Country Codes and GeoNames ID numbers for each country in our dataset. To do this we use the Google Refine 'Add column by fetching URLs' feature to fetch data from the GeoNames API.
Working against the 'Country' column we fetch data for each row into a new column called CountryData from the URL:
This adds a column with JSON data by fetching Political Entities matching the country name given from the GeoNames API. We can then extract fields from this data using 'Add column based on this column…' and using the parseJson function. We use this twice, first with the formula:
to generate a column called 'GeonamesID', and secondly with the formula:
to generate a column called 'CountryCode'.
Note: we are trusting that GeoNames has returned the best possible identifier for each country, and have not carried out manual checking. One error that came to light during exploration is the resolution of 'Congo, Rep.', which the GeoNames resolution identified as the DRC. This has been manually corrected.
3) Modeling the data
With our renamed columns, geonames and country code data we can now build the RDF model.
Firstly, we need to add a number of key prefixes to our data:
And our custom vocabulary created earlier, ghiqb, from its temporary URL:
Then we need to set the BaseURI. As before we'll suppose that our RDF will be published in a single file on the IFPRI Site, and so will set our BaseURI to http://data.ifpri.org/rdf/ghi/2010/qb/#
Two observations at once
As we have not restructured our input spreadsheet to have one row per year per country, but still want one observation per year per country, we will be creating at least two root nodes - one for 1990 and one for 2010 observations.
Using the CountryCode as the main value of the (row index) URI, we use the expressions
respectively for the two root nodes, one for each year of data represented.
We can set the type of these root nodes to be qb:Observation.
We can then start adding properties (removing any existing ones). We have two dimensions to add:
We now add our measures to each observation.
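Put together, one of the two multi-measure observations generated from a single spreadsheet row might look like the following sketch (the property names, the Geonames ID and all values here are illustrative assumptions, not the published data):

```turtle
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix ghiqb: <http://example.org/ghiqb#> .   # hypothetical namespace

<#TD-2010> a qb:Observation ;
    qb:dataset <#2010-GHI-Report> ;
    ghiqb:refArea <http://sws.geonames.org/2434508/> ;  # country dimension
    ghiqb:refPeriod "2010-10-01"^^xsd:date ;            # year dimension (1st October, per the note on dates)
    ghiqb:ghi "20.0"^^xsd:decimal ;                     # GHI measure
    ghiqb:childUnderweight "25.0"^^xsd:decimal .        # one of the supporting measures
```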
We need to attach our observations to a data cube dataset, and to attach that dataset to the data structure definition that was handwritten earlier.
We will define the dataset inside the ghi2010qb.rdf file, so create a property for our observations of qb:dataset and add a constant value of '2010-GHI-Report' with type 'URI'.
We need to do this for both our observations, but we only need to add additional meta-data about the dataset for one.
The most important bit of meta-data is to link it to the data structure definition in ghiqb.rdf by adding a qb:structure property pointing to the temporary location of this file at http://practicalparticipation.dyndns.org/ghiqb.rdf#GHI-Report-dsd
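The resulting dataset description might look something like this sketch (the dataset URI fragment and label follow the conventions described above; the DSD URL is the temporary one):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<#2010-GHI-Report> a qb:DataSet ;
    rdfs:label "2010 Global Hunger Index" ;
    qb:structure <http://practicalparticipation.dyndns.org/ghiqb.rdf#GHI-Report-dsd> .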
It is useful to add further annotations to the dataset at this point too, including labels and core meta-data.
In SCOVO we included a dimension for the level of accuracy of each statistic (based on the existence of an estimation symbol in the original table). A limitation of the multi-measure data cube model is that we cannot easily indicate which of our different measures is estimated.
We have therefore taken advantage of another feature of the data cube, slices, which allow annotation of an arbitrary set of observations with some custom notes. In our hand written ghiqb.rdf file we have specified two annotation slices:
We can add observations to these slices using qb:observation. A data cube aware tool could then query for any observation to see if there are slices that refer to it.
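A slice of this kind might be sketched as follows (the slice name, comment text and observation URIs are illustrative, not taken from the published data):

```turtle
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ghiqb: <http://example.org/ghiqb#> .   # hypothetical namespace

# An annotation slice gathering observations whose underweight values are estimates
ghiqb:estimated-underweight-slice a qb:Slice ;
    rdfs:comment "Underweight values for these observations are based on estimates." ;
    qb:observation <#TD-2010> , <#ML-1990> .
```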
Now that our slices are created, and will have URIs of the form:
we can add extra root nodes to our RDF Schema Template on Google Refine and use a conditional statement to have all those rows with a * in the relevant estimate column added to these slices.
To do this we:
If the row is estimated, then the relevant observation URI will be added to the slice. If not, ghi2010qb.rdf#null will be added to the slice
This leads to the non-ideal situation where each slice contains the empty ghi2010qb.rdf#null, but this can be manually removed later if required.
We repeat the process for each slice we want to generate and add values to.
We should also look at adding a license specification to the document to make clear what forms of re-use are permitted. Many in the open data community advocate specifying a public domain license to allow maximal re-use of data and combination with other datasets.
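For instance, a license statement could be attached to the dataset with a single triple (the license chosen here, the Open Data Commons PDDL, is one public domain option, not necessarily the one IFPRI would adopt):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

<#2010-GHI-Report> dcterms:license <http://www.opendatacommons.org/licenses/pddl/1.0/> .
```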
We can now export our completed file using the Export -> RDF as RDF/XML option in Google Refine. We need to search-and-replace the temporary URL of our DSD file with its final URL before using cwm to convert our file to N3. A round-trip from Google Refine's RDF/XML to N3 and back again also seems to generate tidier RDF/XML.
The Google Refine project file for conversion of GHI data to RDF Data Cube is available here (tar.gz) and the exported list of instructions used to convert the data is available here (json). The original GHI spreadsheet is available here.
The modelled GHI data has been configured to be deployed on a plain old web-server using single files with fragment identifiers to deliver RDF/XML and N3. In future, SPARQL endpoints or other publishing options may be provided (e.g. improved URL structure to allow fetching of single observations rather than the full file).
To remove the need for additional server configuration, a simple folder structure has been adopted, with PHP used to redirect requests for a folder to the correct file. N3 files have been generated using cwm.
There are a number of known limitations to the current version of the RDF modelling.
For more information on the Global Hunger Index please see the IFPRI website.
For enquiries around this linked data demonstrator and IFPRI work on linked data, contact Chris Addison (C.Addison@cgiar.org)
The IKM Linked Info For Development discussion group (dgroups) provides a space for discussion of practical and theoretical issues relating to the use of linked data in development. The group was established following a workshop on Linked Data for Development which is detailed on the IKM Website.