Remote Access Infrastructure for Register Data

RAIRD White Paper - Detailed Description

Version 0.5 - 2015-12-15

RAIRD - Remote Access Infrastructure for Register Data

Statistics Norway (SN) and NSD - Norwegian Centre for Research Data are in the process of establishing a national research infrastructure providing easy access to large amounts of rich high-quality statistical data for scientific research while at the same time managing statistical confidentiality and protecting the integrity of the data subjects.

RAIRD Overview

Figure 1 - Overview of the RAIRD project

The purpose of RAIRD

RAIRD is a platform for research on data from Norwegian administrative registers. The intended content will be based on the copies of these data that Statistics Norway has developed for statistical purposes. Such data are of great value for research, but access is restricted, in particular because of the confidentiality problems that direct data access raises.

The main ambition of RAIRD is to significantly reduce the barriers to conducting safe research on register data, in order to increase the volume and quality of research on such data. Unlike most other solutions of its kind, RAIRD is not building a traditional Remote Access system where users typically see and work with prepared data sets through a Remote Desktop-based solution.

Instead the focus is to build a sophisticated and highly interactive version of Remote Execution, where users interact with data through a metadata-rich and powerful web application that supports script-based data transformations and analysis - but that prevents direct and potentially disclosive access to microdata.

RAIRD - the research platform

Register data contain some status data, but consist mainly of event histories. They therefore have an inherent temporal component that traditional statistical packages and supporting metadata resources have not been able to handle sufficiently well.

Through the RAIRD project, NSD and SN have developed generic event-oriented data and metadata models. A set of novel and cooperating software components simplifies interaction with and management of complex data sets for researchers. The RAIRD solution has three main components - the RAIRD DataStore (RDS), the RAIRD Remote Storage and Statistical Execution Platform (RSSP) and the RAIRD Online Statistical Environment (ROSE).

Technology development is not the primary goal of the project, but a means to an end. The technological landscape is changing rapidly. Advances in hardware capacity and platforms and programming paradigms now allow for solutions that were unimaginable only a few years back.

RAIRD takes advantage of state-of-the-art technology and uses it to mitigate or remove weaknesses in currently used data production and dissemination systems. Furthermore, it creates an extensible platform that enables adding new functionality and supporting new requirements in a cost-effective manner. In addition to supporting functional requirements, the platform is designed for scalability to handle variable and potentially high workloads.

The RAIRD DataStore

The contents of a RAIRD DataStore (RDS) build on the ingest and quality appraisal processes carried out by Statistics Norway. SN receives the original data from a variety of data owners/producers, potentially in a variety of formats, and then quality checks and integrates the data for its statistical production processes. After a further documentation and restructuring process, data and metadata are ingested into the RDS. The RDS exists in a secure area at Statistics Norway and cannot be accessed directly by researchers.

When the user is authorized to access RAIRD data, a user workspace is created within the same secure area. The user operates within her workspace and is allowed to transfer data between the DataStore and the workspace to build and prepare her own analytical data sets. The workspace is part of the RAIRD Remote Storage and Statistical Execution Platform (RSSP), and will be covered in further detail in a separate section on the RSSP below.

Handling complex data with full flexibility

A fundamental property of the RDS is its support for handling and documentation of temporal data (i.e. event-data/spell-data) where there are multiple observations for each subject with regular or irregular frequency.

Figure 2 - Marital status over time for three subjects

Figure 2 illustrates an example of a variable (Marital status) with a temporal component.

The RDS stores and represents data in a manner optimized for flexibility in terms of how data can be queried/extracted/retrieved by consuming applications. An important benefit is that it allows data and metadata to be retrieved by sending declarative queries to the RDS.

The queries declare the variables of interest and, if a variable has a time component, state whether it is requested for a single time point or for a time period. Figure 3 below illustrates the logic of such time-based variable queries. A query that specifies a single time point will extract the state of the population at the given time point. A query with a time period will extract the subset of the events that fall within its limits, including the temporal data necessary to distinguish the events in analysis.

Figure 3 - Two types of queries with single time point and time period
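
The two query types can be sketched in Python as follows. This is a minimal illustration only, not the RDS query interface: the Event record, the function names state_at and events_within, the sample data and the half-open interval convention are all assumptions made for the example. In particular, the period query assumes that an event falls within the period if its validity interval overlaps it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One spell: `value` holds for `subject` from `start` up to (not including) `end`."""
    subject: str
    value: str
    start: date
    end: Optional[date]  # None means the spell is still ongoing


# Hypothetical marital status histories for two subjects.
EVENTS = [
    Event("A", "unmarried", date(1990, 1, 1), date(2001, 6, 1)),
    Event("A", "married", date(2001, 6, 1), None),
    Event("B", "unmarried", date(1985, 1, 1), date(1995, 3, 1)),
    Event("B", "married", date(1995, 3, 1), date(2004, 9, 1)),
    Event("B", "divorced", date(2004, 9, 1), None),
]


def state_at(events, when):
    """Single time point query: the value valid for each subject at `when`."""
    return {
        e.subject: e.value
        for e in events
        if e.start <= when and (e.end is None or when < e.end)
    }


def events_within(events, period_start, period_end):
    """Time period query: events overlapping [period_start, period_end).

    The events are returned with their start and end dates so that they
    can be told apart in later analysis.
    """
    return [
        e for e in events
        if e.start < period_end and (e.end is None or e.end > period_start)
    ]


snapshot = state_at(EVENTS, date(2000, 1, 1))
history = events_within(EVENTS, date(2000, 1, 1), date(2005, 1, 1))
```

The time point query returns one value per subject, while the period query returns whole events, which is why the latter has to carry the temporal information along.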

The illustrations above show examples of sequential events, where a given subject (e.g. a person) can only have one valid state at a time, i.e. the variable values are mutually exclusive. However, not all person-related events are sequential. A given person may, for example, hold multiple jobs at a time, so that similar events overlap to some degree. In the RDS, this phenomenon is handled by an “event-normalization” where event data are broken down into structures that are sequential with mutually exclusive states.

Figure 4 below illustrates how concurrent jobs for different subjects may be structured.

Figure 4 - Illustration of the representation of concurrent jobs for different subjects.

Each job (defined as a uniquely identified time-bound relationship between a person and a company) is represented as a unit of its own with state progression over time. (The relationship between a job and its associated company is omitted from the illustration for clarity.)
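
The idea of treating each job as its own unit can be sketched as follows. This is a minimal sketch under stated assumptions, not the RDS data model: the record layout, the job identifiers and the function names normalize and is_sequential are hypothetical. It shows that spells which overlap at the person level become sequential once they are grouped per job.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw job records: (person, job_id, state, start, end).
# Person P1 holds two jobs whose periods overlap.
RAW = [
    ("P1", "J1", "full-time", date(2010, 1, 1), date(2012, 1, 1)),
    ("P1", "J1", "part-time", date(2012, 1, 1), date(2014, 1, 1)),
    ("P1", "J2", "part-time", date(2011, 6, 1), date(2013, 6, 1)),
    ("P2", "J3", "full-time", date(2009, 1, 1), date(2015, 1, 1)),
]


def normalize(records):
    """Group records by job so each job becomes a unit with its own timeline."""
    units = defaultdict(list)
    for person, job_id, state, start, end in records:
        units[job_id].append((start, end, state, person))
    for spells in units.values():
        spells.sort()  # order each unit's spells by start date
    return dict(units)


def is_sequential(spells):
    """True if no spell starts before the previous one has ended."""
    return all(prev[1] <= nxt[0] for prev, nxt in zip(spells, spells[1:]))


def person_is_sequential(records, person):
    """Check the same property at the person level, before normalization."""
    spells = sorted((s, e, st, p) for p, _, st, s, e in records if p == person)
    return is_sequential(spells)


units = normalize(RAW)
```

In this example P1's spells overlap when viewed per person, while every per-job unit passes the is_sequential check.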

Every single data point within the RDS is contextualized with temporal information as well as associations to related units (e.g. individuals make up families, work for companies, etc). The query outputs will contain all this and other information needed to manage, merge and restructure the extracted data with confidence. The actual data management and merging takes place in the RSSP covered further down.

Bridging the “Metadata Gap”

For historical reasons, the statistical packages and database systems commonly used to store data sets do not handle metadata (data documentation, definitions, codes and classifications) sufficiently well. This is illustrated in Figure 5 below. As a consequence, it becomes difficult to maintain integrity between data and metadata, and metadata typically need to be recreated manually and frequently. The lack of integration also makes it difficult to create truly “metadata-driven”, interactive and informative solutions and interfaces.

In the RAIRD DataStore, metadata and data are fully integrated. Applications and systems that communicate with the RDS can rely on metadata to navigate the information space, and to formulate extraction procedures, data transformations as well as holistic quality assertions.

Figure 5 - The Metadata Gap and the RAIRD DataStore

Data/metadata integration mitigates the need for frequent out-of-band recreation of metadata, something that can be both time-consuming and error-prone. Quality and validity of metadata in the RDS may at any time be assessed through automated processes.

Multilingual Metadata

Full internationalization (multilingual support) is built in as part of the foundation of the RDS. All strings and texts (labels, descriptions, definitions, explanations, concepts, keywords, etc) can therefore be expressed in multiple languages with no additional complexity.
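
One simple way to picture a multilingual text element is a mapping from language code to text. The sketch below is an illustration under assumptions, not the RDS metadata model: the class name MultilingualText, the language codes and the fallback-to-English policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MultilingualText:
    """A text element stored once per language, keyed by language code."""
    texts: dict = field(default_factory=dict)

    def get(self, lang, fallback="en"):
        """Return the text in `lang`, else in `fallback` (an assumed policy)."""
        if lang in self.texts:
            return self.texts[lang]
        return self.texts.get(fallback)


# Hypothetical variable label in English and Norwegian Bokmål.
label = MultilingualText({"en": "Marital status", "nb": "Sivilstand"})
```

Because every label, description and definition has the same shape, adding a language means adding entries rather than changing the structure.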

A Standards Based Approach

Standardization of metadata has been on the agenda for many years both in the community of statistical institutes and in the community of data archives. Important outputs include the General Statistical Information Model (GSIM) and the Data Documentation Initiative-standard (DDI).

GSIM and DDI are themselves fairly well-aligned, and the RAIRD DataStore is aligned with both. The development of RDS is done in ongoing close communication with both the GSIM- and the DDI-communities.

The rationale for a standards-based approach is many-faceted. Standards allow for common terminology and interoperability between and within institutions, and over time this will increase quality, consistency and comparability in data.

The use of standards doesn’t necessarily pay off within the boundaries of one single project. RAIRD has however benefited greatly from taking a standards-based approach from an early stage. Building on the domain knowledge baked into the aforementioned standards, the data- and metadata models in RAIRD are clear, succinct and semantically sound - and they allow for fine-grained and holistic integration between data and metadata.

Complete revision control of data and metadata

Fine-grained revision control of data and metadata elements is a fundamental part of the RDS. This allows data curators to edit data and metadata without affecting the reproducibility of research outputs and other data applications. Current and historic revisions are equally accessible to both data curators and data consumers. Complete audit trails are easily generated.

The revision control is enabled by the use of immutable/append-only data platforms combined with standards-based metadata models.
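
How an append-only structure gives revision control can be shown with a small sketch. It is not the RDS implementation: the class AppendOnlyStore, its method names and the example key are assumptions made for illustration. Edits are appended rather than applied in place, so any earlier revision can still be read.

```python
class AppendOnlyStore:
    """Append-only log of (key, value) writes; nothing is updated in place."""

    def __init__(self):
        self._log = []  # list of (key, value); index + 1 is the revision number

    def put(self, key, value):
        """Record a new value for `key` and return the new revision number."""
        self._log.append((key, value))
        return len(self._log)

    def get(self, key, revision=None):
        """Value of `key` as of `revision` (default: the latest revision)."""
        upto = len(self._log) if revision is None else revision
        for k, v in reversed(self._log[:upto]):
            if k == key:
                return v
        raise KeyError(key)

    def history(self, key):
        """Audit trail: every (revision, value) ever written for `key`."""
        return [(i + 1, v) for i, (k, v) in enumerate(self._log) if k == key]


store = AppendOnlyStore()
r1 = store.put("marital_status.label", "Marital status")
r2 = store.put("marital_status.label", "Marital status (registered)")
```

A script that recorded revision r1 keeps seeing the original label after the curator's edit, which is the property that reproducibility depends on.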

The RAIRD Remote Storage and Statistical Execution Platform

The RAIRD DataStore (RDS) is a key part of the Remote Storage and Statistical Execution Platform (RSSP) - but there are also other important components.

The RDS is a read-only data store with a simple query interface for extracting data and metadata. The query outputs, on the other hand, are stored and managed in the user’s workspace in the RSSP.

Figure 5 - Client initiated queries and data extraction to User Workspace

Figure 5 above shows how client requests expressed by researchers are converted into RDS queries, and how query outputs are routed to the user’s workspace on a Storage Area Network (SAN) by a component called StatServer.

The StatServer

The StatServer is a completely stateless service that handles all input to and output from the User Workspaces. The component is built using Python/Cython, and leverages the rich and performant statistical capabilities available on that platform to execute analysis as well as all data management and transformation tasks on data residing in the Workspaces.

Data transformations

Data transformation procedures include: dropping/keeping records, dropping/keeping variables, reshaping, merging, etc. Upon such requests, the StatServer reads data from the User Workspace, applies the transformations, and writes the transformed data back to the Workspace. The sequence is illustrated below.

Figure 6 - How data transformation requests are handled
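
The read, transform and write-back sequence can be sketched as a stateless function. This is a minimal sketch, not the StatServer code: the source does not state how workspaces are stored, so the directory of CSV files, the function name keep_variables and the sample data set are all assumptions.

```python
import csv
import tempfile
from pathlib import Path


def keep_variables(workspace, dataset, variables):
    """Read `dataset` from the workspace, keep only `variables`, write it back.

    The function holds no state between calls: everything it needs arrives
    in the arguments, and the result lives only in the workspace.
    """
    path = Path(workspace) / f"{dataset}.csv"
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    kept = [{v: row[v] for v in variables} for row in rows]
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=variables)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)  # only a row count is returned, never the data itself


# Set up a throwaway workspace with one hypothetical data set.
workspace = tempfile.mkdtemp()
(Path(workspace) / "persons.csv").write_text(
    "id,age,income\n1,34,410000\n2,51,530000\n"
)
n_rows = keep_variables(workspace, "persons", ["id", "age"])
result = (Path(workspace) / "persons.csv").read_text()
```

Returning only a row count mirrors the principle that transformations change the workspace without exposing microdata to the client.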

Data analysis

Analytical requests are also handled by the StatServer. Here, the StatServer runs the analysis and returns the provisional analytical output to downstream processes orchestrated by the dispatcher.

Figure 7 - How analytical requests are handled

Corrective actions

For many types of analysis, the provisional output cannot be returned unmodified to the client without disclosure risk. In such cases, a more complex sequence involving more components is invoked. Figure 8 below illustrates how a tabulation request is handled.

Figure 8 - Corrective actions in a tabulation scenario
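
A corrective action on a tabulation can be illustrated with small-cell suppression. The source does not state which disclosure control methods RAIRD applies, so the method, the threshold MIN_CELL_COUNT and the function names below are assumptions chosen purely for illustration.

```python
from collections import Counter

MIN_CELL_COUNT = 5  # hypothetical threshold, chosen only for this example


def tabulate(values):
    """Provisional output: raw frequency count per category."""
    return Counter(values)


def suppress_small_cells(table, threshold=MIN_CELL_COUNT):
    """Corrective action: replace counts below `threshold` with None."""
    return {k: (n if n >= threshold else None) for k, n in table.items()}


values = ["married"] * 12 + ["unmarried"] * 7 + ["widowed"] * 2
provisional = tabulate(values)          # stays on the server side
safe = suppress_small_cells(provisional)  # what would be returned to the client
```

The point of the sketch is the two-step shape: the provisional table is computed first, and only the corrected table leaves the secure area.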

The RAIRD Online Statistical Environment

Since users cannot see or inspect data directly in RAIRD, a range of measures has been taken in order to compensate for this limitation.

The Online Statistical Environment mimics functionality from statistical packages in a web application, but adds metadata and available user interface patterns to enrich the user experience as much as possible. As one would expect, the ROSE has a command window and a script editor, output panels and variable lists. Continuously synchronized variable metadata, dataset “inventory” reports and structural information are easily accessible aids that help users verify and understand the state of their working data without being able to inspect it directly.

The ROSE runs in the browser on JavaScript, and it is developed using ClojureScript, Om and a small selection of additional JavaScript libraries. Om is a ClojureScript wrapper of the React library (developed by Facebook) that uses immutable data structures. The use of this technology improves performance, and has other important benefits as well.

Figure 9 - Screenshot of an early version of the Online Statistical Environment.

The RSSP keeps track of all user activity, and lets researchers build data sets and statistical programs (scripts) and group them together in named workspaces. To handle potentially substantial workloads, the RSSP has been designed from the ground up for performance and scalability.

The RSSP activity recording serves two purposes:

  1. It provides audit trails for the system administrators
  2. It enables revision control for the researcher’s own work

Revision control on the DataStore as well as all activities and user-data within the RSSP furthermore supports both data citation and reproducibility of research conducted within RAIRD.

The functionality of the Online Statistical Environment is similar to the functionality in traditional statistical packages, with the following exceptions:

  • It is web-based and may be used from any location and any modern web browser
  • It supports searching, browsing and exploration of the complete catalogue of DataStores
  • It allows fine-grained import of both status and event variables from DataStores into the user workspaces in the RSSP
  • The state of the workspace (scripts, commands and all working files) is auto-saved, preserved and recreated the next time the user logs into the system
  • Metadata (codebooks, definitions, search, auto-suggest, etc) are fully integrated with the data and easily available in the web interface
  • To prevent statistical disclosure, direct visual access to data is impossible. Users interact with data through metadata and analytical exploration. Analytical and transformational actions are sent to the RSSP which handles the execution and passes safe, anonymous outputs back to the user interface where they are rendered
  • Channels for both technical and data-related support are integrated

Quick access and other benefits for researchers

The most important goal of RAIRD is to increase research on register data by making the data easily available without compromising data privacy.

The guiding principle in the development of RAIRD is to support quick and easy access to data, and to add as much value and functionality as possible within a privacy preserving platform.

The underlying data and metadata structures and the surrounding technological solutions are designed to support interactive data exploration, transformation, analysis and other relevant workflow processes.