RAIRD - Remote Access Infrastructure for Register Data

RAIRD White Paper - High Level Description

Version 0.6 - 2015-11-27

Statistics Norway (SSB) and NSD - Norwegian Centre for Research Data are in the process of establishing a national research infrastructure providing easy access to large amounts of rich high-quality statistical data for scientific research while at the same time managing statistical confidentiality and protecting the integrity of the data subjects.

RAIRD Overview

The purpose of RAIRD

RAIRD is a platform for research on data from Norwegian administrative registers. The intended content will be based on the copies of these data developed for statistical purposes by Statistics Norway. Such data are of great value for research purposes, but access is restricted, in particular because of confidentiality problems related to direct data access.

The main ambition of RAIRD is to significantly reduce the barriers to conducting safe research on register data, in order to increase the volume and quality of research on such data.

To reach the goal, RAIRD must:

  • Be user friendly and sufficiently feature-rich
  • Preserve privacy of data subjects
  • Be cost effective for users and data curators
  • Complement existing access models

RAIRD puts the researcher in the driving seat and supports exploration and analysis of the full variable catalog of available data. This is made possible by automated processes guided by metadata, novel ways of organizing data and a privacy preserving architecture.

Access to data is provided through a metadata-rich interface with a large set of tools for data management, transformation and analysis. The underlying data are the statistical version of the original, unmodified register data; analytical and other outputs from the system are anonymous.

RAIRD - the research platform

Register data largely contain event histories, and therefore have an inherent temporal component that traditional statistical packages and supporting metadata resources have not been able to handle sufficiently well.

Through the RAIRD project, NSD and SSB have developed generic event-oriented data and metadata models. A set of novel and cooperating software components simplifies interaction with and management of complex data sets for researchers. The RAIRD solution has three main components: the RAIRD DataStore (RDS), the RAIRD Remote Storage and Statistical Execution Platform (RSSP) and the RAIRD Online Statistical Environment (ROSE).

The RAIRD DataStore

A RAIRD DataStore builds on the ingest and quality appraisal processes in SSB. After a documentation and restructuring process, data and metadata are ingested into the RAIRD DataStore without any loss of information.

To preserve full data fidelity and utility, the RAIRD DataStore contains complete raw register data; the data themselves are not subjected to anonymization. Instead, the analytic procedures have built-in data protection and the analytical output is subject to automated disclosure control.

The DataStore supports not only event-history data but also panel, cross-sectional and aggregate data.
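
To illustrate how an event-oriented model can also serve cross-sectional and panel views, here is a minimal Python sketch. It is not the RAIRD data model: the `Event` record, its field names, the half-open validity interval and the sample rows are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One spell: `unit_id` holds `value` for `variable` from `start` until `stop`.

    Hypothetical record layout; the interval is half-open, [start, stop).
    """
    unit_id: str
    variable: str
    value: str
    start: date
    stop: Optional[date] = None  # None means the spell is still open


def status_at(events: list[Event], variable: str, when: date) -> dict[str, str]:
    """Cross-section: the value each unit holds for `variable` on `when`."""
    snapshot: dict[str, str] = {}
    for ev in events:
        if ev.variable != variable:
            continue
        if ev.start <= when and (ev.stop is None or when < ev.stop):
            snapshot[ev.unit_id] = ev.value
    return snapshot


def panel(events: list[Event], variable: str, dates: list[date]) -> dict[date, dict[str, str]]:
    """Panel: one cross-section per reference date."""
    return {d: status_at(events, variable, d) for d in dates}


# Invented sample data, for illustration only.
EVENTS = [
    Event("p1", "municipality", "Bergen", date(2001, 3, 1), date(2009, 8, 15)),
    Event("p1", "municipality", "Oslo", date(2009, 8, 15)),
    Event("p2", "municipality", "Tromsø", date(1998, 1, 1)),
]
```

Calling `status_at(EVENTS, "municipality", date(2005, 1, 1))` yields a status variable for that date, while `panel(...)` repeats the lookup over several dates. The point of the sketch is that status and panel data can be derived from event histories, so storing the events loses no information.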

Data and metadata in the DataStore are subject to complete revision control in order to support reproducible research and data citation.

The RAIRD Remote Storage and Statistical Execution Platform

The DataStore is one of several components in the Remote Storage and Statistical Execution Platform. All data copying, merging, transformation and analysis are executed within the RSSP on behalf of the researcher.

All execution is orchestrated from the Online Statistical Environment (ROSE) in the form of user actions that are sent to the RSSP where the actual data processing takes place. Outputs go through automated disclosure control before they are returned to the ROSE for user consumption.
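
The flow described above (a user action is sent to the platform, executed on the microdata, and filtered before anything is returned) can be sketched in a few lines of Python. This is a simplified illustration, not RAIRD's disclosure control: the action format, the function names and the minimum cell count of 5 are assumptions, and a real system would need further rules, such as secondary suppression.

```python
from collections import Counter

MIN_CELL_COUNT = 5  # assumed threshold; the white paper does not state one


def frequency_table(values):
    """Count how often each category occurs."""
    return dict(Counter(values))


def disclosure_control(table, min_count=MIN_CELL_COUNT):
    """Suppress cells with fewer than `min_count` units (None marks a suppressed cell)."""
    return {key: (n if n >= min_count else None) for key, n in table.items()}


def run_action(action, microdata):
    """Stand-in for the server side: execute the action, then filter the output.

    Only the hypothetical command "tabulate" is supported in this sketch.
    """
    if action["command"] != "tabulate":
        raise ValueError(f"unsupported command: {action['command']}")
    column = [row[action["variable"]] for row in microdata]
    return disclosure_control(frequency_table(column))


# Invented microdata; the client only ever sees the filtered result.
MICRODATA = (
    [{"sector": "public"}] * 7
    + [{"sector": "private"}] * 12
    + [{"sector": "other"}] * 2
)

RESULT = run_action({"command": "tabulate", "variable": "sector"}, MICRODATA)
```

Here `RESULT` keeps the counts for "public" and "private" but returns `None` for "other", because that cell falls below the assumed threshold. The microdata never leave `run_action`.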

The RSSP keeps track of all user activity, and lets researchers build data sets and statistical programs (scripts) and group them together in named workspaces. To handle potentially substantial workloads, the RSSP has been designed from the ground up for performance and scalability.

The RSSP activity recording serves two purposes:

  1. It provides audit trails for the system administrators
  2. It enables revision control for the researcher’s own work

Revision control of the DataStore, and of all activities and user data within the RSSP, furthermore supports both data citation and the reproducibility of research conducted within RAIRD.
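
One way a single activity record can serve both purposes is an append-only log: filtering it gives an audit trail, and replaying it up to a given revision rebuilds the researcher's work as it stood then. The Python sketch below assumes this design for illustration only; the class, the action names and the entry layout are hypothetical and not taken from RAIRD.

```python
from dataclasses import dataclass, field


@dataclass
class ActivityLog:
    """Append-only log of user actions (hypothetical layout)."""
    entries: list = field(default_factory=list)

    def record(self, user, action, target, payload=None):
        """Append one action and return its revision number (1-based)."""
        revision = len(self.entries) + 1
        self.entries.append({
            "revision": revision,
            "user": user,
            "action": action,
            "target": target,
            "payload": payload,
        })
        return revision

    def audit_trail(self, user):
        """Purpose 1: everything a given user did, in order."""
        return [e for e in self.entries if e["user"] == user]

    def workspace_at(self, revision):
        """Purpose 2: replay the log up to `revision` to rebuild the scripts."""
        scripts = {}
        for e in self.entries[:revision]:
            if e["action"] == "save_script":
                scripts[e["target"]] = e["payload"]
            elif e["action"] == "delete_script":
                scripts.pop(e["target"], None)
        return scripts


log = ActivityLog()
r1 = log.record("anna", "save_script", "model.txt", "tabulate sector")
r2 = log.record("anna", "save_script", "model.txt", "tabulate sector by year")
r3 = log.record("ola", "delete_script", "model.txt")
```

Because nothing is overwritten, `log.workspace_at(r1)` still returns the first version of the script after later changes, which is the property that reproducibility and citation of a specific revision rely on.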

The RAIRD Online Statistical Environment

The functionality of the Online Statistical Environment is similar to the functionality in traditional statistical packages, with the following exceptions:

  • It is web-based and may be used from any location and any modern web browser
  • It supports searching, browsing and exploration of the complete catalogue of DataStores
  • It allows fine-grained import of both status and event variables from DataStores into the user workspaces in the RSSP
  • The state of the workspace (scripts, commands and all working files) is auto-saved, preserved and recreated the next time the user logs into the system
  • Metadata (codebooks, definitions, search, auto-suggest, etc.) are fully integrated with the data and easily available in the web interface
  • To prevent statistical disclosure, direct visual access to data is impossible. Users interact with data through metadata and analytical exploration. Analytical and transformational actions are sent to the RSSP which handles the execution and passes safe, anonymous outputs back to the user interface where they are rendered
  • Channels for both technical and data-related support are integrated

Quick access and other benefits for researchers

The most important goal of RAIRD is to increase research on register data by making the data easily available without compromising data privacy.

The guiding principle in the development of RAIRD is to support quick and easy access to data, and to add as much value and functionality as possible within a privacy preserving platform.

The underlying data and metadata structures and the surrounding technological solutions are designed to support interactive data exploration, transformation, analysis and other relevant workflow processes.