Frequently Asked Questions (FAQ)

General information about the project
    What is IPUMS-USA?
    What's in the future for IPUMS?
    Does IPUMS-USA add value to the data?
Getting started
    Where should a new user start?
    How do I get access to IPUMS-USA data?
Basic concepts
    What are microdata?
    What are "pointer variables"?
    What are "general" and "detailed" versions of variables?
    What are "weights"?
    What does "universe" mean in the variable descriptions?
Getting data
    How do I obtain data?
    What format are the data in?
    How long does a data extract take?
    What if the samples are too big for me to handle?
    What is "case selection"?
    Why can't I open the data file?
    Is there a preferred statistical package for using the IPUMS?
    Can I analyze IPUMS-USA data without a statistical package?
    Can I get the original data?
    How is a record uniquely identified?
Using IPUMS data
    Are there tricky aspects of IPUMS data to be particularly aware of?
    What are the major limitations of the data?
    Can I find particular individuals in the IPUMS data?
    How do I cite IPUMS-USA?
    Can I use IPUMS for genealogy?

General information about the project

What is IPUMS-USA?

The Integrated Public Use Microdata Series (IPUMS-USA) consists of thirty-nine high-precision samples of the American population drawn from fifteen federal censuses and from the American Community Surveys of 2000-2006. Some of these samples have existed for years, and others were created specifically for this database. The thirty-nine samples, which draw on every surviving census from 1850 to 2000 and on the 2000-2006 ACS, together constitute our richest source of quantitative information on long-term changes in the American population. However, because different investigators created these samples at different times, they employed a wide variety of record layouts, coding schemes, and documentation. This has complicated efforts to use them to study change over time. The IPUMS assigns uniform codes across all the samples and brings relevant documentation into a coherent form to facilitate analysis of social and economic change.

IPUMS is not a collection of compiled statistics; it is composed of microdata. Each record is a person, with all characteristics numerically coded. In most samples persons are organized into households, making it possible to study the characteristics of people in the context of their families or other co-residents. Because the data are individuals and not tables, researchers must use a statistical package to analyze the millions of records in the database. A data extraction system enables users to select only the samples and variables they require.

IPUMS-International is the world's largest collection of publicly available individual-level census data. IPUMS-International integrates samples from population censuses from around the world taken since 1960. Scholars interested only in the United States are better served using IPUMS-USA, which is optimized for U.S. research.

IPUMS-CPS is an integrated set of data from 46 years (1962-2007) of the March Current Population Survey (CPS). This harmonized dataset is also compatible with the data from the U.S. decennial censuses that are part of the IPUMS-USA. Researchers can take advantage of the relatively large sample size of IPUMS-USA at ten-year intervals and fill in information for the intervening years using IPUMS-CPS.

What's in the future for IPUMS?

IPUMS-USA is funded through 2012 by several grants from the National Institute of Child Health and Human Development. In addition to working on new features for the website and data extraction system, we are currently making new high-density samples for 1880, 1900, 1930, and 1960. Over the next four years we plan semi-annual data releases every March and October. The precise sample composition of each data release is hard to predict very far ahead of time, but our latest plans can be found on our data release schedule page.

We have every expectation of continuing the project beyond 2012, but will have to secure further funding as our current grants expire. To be successful, we need to have a large body of users and published works we can point to. Please inform us if you have any presentations or publications using IPUMS data.

Does IPUMS-USA add value to the data?

IPUMS data are integrated over time and across samples by assigning uniform codes to variables. This process itself adds value to the data by fully documenting all codes and compiling all variable documentation in a hyperlinked web format. But we do many other things as well:

IPUMS creates a consistent set of constructed variables on family interrelationships for all samples. The "pointer" variables indicate the location within the household of every person's mother, father, and spouse.

IPUMS data also include harmonized income and occupation variables. The Census Bureau has reorganized its occupational and industrial classification systems in almost every census administered since 1850. Although IPUMS retains the original occupation and industry codes, a variety of occupation and industry variables have been created for long-term analysis. More information on these variables can be found on the Occupation and Industry Variables page.

Getting started

Where should a new user start?

The documentation is a natural starting place for new IPUMS-USA users. First, the User's Guide provides an overview of the database and detailed documentation on the variables and samples available.

The Variables page is the primary tool for exploring the contents of IPUMS-USA. On the variables page, clicking on a variable name brings up its documentation, which contains a description of the variable and discussions of comparability issues over time. The "enumeration text" link compiles all the questionnaire text and instructions pertaining to the census question for every sample. The variables page also has direct links to the codes page for each variable. The codes page shows the coding structure and labels for a variable and the availability of categories across samples. These categories can suggest the types of research possible with a given sample.

The Samples page describes the characteristics of the samples in the data series and the censuses from which they were derived.

If you are already registered to use IPUMS-USA, you can click on "create an extract" and use the data access system. To start, users can reference our instructions for the extraction system.

How do I get access to IPUMS-USA data?

Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.

IPUMS-USA data are also available for analysis online through the IPUMS Online Data Analysis System.

Basic concepts

What are microdata?

Census microdata are composed of individual records containing information collected on persons and households. The unit of observation is the individual. The responses of each person to the different census questions are recorded in separate variables.

Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the IPUMS data.

Microdata are inherently flexible. One need not depend on published statistics from a census that compiled the data in a certain way, if at all. Users can generate their own statistics from the data in any manner desired, including individual-level multivariate analyses.

See an image of IPUMS data here. All IPUMS data are in this general format.

What are "pointer variables"?

The IPUMS "pointer" variables indicate the location within the household of every person's mother, father, and spouse. Nearly all samples indicate the relationship of each person to the head of household, but it is much harder to relate individuals to persons other than the head (for example, grandchildren to children, sons-in-law to daughters, or unrelated persons to each other). We have developed a complex core algorithm to make such connections, and we customize it as needed to account for peculiarities of specific samples. The pointer variables are called MOMLOC, POPLOC and SPLOC in the IPUMS system, and accompanying variables indicate the major rules under which a specific link was made.

The pointer variables make it easy to construct individual-level variables representing the characteristics of co-resident persons, such as occupation of spouse, age of mother, or educational attainment of father. You need to include the serial and person ID variables (SERIAL and PERNUM) in your extract, as well as the pointer variables themselves, to perform these data manipulations.
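The kind of manipulation described above can be sketched in a few lines of Python. This is an illustration only, not IPUMS code: the records are invented, SP_AGE is a made-up variable name, and the sketch assumes that a pointer value of 0 means no spouse is present in the household. In an extract that pools several samples, the lookup key would presumably also need YEAR and DATANUM.

```python
# Hypothetical rectangular extract: one dict per person record.
# SERIAL, PERNUM, and SPLOC are IPUMS variable names; the values are invented.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "SPLOC": 2, "AGE": 40},
    {"SERIAL": 1, "PERNUM": 2, "SPLOC": 1, "AGE": 38},
    {"SERIAL": 1, "PERNUM": 3, "SPLOC": 0, "AGE": 12},
    {"SERIAL": 2, "PERNUM": 1, "SPLOC": 0, "AGE": 67},
]

# Index every person by household serial number and position in the household.
by_position = {(p["SERIAL"], p["PERNUM"]): p for p in persons}


def attach_spouse_age(records, index):
    """Add SP_AGE (a made-up name) holding the co-resident spouse's age."""
    for person in records:
        # Assumption: SPLOC == 0 means no spouse is present, so the lookup
        # finds nothing because no person has PERNUM 0.
        spouse = index.get((person["SERIAL"], person["SPLOC"]))
        person["SP_AGE"] = spouse["AGE"] if spouse else None
    return records


attach_spouse_age(persons, by_position)
```

The same pattern would apply to MOMLOC and POPLOC, for example to attach the age of the mother or the educational attainment of the father.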

What are "general" and "detailed" versions of variables?

Most variables in the IPUMS have a composite coding structure, where the first digit is largely comparable across samples, and second and subsequent digits provide progressively more detail available in some samples and not others. For some highly requested variables, the composite coding structure is formally recognized in our system by distinguishing separate "general" and "detailed" versions of the variable. For example, researchers can access an internationally comparable 1-digit general version of "employment status", or they can use the fully detailed 3-digit version, if their research requires finer distinctions. The two sets of codes are completely consistent with one another; one simply provides more categories, while the other is simpler to use and usually more comparable across samples.

The variables with general versions have a checkbox in the "general version" column in the data extract variable selection screen. Other variables only have the default full-detail version. It is possible to include both the general and the detailed version of a variable in a data extract. Both versions of a variable come with appropriate syntax labels. In data extracts, the detailed version of the variable gets a "D" appended onto the end of its mnemonic (for example, for "employment status," EMPSTAT is general and EMPSTATD is detailed).
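As a rough illustration of a composite coding structure, the Python sketch below collapses a detailed code to its leading digit. The codes are invented, and the sketch assumes the general code is exactly the leading digit of the detailed code, which may not hold for every variable. In practice the general version supplied by the extract system is the safer choice.

```python
def general_code(detailed_code, detail_digits):
    """Drop the trailing detail digits, keeping the leading general digit(s)."""
    return detailed_code // (10 ** detail_digits)


# Invented 3-digit detailed codes; real codes are listed on each
# variable's codes page.
detailed = [110, 120, 210, 300]
general = [general_code(code, detail_digits=2) for code in detailed]
```

Here the detailed codes 110 and 120 both collapse to the general code 1, which is the sense in which the general version "simply provides fewer categories".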

The general and detailed versions of a variable both correspond to the same description in the documentation system. The codes and frequencies of each version are viewable separately on the relevant variable codes page.

What are "weights"?

Many IPUMS samples are unweighted or "flat": every person in the sample data represents a fixed number of persons in the population. Other samples are weighted, with some records representing more cases than others. In the weighted samples, persons and households with some characteristics are over-represented, while others are under-represented. See the PERWT variable or What is IPUMS? page for a listing of the weighted IPUMS-USA samples.

To obtain representative statistics from the weighted samples, users must apply sample weights. Follow one of the following procedures:

1. For person-level analyses using a weighted sample, apply the PERWT variable. PERWT gives the population represented by each individual in the sample.

2. For household-level analyses using a weighted sample, weight the households using the HHWT variable. HHWT gives the number of households in the general population represented by each household in the sample.

Even the unweighted samples have values for HHWT and PERWT, but every record in those samples receives an identical weight. This allows the application of the weight variables in pooled extracts that contain both weighted and unweighted samples. Otherwise, the use of the weights is optional in the unweighted samples.
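The two procedures above can be sketched in Python as follows. The records and weight values are invented, and the household-level step assumes a rectangular extract in which each household is counted once by keeping the record with PERNUM equal to 1. Statistical packages have their own weighting options, which are normally preferable to hand-rolled code.

```python
# Invented person records from a hypothetical weighted sample.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "PERWT": 100.0, "HHWT": 100.0, "AGE": 30},
    {"SERIAL": 1, "PERNUM": 2, "PERWT": 100.0, "HHWT": 100.0, "AGE": 10},
    {"SERIAL": 2, "PERNUM": 1, "PERWT": 300.0, "HHWT": 300.0, "AGE": 50},
]


def weighted_mean(records, value, weight):
    """Mean of `value` with each record counted `weight` times."""
    total_weight = sum(r[weight] for r in records)
    return sum(r[value] * r[weight] for r in records) / total_weight


# Person-level: PERWT gives the population represented by each person.
population_estimate = sum(p["PERWT"] for p in persons)
mean_age = weighted_mean(persons, "AGE", "PERWT")

# Household-level: count each household once, weighted by HHWT.
# Assumption: the record with PERNUM == 1 stands in for its household.
household_estimate = sum(p["HHWT"] for p in persons if p["PERNUM"] == 1)
```

With these invented values the unweighted mean age is 30, while the weighted mean is 38, because the third record represents three times as many people as each of the others.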

What does "universe" mean in the variable descriptions?

The universe is the population at risk of having a response for the variable in question. In most cases these are the households or persons to whom the census question was asked, as reflected on the census questionnaire. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. Cases that are outside of the universe for a variable are labeled "NIU" on the codes page. Differences in a variable's universe across samples are a common data comparability issue.

The universes will not always be entirely clean of apparently erroneous cases. Some persons or households that should not have answered the question did, and some that should have answered may be included in the "NIU" (not in universe) category. But until we perform comprehensive data editing and allocation in the future, we do not know whether the variable in question is in error or whether the variables that define the universe (for example, age or employment status) are incorrect.
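To see why the universe matters in practice, consider the Python sketch below, which computes a share with and without the "NIU" cases. The records are invented, and the numeric codes used here (0 for NIU, 1 for employed) are assumptions made for illustration; the actual codes for any variable are shown on its codes page.

```python
NIU = 0       # hypothetical "not in universe" code
EMPLOYED = 1  # hypothetical "employed" code

# Invented person records; the child is outside the employment universe.
persons = [
    {"AGE": 8, "EMPSTAT": NIU},
    {"AGE": 30, "EMPSTAT": EMPLOYED},
    {"AGE": 45, "EMPSTAT": 2},
    {"AGE": 52, "EMPSTAT": EMPLOYED},
]

# Restrict the denominator to persons who are in the universe.
in_universe = [p for p in persons if p["EMPSTAT"] != NIU]

employed = sum(1 for p in in_universe if p["EMPSTAT"] == EMPLOYED)
share_in_universe = employed / len(in_universe)  # 2 of 3
share_all_records = employed / len(persons)      # 2 of 4, includes the child
```

If the universe differs across samples, the second figure would change for reasons unrelated to employment, which is one way comparability problems arise.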

Getting data

How do I obtain data?

All IPUMS data are delivered through our data extraction system. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. To start, users can reference our instructions for the data extraction system and instructions for opening an IPUMS extract on your computer.

Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. Access to the documentation is freely available without restriction; however, users must register before extracting data from the website.

What format are the data in?

IPUMS produces fixed-column ASCII data. Data are entirely numeric. By default, the extraction system rectangularizes the data: that is, it puts household information on the person records and does not retain the households as separate records. No information is lost, and this is the format preferred by most researchers; however, the extraction system includes the option of hierarchical data or household-record-only data.

In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer.
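For readers curious about what the syntax file is doing, the Python sketch below reads one fixed-column record. The column positions in LAYOUT are invented for illustration; the real positions for an extract come from its codebook and syntax file. The sketch is not a replacement for the supported SPSS, SAS, and Stata syntax files, which also apply variable and value labels.

```python
# Hypothetical layout: variable name -> (start, end) column positions,
# 0-based and end-exclusive. Real positions come from the extract codebook.
LAYOUT = {
    "YEAR": (0, 4),
    "SERIAL": (4, 12),
    "PERNUM": (12, 16),
    "AGE": (16, 19),
}


def parse_record(line, layout):
    """Slice one fixed-column line into integer-valued variables."""
    return {name: int(line[start:end]) for name, (start, end) in layout.items()}


# One invented person record laid out according to LAYOUT.
sample_line = "2000" + "00000042" + "0003" + "035"
record = parse_record(sample_line, LAYOUT)
```

Because the data are entirely numeric, each slice can be converted with int(); meaning is attached to the numbers only through the labels in the syntax file and the codes pages.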

A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.

All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
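As one example of handling the compressed file, Python's standard gzip module can read a .gz file line by line. The sketch below is self-contained: it writes a tiny stand-in file with invented contents and a made-up file name, then reads it back. A statistical package syntax file will still expect an uncompressed file at the path it names.

```python
import gzip
import os
import tempfile


def read_gzipped_lines(path):
    """Yield the text lines of a gzip-compressed ASCII file."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demonstration only: create a small stand-in for a downloaded extract.
demo_path = os.path.join(tempfile.mkdtemp(), "extract_demo.dat.gz")
with gzip.open(demo_path, "wt", encoding="ascii") as handle:
    handle.write("2000000000420003035\n2000000000420004007\n")

lines = list(read_gzipped_lines(demo_path))
```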

How long does a data extract take?

The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts can take from a few minutes to an hour or more. The system sends an email when the extract is completed, so there is no need to stay active on the IPUMS site while the extract is being made.

What if the samples are too big for me to handle?

It is possible to make extracts that are extremely large. There are two ways to reduce file size. You can select fewer samples or variables; or you can use the case selection feature of the extract system to include only records with certain characteristics, such as females age 15 to 49. Simply selecting out cases is not always desirable, however, because you may want all the co-resident persons as well. Accordingly, the case selection function also lets you choose to include everyone living in a household with a person with the selected characteristics.

What is "case selection"?

The "case selection" feature of the data extract system allows users to limit their dataset to contain only records with certain characteristics, such as persons age 65 and older. Multiple variables can be used in combination during case selection. Selections for multiple variables are additive, each being implicitly connected by an "and" for processing purposes.

Simply extracting selected cases can be too crude, however, because you may need the persons who co-resided with your selected population. Accordingly, the case selection function also lets you choose to include everyone living in a household with a person with the selected characteristics.
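The logic of the two options can be sketched in Python. This is only an illustration of the selection rules described above, not the extract system's implementation. The records are invented, and the sketch assumes that SEX code 2 means female.

```python
# Invented person records from three households.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "AGE": 70, "SEX": 2},
    {"SERIAL": 1, "PERNUM": 2, "AGE": 40, "SEX": 1},
    {"SERIAL": 2, "PERNUM": 1, "AGE": 68, "SEX": 1},
    {"SERIAL": 3, "PERNUM": 1, "AGE": 30, "SEX": 2},
]

# Selections on multiple variables are joined by an implicit "and".
conditions = [
    lambda p: p["AGE"] >= 65,
    lambda p: p["SEX"] == 2,  # assumption: 2 codes female
]


def matches(person):
    """True only if the person satisfies every selection condition."""
    return all(condition(person) for condition in conditions)


# Option 1: keep only the persons who match.
selected_only = [p for p in persons if matches(p)]

# Option 2: keep everyone living in a household with a matching person.
selected_households = {p["SERIAL"] for p in selected_only}
with_coresidents = [p for p in persons if p["SERIAL"] in selected_households]
```

In this example only one person satisfies both conditions, but the second option also retains the 40-year-old who lives in the same household.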

Case selection is completely optional. It is invoked in the variable selection screen of the extract system. Mark the checkbox in the "case selection" column for each variable you wish to use to define your extract. (Only selected variables have the case selection functionality.) A subsequent "case selection" screen will let you specify the characteristics of the cases you wish to include in your extract.

Users should be careful with the case selection feature. It is possible to select a specific variable category (e.g., polygamous marriage) that does not exist across all the samples in your extract, thereby inadvertently excluding those samples from your dataset.

Why can't I open the data file?

There are two likely explanations:

1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.

2) You cannot open the data file directly with a statistical package. The file is a simple ASCII file, not a system file in the format of any statistical package. The extract system does, however, generate a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data.

Is there a preferred statistical package for using the IPUMS?

IPUMS supports SPSS, SAS and Stata. The system does not make data files in those formats, but does generate syntax files with which to read in the ASCII data.

Can I analyze IPUMS-USA data without a statistical package?

The IPUMS Online Data Analysis System allows users to analyze all IPUMS-USA samples online. The system performs a wide range of operations on the data based on specifications made by the user, from simple operations such as tabulations to advanced statistical analyses. Examples and screenshots are available on our short instructions page.

Can I get the original data?

For all census years from 1850 to 1950 (except for 1890, for which the schedules were destroyed by fire), the original manuscript population schedules are preserved on microfilm at the National Archives in Washington, D.C. In each year, the microfilm reels and the schedules within reels are organized geographically: alphabetically by state, within states alphabetically by county, and within counties numerically by enumeration district. For census and ACS years since 1960, the census schedules exist in fully machine-readable form.

The basic sources for most of the IPUMS documentation are the documentation provided for each of the individual public use microdata samples. These are listed below. All of them should be available through the Inter-university Consortium for Political and Social Research (ICPSR), P.O. Box 1248, Ann Arbor, MI, 48106.

How is a record uniquely identified?

Three variables constitute a unique identifier for each household record in the IPUMS: YEAR, DATANUM, and SERIAL (year, data set number, and household serial number).

Four variables constitute a unique identifier for each person record in the IPUMS: YEAR, DATANUM, SERIAL, and PERNUM (year, data set number, household serial number, and person serial number).
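A short Python sketch shows how these composite keys might be built and checked. The records are invented and the helper names are made up for illustration.

```python
from collections import Counter

HOUSEHOLD_KEY = ("YEAR", "DATANUM", "SERIAL")
PERSON_KEY = HOUSEHOLD_KEY + ("PERNUM",)


def key_of(record, fields):
    """Build a composite identifier from the named variables."""
    return tuple(record[field] for field in fields)


def duplicate_keys(records, fields):
    """Return every composite key that occurs more than once."""
    counts = Counter(key_of(record, fields) for record in records)
    return [key for key, count in counts.items() if count > 1]


# Invented person records pooled from two census years.
persons = [
    {"YEAR": 1990, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1},
    {"YEAR": 1990, "DATANUM": 1, "SERIAL": 1, "PERNUM": 2},
    {"YEAR": 2000, "DATANUM": 1, "SERIAL": 1, "PERNUM": 1},
]

person_duplicates = duplicate_keys(persons, PERSON_KEY)
serial_only_duplicates = duplicate_keys(persons, ("SERIAL", "PERNUM"))
```

With the full four-variable key no duplicates are found. Dropping YEAR and DATANUM makes the first and third records collide, which is why all four variables are needed in pooled extracts.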

Using IPUMS data

Are there tricky aspects of IPUMS data to be particularly aware of?

Some samples are weighted: each individual does not represent the same number of persons in the population. It is important to use the weight variables when performing analyses with these samples. See the PERWT variable for a listing of the weighted IPUMS-USA samples. For other samples the use of weights is optional.

The 2000-2004 ACS samples do not include persons in group quarters.

It is important to examine the documentation for the variables you are using. The codes and labels for variable categories do not tell the whole story. In other words, the syntax labels are not enough. There are two things to pay particular attention to. The universe for a variable -- the population at risk for answering the question -- can differ subtly or markedly across samples. Also, read the variable comparability discussions for the samples you are interested in. Important comparability issues should be mentioned there. If a variable is of particular importance in your research (for example, it is your dependent variable), you are also well served to read the enumeration text associated with it. This text is linked directly to the variable, so it is quite easy to call it up.

By default, the extract system rectangularizes the data: it puts the household information on the person records and drops the separate household record. This can distort analyses at the household level, because the number of observations will be inflated to the number of person records. To get the proper number of household observations, you can either select the first person in each household (PERNUM = 1) or select the "hierarchical" box in the extract system. The rectangularizing feature also drops any vacant households, which are otherwise available in some samples. Despite these complications, the great majority of researchers prefer the rectangularized format, which is why it is the default output of our system.
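The inflation problem can be made concrete with a small Python sketch. The records and income values are invented, and HHINCOME is used here simply as an example of a household-level variable repeated on each person record.

```python
# Invented rectangular extract: household income repeats on every person.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "HHINCOME": 30000},
    {"SERIAL": 1, "PERNUM": 2, "HHINCOME": 30000},
    {"SERIAL": 1, "PERNUM": 3, "HHINCOME": 30000},
    {"SERIAL": 2, "PERNUM": 1, "HHINCOME": 90000},
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Naive: every person record counts, so large households are over-counted.
mean_over_persons = mean(p["HHINCOME"] for p in persons)

# Household-level: keep one record per household (the first person).
households = [p for p in persons if p["PERNUM"] == 1]
mean_over_households = mean(h["HHINCOME"] for h in households)
```

The naive mean is 45000 because the three-person household is counted three times, while the mean over the two households is 60000. Any sample weights would still need to be applied on top of this selection.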

What are the major limitations of the data?

The data are composed entirely of individual person and household records from population censuses. There are no macroeconomic, business, or aggregate statistics. We do not deliver the published statistics from the population censuses.

IPUMS is composed entirely of sample data, with sample densities ranging from 1 percent to 5 percent of national populations. Some subpopulations may be too small to study with the sample data.

Because the data are public-use, measures have been taken to assure confidentiality. Names and other identifying information are suppressed. Most importantly for many researchers, geographic information is usually limited, sometimes severely.

Can I find particular individuals in the IPUMS data?

No. A variety of steps have been taken to ensure the confidentiality of the data. Most fundamentally, the modern samples do not contain names or addresses. The data are only samples, so there is no guarantee any given individual will be in the dataset.

How do I cite IPUMS-USA?

Reports and publications using IPUMS-USA data must be cited appropriately. The citation is:

Steven Ruggles, Matthew Sobek, Trent Alexander, Catherine A. Fitch, Ronald Goeken, Patricia Kelly Hall, Miriam King, and Chad Ronnander. Integrated Public Use Microdata Series: Version 3.0 [Machine-readable database]. Minneapolis, MN: Minnesota Population Center [producer and distributor], 2004.

Any publications, research reports, presentations, or educational material making use of the data or documentation should be added to our Bibliography. Continued funding for the IPUMS depends on our ability to show our sponsor agencies that researchers are using the data for productive purposes.

Can I use IPUMS for genealogy?

The use of the data for genealogy is expressly prohibited in the user license agreement to which all persons must agree. Ancestry.com provides information from the census that can be used for genealogical research.