Public Data Set Tips

Although several federal agencies offer public use data suitable for thesis work, one of the most extensive and easy-to-use sources is a data repository called the Inter-University Consortium for Political and Social Research (ICPSR). While many public use data sets are small enough to be processed easily on a desktop computer, others are large and require significant computing power. Students should consider this when selecting data sets for thesis analysis.

It can be informative to review the types of journals that have published research based on the data set you are considering. Students may wish to focus on data sets that have supported studies published in particular types of journals, such as epidemiology journals or top-tier journals in their field. The limitations sections of these papers can also be instructive about the limits of the data's scope.

Although most public use data sets receive an exemption from full board IRB review, practices vary and researchers should check with their respective IRBs regarding whether their particular use of thesis data requires separate IRB approval.

Researchers should cite the electronic data files and documentation used in their analyses in their bibliography. The format for citing the data and documentation in the thesis bibliography is available on the ICPSR website. At the same website, students can inform ICPSR that they have used its data in their theses. This facilitates ICPSR's compilation of bibliographic references reporting on how data obtained from its site have been used.

The Inter-University Consortium for Political and Social Research (ICPSR)

ICPSR offers data use tutorials and instructions for new users. An easy-to-use search capability facilitates the identification of data sets by topic. Once a list of data sets has been identified by topic or subject area, one can access data set descriptions, data, and data documentation.

Computer code, including SAS code, is frequently available to facilitate creation of an initial analytic data set once a particular raw data file has been identified. The inclusion of literature and citations for published studies allows easy review of the kinds of questions and issues other researchers have investigated using the selected data set. Columbia University is a member of ICPSR, so students generally have free electronic access to data listed on the site. Data on CD-ROMs may require purchase.

Identifying Thesis Suitable Data Sets Held by ICPSR

  • Use the search engine available on the site to browse data sets by subject area. Under search, select data holdings. One can search all fields or a particular field: title of data, data set number, investigator, or subject terms. Type in the subject for which you wish to locate data. For example, AIDS will produce over 100 data sets, not all of which may be appropriate to your research question. Similarly, epidemiology, mental health, substance abuse, and other terms will produce lists of data sets one will want to examine more closely.
  • If the summary description looks promising, examine the actual codebook, which lists the variables available in the data set. Check the sample size to ensure that it is adequate to answer the research question you plan to ask and that the data set is within the computing capacity available to you.
  • Before finalizing your selection, review the bibliographical section that accompanies the data. This bibliography provides a list of prior peer-reviewed published studies and reports that have used the particular data set you are considering.
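
The check in the second step, whether a data set is within your computing capacity, can be roughed out from the codebook before downloading anything. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, every value is assumed to be stored as an 8-byte number, the safety factor of 3 (room for working copies during analysis) is an arbitrary choice, and the 100-variable count is invented for the example. The 8 million row count echoes the hospital inpatient data set mentioned later in this document.

```python
def estimated_memory_mb(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size, in megabytes, of a table of numeric values.

    Assumes every value occupies bytes_per_value bytes (8 bytes is the size
    of a double-precision number); text variables can take far more space.
    """
    return n_rows * n_columns * bytes_per_value / 1_000_000


def fits_in_memory(n_rows: int, n_columns: int, available_mb: float,
                   safety_factor: float = 3.0) -> bool:
    """Return True if the table, padded by a safety factor, fits in memory.

    The safety factor is an assumption: statistical software often needs
    several times the raw data size while sorting, merging, or modeling.
    """
    return estimated_memory_mb(n_rows, n_columns) * safety_factor <= available_mb


if __name__ == "__main__":
    # Hypothetical codebook figures: 8 million records, 100 variables,
    # checked against a computer with 16,000 MB (about 16 GB) of memory.
    size_mb = estimated_memory_mb(8_000_000, 100)
    print(f"Estimated size: {size_mb:,.0f} MB")
    print("Fits comfortably:", fits_in_memory(8_000_000, 100, available_mb=16_000))
```

If the estimate exceeds what your computer can hold, consider whether a subset of records or variables would still answer the research question, or whether other computing resources are available to you.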

IRB Guidelines for ICPSR Data
Information regarding federal regulations governing research projects using ICPSR data is available at the ICPSR website and at the Columbia University website. In many institutions, secondary analysis of existing data sets from pre-approved sources, such as ICPSR, falls into an exempt category. Although these decisions are based on federal law, the decision of which studies require IRB approval is made on an institution-by-institution basis.

Federal Agency Sources for Secondary Data

National Center for Health Statistics
The National Center for Health Statistics (NCHS) has a well-developed survey and data collection system providing a wealth of data of epidemiologic interest to researchers. NCHS says it has "two major types of data systems: systems based on populations, containing data collected through personal interviews or examinations; and systems based on records, containing data collected from vital and medical records. Some NCHS data systems and surveys are ongoing annual systems while others are conducted periodically." The scope and descriptions of its data can be viewed on its website.

Agency for Healthcare Research and Quality
The Agency for Healthcare Research and Quality (AHRQ) has several data sets and data series available for secondary data analysis. State- and federal-level hospital data are available, including one data set of 8 million hospital inpatient stays representing a 20% sample of U.S. community hospitals. Most require purchase, but students receive heavily discounted rates, sometimes as little as 10% of the $200 price charged to other investigators. The HCUP data series is a particularly rich source of data covering virtually every health condition or injury mechanism found among hospitalized patients.