Microdata Files and Data Licensing

The National Center for Science and Engineering Statistics (formerly the Division of Science Resources Statistics), like all Federal agencies, is bound by the Privacy Act of 1974 to protect the confidentiality of the records it maintains about individuals. Further, even when data are not covered by the Privacy Act, it may be necessary to assure respondents (both individuals and institutions) that we will not divulge the information they provide to us except in a format that will not permit identification of the respondent. We are, of course, obligated to honor all such assurances.

In some cases protection of confidentiality is fairly straightforward. We may simply need to delete identifying information (such as name and address) from the records. In other cases, however, such straightforward methods may not be adequate. This is true for most of NCSES's microdata files that contain information about individuals.

When we believe that we cannot issue a data file containing complete records from a survey, we attempt to develop a public use file that provides researchers with as much microdata as feasible, given our need to protect respondent confidentiality. We achieve this goal by suppressing selected fields and/or recoding variables.
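As a rough illustration of what suppressing fields and recoding variables can look like in practice, the following minimal Python sketch drops direct identifiers and collapses an exact value into a broad band. The field names (name, address, age), the age cut points, and the function names are assumptions made up for this example; they are not taken from any NCSES file layout or procedure.

```python
from typing import Any

# Hypothetical identifying fields to suppress (not an NCSES file layout).
SUPPRESSED_FIELDS = {"name", "address"}


def recode_age(age: int) -> str:
    """Collapse an exact age into a broad band (illustrative cut points)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 or older"


def make_public_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record with identifiers dropped and age recoded."""
    # Suppression: leave out every field listed as identifying.
    public = {k: v for k, v in record.items() if k not in SUPPRESSED_FIELDS}
    # Recoding: replace the exact age with a coarser category.
    if "age" in public:
        public["age_band"] = recode_age(public.pop("age"))
    return public
```

Coarser categories reduce the chance that a rare combination of values points to one respondent, at the cost of analytic detail. That trade-off is the one described above.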

In some cases NCSES staff believe that protection of respondent confidentiality would require such extensive recoding that the resulting file would have little, if any, research utility. In these cases we do not issue a public use file. However, we have developed a variety of methods to help individuals use the data in this situation. Some researchers are able to state their needs for tabulations or other statistics with sufficient specificity that the necessary summary information can be provided without access to microdata. In other cases, NSF and the researcher can execute a license agreement that permits the researcher to use the data files at NSF's offices in Arlington, Virginia, or at the researcher's academic institution. For details on obtaining a data license, see the NSF/NCSES Restricted-Use Data Procedures Guide.

Microdata files for the following surveys may be obtained under a license agreement with NSF.

  • Survey of Earned Doctorates
  • Survey of Doctorate Recipients
  • National Survey of Recent College Graduates
  • SESTAT Integrated Data File

NCSES Policy on Matching Data to Restricted-Use Data Sets (Word format, DOC 44K; PDF format, PDF 25K). Revised: October 19, 2006

Researchers seeking information about using the NCSES restricted data files, or about NSF confidentiality and privacy policies, are requested to contact:

Dr. Nirmala Kannankutty

Senior Advisor
National Center for Science and Engineering Statistics, Room 965
National Science Foundation
4201 Wilson Boulevard
Arlington, VA 22230
Phone: (703) 292-7797

For information on licensing procedures, please contact:

Adrian McQueen

Licensing Coordinator
National Center for Science and Engineering Statistics, Room 965
National Science Foundation
4201 Wilson Boulevard
Arlington, VA 22230
Phone: (703) 292-7807

Public Use Files

GSS Public Use Files

The Survey of Graduate Students and Postdoctorates in Science and Engineering (also known as the graduate student survey, or GSS) is an annual census of all known academic institutions in the United States that grant master's degrees or research doctorates, make postdoctoral appointments, or employ doctorate-holding nonfaculty researchers in science, engineering, and selected health fields. Data on graduate student enrollment are collected by field of study from administrative records. Data are collected at the organizational unit level (GSS-eligible department, degree-granting program, research center, or health facility within the reporting school).

Data include the citizenship and racial/ethnic background of graduate students enrolled; counts of full-time graduate students by source and mechanism of support and by sex; and counts of part-time graduate students by sex. The survey also requests count data on postdoctoral appointees by source of support, sex, and citizenship, with separate data on those holding first-professional doctorates in health fields, and requests summary information on doctorate-holding nonfaculty research personnel.

Data from the GSS are made available in public use data files. The files include publicly releasable data for all the years of the survey (1972–2010).

Tools

WebCASPAR (Integrated Science and Engineering Resources Data System)

WebCASPAR is a database system containing information about academic science and engineering resources and is newly available on the World Wide Web. Included in the database is information from several of NCSES's academic surveys plus information from a variety of other sources, including the National Center for Education Statistics. The system is designed to provide multiyear information about individual fields of S&E at individual academic institutions. The system provides the user with opportunities to select variables of interest and to specify whether and how information should be aggregated. Information can be output in hard copy form or in Lotus, Excel or SAS formats for additional manipulation by the researcher.

IRIS (Industrial Research and Development Information System)

IRIS links an online interface to a historical database with more than 2,500 statistical tables containing all industrial research and development (R&D) data published by NSF from 1953 through 2007. These tables are drawn from the results of NSF's annual Survey of Industrial Research and Development, the primary source for national-level data on U.S. industrial R&D.

IRIS resembles a databank more than a traditional database system. Rather than firm-specific microdata, it contains the most comprehensive collection of historical national industrial R&D statistics currently available. The tables in the database are Excel spreadsheets, which are easily accessible either by defining various measures (e.g., total R&D) and dimensions (e.g., size of company) of specific research topics or by querying the report in which the tables were first published.

SESTAT (Scientists and Engineers Statistical Data System)

SESTAT is a comprehensive and integrated system of information about the employment, educational, and demographic characteristics of scientists and engineers in the United States. It is intended for both policy analysis and general research, with features for both casual and more intensive data users.

SESTAT currently contains data from three NSF-sponsored demographic surveys, including 1999 survey responses from about 100,000 individuals. The NSF surveys provide compatible data that have been merged into a single integrated data system. Together, these samples statistically represent about 13 million persons with science and engineering degrees. For additional information about the SESTAT system and the data it contains, see "SESTAT: A Tool for Studying Scientists and Engineers in the United States" (NSF 99-337).

SED Tabulation Engine

The SED Tabulation Engine, a pilot data tool that NSF is testing, will provide access to selected variables from the Survey of Earned Doctorates (SED). It complements the WebCASPAR tool by performing tabulations on data from 2006 and beyond. The tabulation engine includes a disclosure control mechanism that protects the identity of respondents when tabulations use the gender, citizenship, and race/ethnicity variables.
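The details of the tabulation engine's disclosure control mechanism are not described here. One common general technique for protecting respondents in tabulations is small-cell suppression, sketched below in Python purely as an illustration. The threshold of 5, the function name, and the record layout are assumptions for this example and should not be read as the rules the SED Tabulation Engine applies.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # hypothetical threshold, chosen only for illustration


def tabulate(records, variables, min_cell=MIN_CELL_SIZE):
    """Count records per combination of variables, masking small cells.

    Cells with fewer than min_cell records are reported as None so that
    a rare combination of characteristics cannot single out a respondent.
    """
    counts = Counter(tuple(r[v] for v in variables) for r in records)
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}
```

A production system would typically need more than this, for example suppressing additional cells so that a masked count cannot be recovered by subtraction from row or column totals.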

The SED began in 1957–58 to collect data continuously on the number and characteristics of individuals receiving research doctoral degrees from all accredited U.S. institutions. Information from this survey becomes part of the Doctorate Records File (DRF). The DRF contains data on all earned doctorates granted by U.S. universities in all fields from 1920 to the present. The results of this annual survey are used to assess characteristics and trends in doctorate education and degrees. This information is vital for educational and labor force planners within the federal government and in academia.

Social and Economic Implications of Information Technologies: A Bibliographic Database Pilot Project (Road Maps)
Archived December 2006

Computerized bibliographic search algorithms and consultations with research experts were used to identify over 5,000 data sets, research papers and books, and Web sites that provide insights about the social and economic implications of information, communications, and computational technologies (IT). Citations to these works have been sorted into a series of searchable listings called Road Maps and include materials compiled from 1998 through 2004.

