Fragile families and child wellbeing study codebook

5/2/2024

Social scientists working with public data rely on metadata systems to navigate, interpret, and prepare data sets for analysis. Metadata systems are critical research infrastructure: they provide researchers with an overview of the data, enable them to make informed choices about data preparation (recoding responses, dropping observations, etc.), and scaffold other crucial data processing steps that precede statistical modeling. Traditionally, metadata systems in the social sciences have been formatted as sets of questionnaires, codebooks, and other written documentation. Learning to use these materials proficiently is widely considered a “massive professional investment” (Abbott 2007; also see Freese 2007), particularly for researchers working in areas that draw heavily on data collected through complex, longitudinal survey designs.

Recently, researchers across the social sciences have begun to analyze data in new ways by applying techniques from machine learning. Algorithmic approaches to specifying models and selecting variables have been used to enhance existing approaches in explanatory social research, and techniques designed for optimal predictive modeling and data exploration open social science to a complementary set of analytic goals (Athey forthcoming; McFarland, Lewis, and Goldberg 2014; Mullainathan and Spiess 2017; Watts 2014). Yet machine-learning methods also amplify the costs and challenges of data preparation. Existing metadata systems can support standard methodological approaches in survey research, in which researchers typically construct models using a small number of variables. But these systems do not scale well to machine-learning methods, a setting in which researchers regularly work with hundreds or thousands of variables. As machine-learning methods become more popular, researchers will need to design new metadata systems that can facilitate the use of these techniques.

In this article, we explore one approach to designing metadata systems: treating metadata as data. As we describe in more detail below, this design principle emerged from observing the experiences of participants in the Fragile Families Challenge (FFC; for more on the FFC, see the introduction to this special collection) as they attempted to navigate the metadata system for the Fragile Families and Child Wellbeing Study (FFCWS). As we observed FFC participants, a unifying theme emerged: the task of preparing the data was a major obstacle, often preventing users from engaging more fully in the predictive modeling task at the heart of the challenge. Participants reported substantial difficulty in extracting basic information about each variable, frequently requested machine-readable metadata that were not available at the time of the FFC, and occasionally attempted to construct important metadata fields (e.g., variable types) independently. Our subsequent redesign of the FFCWS metadata system follows their lead: we transformed a human-readable set of PDF documents into a machine-actionable system organized around a single comma-separated value (CSV) file, containing comprehensive metadata on all variables collected since the start of the study. The redesigned system standardizes existing variables, provides an expanded set of metadata fields that reveal the data creators’ previously tacit knowledge about each variable, and makes the metadata available in a wide range of formats that support both manual and automated reading. This new metadata system streamlines the task of preparing FFCWS data for analysis, and we hope that it inspires future work to better scaffold new forms of data analysis in the social sciences.
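To make the idea of treating metadata as data concrete, here is a minimal sketch of how a single CSV metadata file could be queried programmatically, for example to pick out all variables of a given type. The column names (`name`, `type`, `wave`, `label`), the variable names, and the helper functions are hypothetical and chosen only for illustration; they are not the actual FFCWS metadata schema.

```python
import csv
import io

# Hypothetical metadata file contents: one row per variable.
# Column names and values are illustrative, not the actual FFCWS schema.
SAMPLE_METADATA = """name,type,wave,label
var_001,continuous,1,Example numeric item
var_002,categorical,1,Example multiple-choice item
var_003,continuous,2,Another numeric item
var_004,string,2,Example free-text item
"""


def load_metadata(text):
    """Parse CSV metadata text into a list of dicts, one per variable."""
    return list(csv.DictReader(io.StringIO(text)))


def select_variables(metadata, **criteria):
    """Return names of variables whose metadata match every criterion."""
    return [
        row["name"]
        for row in metadata
        if all(row.get(field) == value for field, value in criteria.items())
    ]


metadata = load_metadata(SAMPLE_METADATA)

# Select variables by type, then by type and wave together.
# csv.DictReader yields every value as a string, so wave is compared as "2".
continuous_vars = select_variables(metadata, type="continuous")
wave2_continuous = select_variables(metadata, type="continuous", wave="2")

print(continuous_vars)
print(wave2_continuous)
```

With metadata in this form, selecting hundreds of variables by type or wave becomes a filter over rows rather than a manual search through PDF codebooks.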
