Study Design and Data Analytics

The new Study Design and Data Analytics Facility Core (Data FC) provides intellectual and computational resources for planning, conducting, and analyzing studies. It continues to support existing services, including:

  • design of observational, clinical, and experimental studies;
  • biostatistical consultations;
  • data management.

The Core’s expanded capacities include:

  • processing and analyzing -omics data;
  • training and guidance on handling and analyzing Geographical Information Systems (GIS) data;
  • identifying, accessing, and analyzing the growing number of large data repositories in the public domain (“big” data).

Throughout these activities, the Core will:

  • offer educational and training programs on the use of these methods;
  • assist with analysis and interpretation of data, and manuscript preparation;
  • foster attention to the development of new methods/tools.

The combination of new and existing services in the Core will expand the opportunities for, and the effectiveness of, using the rich data that CEHNM members generate, as well as data already available in-house or in the public domain, to answer groundbreaking questions in environmental health research.

Core members have expertise in all aspects of research planning, including study design; measurement of exposures, covariates, and outcomes; data management; spatial and temporal analysis; statistical power calculations; and statistical analysis.

Core Director: prf1 [at] columbia.edu (Pam Factor-Litvak, PhD)

    Study Design Consultations:  prf1 [at] columbia.edu (Pam Factor-Litvak, PhD) and jlt12 [at] columbia.edu (J.L. Thompson, PhD)

    GIS Education: jp3323 [at] columbia.edu (Jeremy Porter, PhD)

    Public Domain Data: prf1 [at] columbia.edu (Pam Factor-Litvak, PhD) and sw2206 [at] columbia.edu (Shuang Wang, PhD)

    Biostatistics:  xl26 [at] columbia.edu (Xinhua Liu, PhD)

    -Omics Support:  sw2206 [at] columbia.edu (Shuang Wang, PhD)

    Data Management Leader:  jlt12 [at] columbia.edu (J.L. Thompson, PhD)

    Data Manager:  rb539 [at] columbia.edu (Richard Buchsbaum)
 

Description of Services

The goal of the Study Design and Data Analytics Facility Core is to respond to the growing need for complex designs and advanced computational methods and tools in environmental health sciences by providing a wide array of services that support the highly interdisciplinary research of the CEHNM. The Data FC is integrated into the new Facility Core structure, which also includes the Integrative Health Sciences Facility Core (IHSFC) and the Exposure Assessment Facility Core (Exposure Core). This structure gives our investigators an effective pipeline of science in which the IHSFC serves as the port of entry and coordinates with the other Facility Cores, bringing in the Community Engagement Core, when appropriate, to facilitate community-based participatory research approaches or to translate final results. In this pipeline, the Data FC provides conceptual input on study design and power calculations, data collection, and database support in a study’s first stages. The IHSFC and the Exposure Core then support collecting, measuring, processing, and storing environmental and biological samples. The Data FC then continues with data management, data analysis, and bioinformatics services, culminating in support for data interpretation and for presenting and reporting results. Investigators may elect to use all of the services to fully support their studies or select only those needed to complement their expertise and research focus. For instance, basic scientists may benefit greatly from molecular/omic wet-lab services and bioinformatics but may not be interested in specialized support for study design, such as questionnaire design. All services adhere to the NIH Rigor and Reproducibility Policy.

Facilities

To meet the growing needs of data management, the Data FC has built and maintains a state-of-the-art, comprehensive infrastructure that includes a highly secure network, an extensive data entry facility, and wide-ranging software capabilities. The data management facility’s computing center is housed in, and provided and maintained by, the Statistical Analysis Center (SAC) in the Department of Biostatistics. The SAC has computer resources in two locations: its offices at Mailman and a secure computer co-location facility in downtown Manhattan, which also serves as an emergency recovery center. That facility, the New York Internet Company (NYIC), provides a high-security, high-uptime data infrastructure, including multiple redundant power supplies and internet access. SAC servers at NYIC are protected by both hardware and software firewalls and by continual process monitoring. NYIC provides on-site staff 24 x 7 x 365 for routine issues, and the SAC has an ongoing maintenance contract with Technology Campus, Inc. to provide advanced technical assistance. All servers hosted at NYIC are backed up regularly to both tape and hard-drive media, and copies are regularly stored off-site. The computer facility at Mailman resides behind the campus firewall and also employs software firewalls. These computers are backed up daily to removable hard drives, which are routinely stored off-site.

The CEHNM’s resources include specialized software libraries for data analysis and management. These programs have been written and compiled by Data FC statisticians and other members of the Biostatistics Department. The software library includes a wide variety of applications written in APL by Dr. Bruce Levin, a senior biostatistician, who has granted CEHNM members access. It includes software for exact discrete-time survival analysis with arbitrarily structured time-dependent covariates and unlimited numbers of tied observations, exact analysis of multinomial and related distribution problems, logistic regression and conditional likelihood analysis for matched and finely stratified samples, and adaptable statistics and graphing tools. All Facility Core members are proficient in the R programming language, in standard statistical software packages such as SAS, SPSS, and Stata, and in many specialized packages and applications.

Data management tools include Microsoft Access, Structured Query Language (SQL), and Research Electronic Data Capture (REDCap) for building and managing online survey databases. Using these packages, Mr. Buchsbaum and his predecessor, Ms. Diane Levy, developed custom project management software routines and custom databases for multiple large projects, along with modules that facilitate data entry and data cleaning, answer data queries, and display data. The database includes capabilities for remote data entry. This data management system now accommodates a wide array of CEHNM studies. The Data FC team has developed conventions for naming and coding variables, making it possible to harmonize data between studies if warranted.
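As a rough illustration of how shared naming and coding conventions make harmonization possible, the sketch below maps two studies' variables onto one common scheme. It is not the Data FC's actual system: the study labels, variable names, codes, and the helper functions `harmonize` and `pool` are all hypothetical and chosen only for the example.

```python
# Hypothetical codebooks: each study's own column name -> shared variable name.
CODEBOOKS = {
    "study_a": {"AGE_YRS": "age_years", "SEX": "sex", "PB_BLOOD": "blood_lead_ug_dl"},
    "study_b": {"age": "age_years", "gender": "sex", "bpb": "blood_lead_ug_dl"},
}

# Hypothetical recodes: each study's own value codes -> shared coding scheme.
RECODES = {
    "study_a": {"sex": {"1": "male", "2": "female"}},
    "study_b": {"sex": {"M": "male", "F": "female"}},
}


def harmonize(study, record):
    """Rename and recode one record (a dict) to the shared convention."""
    names = CODEBOOKS[study]
    recodes = RECODES.get(study, {})
    out = {"study": study}  # keep the source study so pooled rows stay traceable
    for raw_name, value in record.items():
        if raw_name not in names:
            continue  # variables with no shared name are left out of the pooled data
        shared = names[raw_name]
        # Apply a recode if one is defined for this variable; otherwise keep the value.
        out[shared] = recodes.get(shared, {}).get(value, value)
    return out


def pool(studies):
    """Combine records from several studies into one harmonized list."""
    return [harmonize(study, rec) for study, recs in studies.items() for rec in recs]


if __name__ == "__main__":
    pooled = pool({
        "study_a": [{"AGE_YRS": "34", "SEX": "2", "PB_BLOOD": "1.8"}],
        "study_b": [{"age": "41", "gender": "M", "bpb": "2.4", "site": "x"}],
    })
    for row in pooled:
        print(row)
```

Keeping the mappings in plain lookup tables, separate from the code that applies them, means a new study can be added by writing a codebook rather than changing the pooling logic.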

The Department of Systems Biology (DSB) provides high-performance computing and storage for more than 1,000 academic scientists. DSB's computer cluster consists of more than 5,000 cores with 40 TB of RAM. A mesh network interconnects all cluster nodes at 10 Gbps. The cluster has 240 Gbps of connectivity to multiple network-attached storage arrays, which make available nearly 5 PB of highly resilient distributed storage. DSB provides a variety of best-practice data storage protocols, including HIPAA-compliant security measures, as well as regular data snapshots, replication, and offsite backup.