Data Science Summit Models the Future of Public Health

January 29, 2020

Ask X number of scientists to define data science, and you could get X^X different definitions. What’s certain, however, is that this fast-growing field is transforming just about every discipline, not least of all public health. A daylong Data Science for Public Health Summit organized by Columbia University Mailman School of Public Health brought together public health leaders to consider the many dimensions of data science in public health, including how its methods can answer complex research questions with attention to scientific rigor and ethical values.

In opening remarks, Dean Linda P. Fried said “analytic tools that can broadly be defined under the umbrella of data science are a major part of the solutions we need” for contemporary challenges from chronic diseases to the health effects of climate change. “Schools of public health must become facile in a wide range of tools and techniques that fall under data science. … We have to be able to use data science tools and approaches to solve our research problems, but just as importantly, we have to impart these competencies to our future leaders.”

The January 17 Data Science Summit took place at a moment when the Columbia Mailman School and the broader university are investing in the field. The Columbia Data Science Institute serves as a hub for collaborations among 350 faculty from every corner of the university, including public health. Together, the researchers work to advance techniques to gather and interpret data to better address urgent problems facing society. In the past year, the Institute announced six seed awards; two of those grants involve Columbia Mailman researchers. Meanwhile, new opportunities are becoming available for public health students and post-docs to hone data science skills like machine learning.

Summit attendees from more than 60 public health schools gathered in a large meeting room in the Vagelos Education Center overlooking the Hudson River. Among them were statisticians, mathematicians, biostatisticians, physicists, biologists, epidemiologists, economists, environmental scientists, demographers, sociologists, and computer scientists, many of whom consider themselves data scientists. Discussions featured leaders from across academia, including biostatisticians from many of the top schools of public health. Commenting on the diversity of data science stakeholders and varied definitions of the field, Gary Miller, the Columbia Mailman School’s Vice Dean for Research Strategy and Innovation and the Summit’s organizer, said, “If it matters to the people in this room then it is data science.”

While there is no agreed-upon definition of data science, Jeanette Wing, Avanessians Director of the Columbia Data Science Institute, said the field can be summed up as “the study of extracting value from data.” The word “value” is subject to the definition of the end user, but data science must not be value-free in an ethical sense, she added. Instead, the field must “ensure responsible use of data to benefit society.”

The keynote speaker was Robert Tibshirani, professor of Biomedical Data Science & Statistics at Stanford University and a leading figure in the development of quantitative methods. He gave an overview of data science techniques that “train” algorithmic models to make predictions related to a given research question. These popular and powerful artificial intelligence tools have offered valuable insights on public health questions, such as by forecasting seasonal influenza outbreaks. Even so, Tibshirani argued that statistics and biostatistics should always have a seat at the research table. The latter fields are uniquely positioned to optimize study design, uncover biases, and assess the generalizability of findings. “We can’t take the shortcut to running algorithms,” he said.

Of course, researchers aren’t the only ones excited by data science. Students, too, are eager to run their first algorithm. Kiros Berhane, the new chair of biostatistics at the Columbia Mailman School, explained that when courses aren’t offered, students will “take matters into their own hands” to learn these data science skills independently. Columbia Mailman has enriched its biostatistics programs with data science courses that complement traditional training in statistical inference. In the job market, too, there is a growing demand for data scientists—a situation leading many public health graduates to take positions in companies like Facebook and Google.

In a panel discussion on ethical considerations of data science, moderator Alice Park, a health reporter at Time Magazine, pointed to a conundrum: “Data should be objective, but data science isn’t always.” This disconnect comes about due to the ways we collect data as well as how we interpret and apply it, she added. Jeff Goldsmith, associate professor of biostatistics, said Google Translate initially reproduced gender biases present in the data the company’s scientists used to train its algorithm. For example, male pronouns were paired with words like “hardworking” while female ones were linked to “lazy.” An update by Google seems to have remedied the problem. Another widely discussed ethical question has to do with our right to control the data collected about us (the European Union has instituted a “right to be forgotten” when it comes to data collected through web searches).

For these and other ethical questions, Goldsmith concluded that public health—because of its deep-seated emphasis on research rigor and social justice—is well-positioned to put data science on the right path. “Public health has a head start on a lot of these issues,” he said. “We think a lot about where our data come from and who they represent and whether or not people have consented to share their information this way. We think an awful lot about the practices and the values we want to instill in our [scholarly] community.”
