Small Data

September 29, 2015

The buzz around “Big Data” started about five years ago. Its first Wikipedia entry, from 2010, notes that “new developments and press attention make this a notable concept.” Since then, the phrase, loaded with the hopes of harnessing ever-growing rivers of data from fitness trackers to brain scans, has swept through the cultural conversation, becoming a darling of developers, marketers, statisticians, and public health professionals.

But with all eyes on big data, an equally important concept has gotten lost in the shuffle: small data. Jeff Goldsmith, assistant professor of Biostatistics at the Mailman School, has done substantial research on both ends of the spectrum. Transmission sat down with him to understand why bigger isn’t always better.

What’s your beef with big data?

The sheer quantity of information [from big data projects] can make it very tricky for statisticians to separate a true signal from noise. In some of the hallmark big data subjects like neuroimaging and genomics, there is so much to sort through. You have to be very specific about exactly what you are asking of the data, and whether or not you have the right statistical tools to find the answers.

How does small data come into play?

Generally, small data comes from a relatively small number of subjects who are measured on a small number of variables. The useful thing isn’t necessarily whether it’s big or small, but whether the data is answering your question. So, in cases where people are collecting small data sets, know what they want to ask and what they want to learn, and have the data that allows them to do so, small is wonderful.

Are there examples of how small data is used in public health?

Sure. If you look at observational studies about physical activity and nutrition, the effect of chemotherapy on cognition, alternative treatments for opioid dependence, or even the accuracy of a Fitbit in determining step counts (a single number that provides a narrow piece of the overall big picture), these are all small data projects. Another common example is pilot-phase clinical trials where you are measuring small numbers of people on a few important variables, randomizing people into two treatment groups, and measuring pre-specified outcomes. Small data applies to almost every area of public health.

Is there room for both big and small?

Some questions, like the connection between genes and the environment, or biomarkers and Alzheimer’s, simply cannot be addressed without big data. But taken by itself, big data doesn’t guarantee certainty; it simply poses more complex challenges. So big and small data can coexist, but it boils down to the fact that both must be used to answer the right scientific questions.

Where do we go from here?

There’s a bit of an arms race right now: who has the biggest data, and how is the definition of big data changing with all this massive growth? For me, the idea that more data translates into more knowledge is not necessarily true. The current infatuation is here to stay for sure, but that doesn’t mean we have to forget about the stuff we already know. Small data will continue to be useful for many years.