Big data + small data in biotech
Data modeling can help biotech and pharma companies evaluate compounds, typically near the top of a “pipeline” or “funnel” of assays used to determine a compound’s suitability as a drug. It can also be used to design biomarkers once outcome (or “efficacy”) data for a compound has been generated. The promise of big data is that it will make these analyses more efficient, and therefore improve the quality of the drugs and biomarkers being developed. “Big data” is a buzzword meant to convey the data’s large Volume, Velocity, and Variety compared to traditional data. These are familiar concepts to data scientists in general, so I’ll approach the topic from that angle.
Volume
Big data volume in biotech is large. For example, on the Broad Institute website, today (July 2018), they note, “Broad Institute researchers generate on the order of 20 terabytes (roughly equivalent to more than 6.6 billion tweets or 3,300 high definition feature-length movies) of sequence data every day.” Various other institutes generate sequencing data on similar scales, and much of this data is public. Beyond R&D sequencing data, assay groups typically run high-throughput screens, and the data output by mass spectrometers and high-content imaging systems becomes overwhelming very quickly. Operations teams are also increasingly deploying IoT devices to monitor lab hardware. All of this adds up to a large volume of data.
Velocity
Big data velocity in biotech is usually low. When people traditionally talk about big data velocity, they are thinking of capturing tweets or likes or clicks, where tradeoffs are made at the data capture stage to handle the throughput (relational database inserts will not keep up). In biotech, large volumes of data are generated in chunks, perhaps every day or every hour, but not every millisecond. The Broad example above is not typical for a biotech company, which may be using sequencing data at that scale but not generating it at that rate. This low velocity turns out to have huge implications for the Volume point above: because Velocity is low, you have the time and processing power to turn the big Volume of data into a much smaller amount of metadata for storage and downstream analysis. For example, RNAseq data can be processed into a single expression value per gene (roughly 20K values for one human sample) or a single expression value per transcript (roughly 200K per human sample). These numbers are very small and easy to store and visualize efficiently.
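To make that concrete, here is a minimal sketch (Python with pandas) of collapsing transcript-level quantification into per-gene values. The file name and column names (transcript_id, gene_id, tpm) are assumptions about the upstream pipeline’s output, not a standard.

```python
import pandas as pd

# A minimal sketch: collapse transcript-level quantification into one
# expression value per gene. The column names here (transcript_id, gene_id,
# tpm) are assumptions about the upstream pipeline's output format.
quant = pd.read_csv("sample_quant.tsv", sep="\t")

# ~200K transcript rows become ~20K gene rows: sum TPM across each gene's transcripts.
gene_expression = (
    quant.groupby("gene_id", as_index=False)["tpm"]
    .sum()
    .rename(columns={"tpm": "gene_tpm"})
)

# This small table is what gets stored and analyzed downstream, not the raw reads.
gene_expression.to_csv("sample_gene_expression.csv", index=False)
```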
Variety
Big data variety in biotech is really the kicker. Data comes in a huge variety of formats, and is often noisy. This data must be polished into something usable, or, frankly, ignored! Some datasets cannot be compared, for example if there have been slight changes in assay conditions or in pipeline processing code. This means that for a biotech organization to become “data driven”, it must think about big data analysis well in advance of starting experimental work. I have found the best technique is to plan for a simple data storage solution as each new assay is coming online.
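As a rough illustration of what a “simple data storage solution” can look like, here is a sketch of normalizing every assay’s output into one shared long-format table. The column names and the example values are purely illustrative of the kind of fields you might agree on up front.

```python
import pandas as pd
from pathlib import Path

# A sketch of one simple convention: every assay, whatever its raw output
# looks like, gets normalized into the same long ("tidy") table.
# These column names are illustrative, not a standard.
COLUMNS = ["sample_id", "assay", "analyte", "value", "units", "run_date", "notes"]

def append_results(rows, store=Path("assay_results.csv")):
    """Append normalized assay rows (a list of dicts) to a shared results file."""
    df = pd.DataFrame(rows, columns=COLUMNS)
    df.to_csv(store, mode="a", header=not store.exists(), index=False)

# Example: a hypothetical qPCR-style readout reduced to the common shape.
append_results([
    {"sample_id": "S001", "assay": "rna_panel_v1", "analyte": "TP53",
     "value": 12.4, "units": "dCt", "run_date": "2018-07-10", "notes": ""},
])
```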
This article from McKinsey also gives a great overview of how big data can be used in biotech. As they note, “Having data that are consistent, reliable, and well linked is one of the biggest challenges facing pharmaceutical R&D.” More on that in another post.
So, some data in biotech is indeed big. However, much of the data in biotech is quite small, and the primary challenge is again variety and keeping that data organized.
For example, a typical exploratory assay to assess a patient sample might look at RNA or protein expression values for a panel of genes. You might still be tweaking the assay, perhaps deciding which genes to include. With only ~100 values to store per sample, it is tempting to store these in Excel sheets or Google Sheets or .csv files, and… I’m going to commit informatics heresy here and say that’s probably exactly what you should do for now, until you need to scale. Why?
Small data doesn’t necessarily require a LIMS system
When you set up a LIMS system, you put effort into designing the automation: for example, picking the set of genes that should have values reported. When your assay is changing frequently, this effort is wasted. The primary benefit of the system at these early stages is to enforce certain requirements on your scientists, which I feel can be achieved much more cheaply and easily by just asking them!
You can be just as organized in a spreadsheet
The really important thing you can do in preparation for scaling your operation later is to be organized. For example, ask your scientists to all set up their plates in the same format, e.g. with the control wells in the same place. Ask them to record the same information about how each experiment was run; more information is better than less. Create fields for everything you can think of (what cell line was used, what protocol, etc.), and create a Notes field to capture the things you didn’t think of. As the experiment evolves, you can pull new details out of Notes into separate required fields.
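If you want a little automation without a LIMS, a short script can check that each exported spreadsheet follows the agreed layout. A sketch, with example field names that are assumptions rather than a standard:

```python
import pandas as pd

# A lightweight check that a spreadsheet export follows the agreed layout.
# The required field names below are examples, not a standard.
REQUIRED_FIELDS = ["sample_id", "cell_line", "protocol", "operator", "run_date", "notes"]

def check_layout(path):
    df = pd.read_csv(path)
    missing = [f for f in REQUIRED_FIELDS if f not in df.columns]
    empty = [f for f in REQUIRED_FIELDS if f in df.columns and df[f].isna().all()]
    if missing or empty:
        print(f"{path}: missing columns {missing}, empty columns {empty}")
    return not missing and not empty

check_layout("plate_42_results.csv")
```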
To motivate scientists to follow the spreadsheet organization system, first remind them that it is a flexible alternative to a LIMS system, which would be even stricter about what they can and cannot do. Additionally, remind them that if their data is not formatted correctly, it may not be included in future automated analyses. For many scientists, these arguments are not sufficient motivation, because they view all of their work as exploratory or one-off. It’s up to any given organization how they want to approach this, but in my view, if data is generated but not stored in a reliable long-term fashion, it’s like the tree falling in the forest that nobody hears…
Spreadsheets can be backed up
Not much else to say about that! Backups are very important for any data you care about.
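Even a tiny scheduled script that copies the shared spreadsheet export into a dated folder counts as a start; a sketch along these lines, with placeholder paths (in practice you would also copy to off-site or cloud storage):

```python
import shutil
from datetime import datetime
from pathlib import Path

# A minimal backup sketch: copy the shared spreadsheet export into a dated
# folder. Paths are placeholders; run this on a schedule (e.g. nightly).
SOURCE = Path("assay_results.csv")
BACKUP_DIR = Path("backups") / datetime.now().strftime("%Y-%m-%d")

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
shutil.copy2(SOURCE, BACKUP_DIR / SOURCE.name)
```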
Your spreadsheets can be easily linked or migrated to databases or LIMS in the future
As your dataset grows, your software dev team will likely want to start treating your data as a database: for example, creating web visualizations so your team can look at the data over time, or making comparisons across datasets. Because the spreadsheet can be edited by just about anyone, and nobody wants a tiny spreadsheet edit to break an important web dashboard, the spreadsheet is typically not used as the direct data feed; instead, a copy of the important data is made in a database. This copy is continuously updated, and code validates the data as it moves from the spreadsheet to the database, to ensure its formatting won’t break anything. The scientists can be notified when changes they make to the spreadsheet are not in the expected format.
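Here is a rough sketch of that copy-and-validate step, using pandas and SQLite as stand-ins for whatever your dev team actually uses; the table and column names are made up for illustration.

```python
import sqlite3
import pandas as pd

# A sketch of the copy-and-validate step described above. The shared
# spreadsheet (exported as CSV here) is validated, then loaded into a small
# SQLite database that dashboards read from, so a stray edit can't silently
# break them. The table and column names are made up for illustration.
REQUIRED = ["sample_id", "gene", "expression"]

def sync(csv_path="assay_results.csv", db_path="assay_copy.db"):
    df = pd.read_csv(csv_path)
    problems = []

    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]  # nothing safe to load

    # Flag rows whose expression value isn't numeric, rather than loading them.
    values = pd.to_numeric(df["expression"], errors="coerce")
    for idx in df.index[values.isna()]:
        problems.append(f"row {idx}: expression {df.loc[idx, 'expression']!r} is not numeric")

    good = df.loc[values.notna(), REQUIRED].copy()
    good["expression"] = values[values.notna()]

    with sqlite3.connect(db_path) as conn:
        good.to_sql("assay_results", conn, if_exists="replace", index=False)

    return problems  # e.g. email these back to the scientists

issues = sync()
```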
Following this path makes it easy to migrate to a full LIMS system in the future
When you reach the stage where your assay is production-scale and you have outgrown the spreadsheet model (for example, it’s too much work to update, or too many people are editing it), the next step is to standardize the data input to the database. Scientists could enter data in a web form (or LIMS system form), or data could be automatically slurped from the instrument. Either way, the spreadsheet can be abandoned at this point, as the data is immediately validated and entered into the production database.
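For the web-form path, here is a sketch of what the entry point might look like, using Flask and SQLite purely as placeholders for your actual web framework and production database; the route and field names are illustrative.

```python
import sqlite3
from flask import Flask, request, jsonify

# A sketch of the web-form path, using Flask and SQLite as stand-ins for
# whatever web framework and production database you actually run.
# It assumes an assay_results table already exists; field names are illustrative.
app = Flask(__name__)
DB_PATH = "production.db"

@app.route("/assay-results", methods=["POST"])
def submit_result():
    payload = request.get_json(force=True)
    try:
        sample_id = str(payload["sample_id"])
        gene = str(payload["gene"])
        expression = float(payload["expression"])
    except (KeyError, TypeError, ValueError) as err:
        # Validation happens at entry time, so bad data never reaches the database.
        return jsonify({"error": f"invalid submission: {err}"}), 400

    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO assay_results (sample_id, gene, expression) VALUES (?, ?, ?)",
            (sample_id, gene, expression),
        )
    return jsonify({"status": "ok"}), 201
```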