NIH Associate Director for Data Science Discusses Opportunities and Challenges of Data Science
At the September 1 meeting of the National Institutes of Health (NIH) Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI) Council of Councils, Associate Director for Data Science (ADDS) Philip Bourne discussed the opportunities and challenges of data science. Bourne began by asking: “What are we going to do with our data?” He explained that, from his point of view, science is at a point of significant change as a consequence of the amount of data being generated. Biomedical research is becoming more analytical, and scientific change is happening faster than anticipated as a result of the faster accumulation of data. He suggested that one could now argue that biomedical research is at the point of digitization, noting that preclinical research has been digital for quite some time. The convergence of the use of the electronic health record (EHR) represents a further drive toward the digitization of science. Consequently, he stressed the need to assume that there will be some sort of “disruption” going forward, that is, a change in the way science is conducted as a result of digitization. Accordingly, NIH needs to begin thinking about how it will deal with any such disruption, said Bourne.
Bourne explained that a scientific motivator for addressing the opportunities and challenges around data science is President Obama’s Precision Medicine Initiative (PMI). It is an example of an effort that will bring forth large amounts of clinical data, through PMI’s call for the creation of a National Research Cohort of more than one million U.S. volunteers, drawn from existing cohorts, many of them funded by NIH, along with newly recruited volunteers (see Update, September 22, 2015). There is a need to make maximum use of these scientific tools, he explained. The NIH’s response is the creation of the Office of Data Science (ODS), with the mission to “use data science to foster an open digital ecosystem that will accelerate efficient, cost effective biomedical and behavioral research to enhance health, lengthen life, and reduce illness and disability.” The office is currently engaged in fulfilling the recommendations of the Advisory Committee to the NIH Director Working Group on Data and Informatics (see Update, January 14, 2013).
Bourne further stressed the need for NIH-supported data to be findable, accessible, interoperable, and reusable, or “FAIR.” Accordingly, the NIH has provided supplements to researchers to examine the interoperability of data. An added challenge is bringing in individuals who are currently not part of the community, including statisticians and the general public, among others. Bourne noted that collected data is currently used to generate publications and often atrophies afterwards: after publication, 88 percent of data is no longer available. To address this and other data-related issues, the NIH is investing $100 million a year in the Big Data to Knowledge (BD2K) Initiative. To date, 12 BD2K centers have been funded; 20 percent of each center’s budget goes toward training, to address the demand for individuals able to conduct this science.
ODS’s strategy addresses three components: infrastructure, policies, and communities. ODS is funding the indexing of previously generated data to make it findable and usable, Bourne reported. It is also supporting the indexing of software, standards, and other items referred to as research elements (e.g., publications, research papers, and course materials). These activities are being carried out in what Bourne described as the Commons, which is designed to serve as an underpinning of the digital ecosystem. The Commons is being used to package and identify content. He pointed out, however, that NIH is not building a massive infrastructure but is essentially using existing infrastructure and adding rules to tie it all together. He ventured the possibility that the data could be stored on public clouds, pointing out that the cloud is where NIH is focusing its first efforts, along with efforts to aggregate the data. This data, exposed at a level that can be indexed, includes patient data, said Bourne. Additionally, the Commons is being used to publicize that the data exists. Pilots are being run to check the models. Labs outside of NIH are involved, Bourne said, adding that it is easy to become Commons compliant.
Bourne also emphasized that constraints on the NIH’s budget, along with issues surrounding sustainability, are another factor driving NIH’s efforts. ODS is running a pilot of a different kind of funding model. He explained that currently, when NIH funds a grant, there is a line for computing costs, which does not necessarily match supply to demand. ODS is experimenting with a “credit model” in which NIH instead gives researchers a credit for computing and pays only for what is used. The credit can be used for any Commons-compliant resource and has the added effect of driving other resources into the Commons, making them indexable and accessible, he explained. It also allows NIH to measure how the data is being used. A Phase 1 pilot is being conducted at several sites to ascertain what scientists think about the potential model.