Competing With Data & Analytics
The goal of the Million Veteran Program is to understand how genes affect our health. As the project steams ahead, it is reducing the cost of research, shortening the time it takes to generate research results, and building a data infrastructure to allow for new kinds of analysis that will eventually lead to individual gene-based therapies for all veterans.
Dr. J. Michael Gaziano and Dr. Saiju Pyarajan, two of the program’s lead scientists, explained the protocols and infrastructure of the project in a conversation with Sam Ransbotham, associate professor of information systems at the Carroll School of Management at Boston College and guest editor for the Data and Analytics Big Idea Initiative for the MIT Sloan Management Review. Managing complex security and risk is perhaps their greatest challenge, which they are addressing on several fronts: All personal data collected for the study is stored behind a firewall in VA data centers at secure VA facilities; different data domains are separated using a system that prevents cross-referencing; and each participant’s data is protected with appropriate security both in motion and at rest, making it nearly impossible to identify personal data across data domains.
Gaziano, national co-principal investigator for the project, described what the research team wants to accomplish. “Our hope is to set in motion a process or a project that will allow for the exploration of very complex questions with very sophisticated data-curation tools to begin to unlock the mysteries of the health care universe.”
Would you start us off by giving us some context for the MVP initiative and explain what it is?
I can give you a 10,000-foot view of what we’re trying to do and then discuss that in the context of what others in the same space are trying to do.
Several years ago, a number of us decided that we wanted to analyze a large cohort to better understand how genes and environment affect disease. Our timing was very good because, unlike previous similar efforts, we had the capability to do genetic analysis on a large scale at a reasonable price. We also had access to an increasing amount of health care data in electronic format, which allowed us to do the research at a relatively low cost.
We decided that veterans would be a good cohort for the study. We were already getting a lot of our clinical data from the Veterans Administration (VA) and the Department of Defense (DoD). We would also be able to take advantage of the volunteerism of the veterans in donating samples and consenting to use of the molecular and clinical data, but we knew we would have to augment that. For example, we would need to get data from the participants themselves for the kind of information that isn’t in health care records, such as quality of life, depression scale, and smoking status.
At the same time that we were doing our research, President Obama announced a national research effort, the Precision Medicine Initiative, which would revolutionize the way we treat disease and improve health, and we’ve been involved in discussions about that research. The NIH and various other federal agencies like DoD are trying to do similar research, and we’ve also connected with the UK Biobank, which is an equivalent national cohort in the United Kingdom. We’ve also been working with other institutions around the country that are involved in similar types of activities. Vanderbilt [University] and Kaiser [Permanente] are two of them.
I think that we’re in a pretty good place now. We have more than 400,000 people in the study and continue to recruit. We’ve gotten questionnaires back from about three-quarters of them and biospecimens from all of them. We’re also doing some genotyping and have a few alpha and beta tests that will be resources for the study for the first time.
Why are you able to do what prior studies have not been able to do?
I’ll give you two reasons. The first is that we have technology to do multivariate modeling to better assess risk factors. This means that instead of looking at one variable and one outcome, we can look at 10, 12 different risk factors in the same model. The Framingham Heart Study developed this modeling, and it was a great leap forward. No one had ever done that before.
The second reason is that we have a large enough number of subjects. We’re trying to understand how genes work by doing genotyping on a large scale at a reasonable cost. There are 30,000 genes all interacting with each other, and there are lots of places within the genes and genome that have differences among individuals. That gives us a bigger problem, a massive signal-to-noise problem, and the only way to solve that is by having a large number of subjects so that we have better statistical confidence in our analysis results. Without the MVP-size cohort, we couldn’t have made any meaningful attempt to understand how genes affect disease.
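Editorial illustration: the link between the number of comparisons and the number of subjects can be sketched with a standard textbook approximation. This is not the program's actual analysis plan. It assumes a one-sample, two-sided z-test with a Bonferroni correction, and the function name, effect size, significance level, and power are all hypothetical choices made for illustration. Only the figure of 30,000 genes comes from the interview.

```python
import math
from statistics import NormalDist


def subjects_needed(effect_size, n_tests, alpha=0.05, power=0.8):
    """Approximate subjects needed for a one-sample, two-sided z-test.

    Hypothetical helper: the Bonferroni correction divides the overall
    significance level by the number of tests, which raises the bar each
    individual test has to clear.
    """
    z = NormalDist()
    per_test_alpha = alpha / n_tests
    z_alpha = z.inv_cdf(1 - per_test_alpha / 2)  # critical value per test
    z_power = z.inv_cdf(power)                   # quantile for the target power
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


# Assumed small standardized effect of 0.05 (illustrative only).
single_test = subjects_needed(0.05, n_tests=1)
genome_scale = subjects_needed(0.05, n_tests=30_000)
print(single_test, genome_scale)
```

Under these assumptions, testing 30,000 hypotheses at once requires several times as many subjects as testing one, and the requirement grows further as the assumed effect size shrinks.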
So right now, you’re primarily in a data collection phase. Do you have any results yet? Has any analysis been done?
The only analysis we’ve done to date is a crude, core analysis from the questionnaires. This is helping us describe what the cohort looks like in terms of self-reported diseases and self-reported lifestyle factors. We’re looking at distributions of these cohort characteristics in the population to see how they compare to other veteran and non-veteran cohorts. We’re also building seven alpha and beta test projects that are asking scientific questions, but we’ve not generated any answers yet. We’re building the systems to allow that to happen.
What kinds of difficulties beyond technical ones have you had in getting this project launched?
Where do we start? We had to make sure that all of the ethical and political stars lined up, so we did surveys of our participants and asked what they were concerned about. They wanted us to make sure that we kept their data secure and to make sure that everyone from the Secretary [of Defense] to Congress was aligned with the mission. That has gotten a lot easier now that the President has embraced the Million Veteran Program as part of his Precision Medicine Initiative.
We had to deal with all the logistical challenges of something that we’ve never built to scale before. We also had to get the permissions and access to various types of data, not only the data within the VA system but also from the National Death Index and from Medicare data because some of our subjects use Medicare.
Another major logistical challenge was maintaining the security of the data. Instead of cutting data sets and giving them out to investigators, we will be bringing investigators to our secure computing environment and creating a research sandbox for them to do their work in.
What steps are you taking to manage the data to ensure privacy and secure access?
I think that it’s an important issue, and security is one of our major priorities. Because we are collecting a large amount of information, we are operating under the premise that anybody who joins the MVP is vulnerable, and subject to all the implications that come with losing personal health information in today’s digital world.
The primary thing we are doing is putting in place multiple layers of security — the hardware layer, the software layer, and the physical layer. All our data is behind a firewall in secure VA data centers within VA facilities. We have also set up logical separations of different data domains, and none of the data domains can be cross-referenced. The crosswalks between the data domains are kept separate in an encrypted form, so even getting one piece or one domain of data doesn’t allow anybody to get the complete information.
We are trying to acknowledge that there are different levels of risk for different kinds of data and provide different levels of protection to each kind. For example, the data for personally identifiable information (PII) and personal health information (PHI) is handled very securely with very limited access. There’s a thorough governance process to go through before any researcher gets access to it.
Another way we secure data is through our Laboratory Information Management System (LIMS), where the data is identified only by the sample ID associated with the participant’s blood collection tube, which does not identify any personal information about the participant. And the crosswalk between a participant and the blood sample ID is kept separately, in a separate system on separate hardware in a separate data center, under encryption. There is absolutely no identifiable information in the LIMS.
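Editorial illustration: the separation described here, in which lab data is keyed only by a sample ID while the participant-to-sample crosswalk is held elsewhere, can be sketched as follows. This is a minimal sketch and not the VA's implementation. The class names, ID format, and record fields are hypothetical. Encryption of the crosswalk and physical separation of the two stores are omitted for brevity.

```python
import secrets


class Crosswalk:
    """Hypothetical stand-in for the separately held participant-to-sample map."""

    def __init__(self):
        self._sample_for_participant = {}

    def register(self, participant_id):
        # A random ID carries no information about the participant.
        sample_id = "S-" + secrets.token_hex(8)
        self._sample_for_participant[participant_id] = sample_id
        return sample_id

    def participant_for(self, sample_id):
        # Re-identification is possible only with access to this store.
        for pid, sid in self._sample_for_participant.items():
            if sid == sample_id:
                return pid
        return None


class Lims:
    """Hypothetical lab store that only ever sees sample IDs."""

    def __init__(self):
        self._records = {}

    def record(self, sample_id, result):
        self._records[sample_id] = result

    def dump(self):
        return dict(self._records)


crosswalk = Crosswalk()  # in practice: a different system and data center
lims = Lims()

sample_id = crosswalk.register("participant-0001")
lims.record(sample_id, {"assay": "genotype", "status": "received"})
```

In this sketch, a full copy of the lab store reveals sample IDs and assay records but nothing that names a participant. Linking a record back to a person requires the crosswalk, which is why the two are kept apart.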
How receptive have the participants been to participating?
They’ve been incredibly supportive. We’ve gotten a few negative emails out of the three million plus invitations that we’ve sent out, but by and large the response is one of overwhelming support.
Why do you think that is?
For a lot of people, the VA is more than just a health care system. It allows them to continue to be part of a very special community. We’ve heard from them that this is a second way for them to serve, and to serve their brethren. The interesting thing is that while the patients want access to their data and want to improve their health care, that’s not the primary driver of their support. The primary driver is not necessarily improving their own health but contributing on a larger scale for the benefit of others. I think that’s been one of the most rewarding things about being part of this project.
It’s been 70 years since the Framingham study. Do you think this project will accelerate the time it takes to get results?
Framingham has been generating information for 60 of those 70 years. It took them a little while to get the answers they needed. We’re hoping to begin generating answers to what we think are important questions and accelerate that process.
One final question. What do you see in the future for this project in particular and for health care in general?
Well, I think that we’re moving into a space where we need to move. We in health care have been a little bit behind the leading edge when it comes to computing on “big data.” Our hope is to set in motion a process or a project that will allow for the exploration of very complex questions with very sophisticated data-curation and data-processing tools to begin to unlock the mysteries of the health care universe.
There’s one message that we always like to project. This effort is a thing with many heads, with many partners, and among the most important are our veterans, who are helping us build this enterprise. Without the veterans as willing partners, none of what we are envisioning now to improve health care for our generation and the next generation will ever come to fruition.