Introduction

With the growth in data capabilities and the advances in data analysis techniques over the past few decades, there has been growing interest in applying these techniques to improve the efficiency and quality of care and to transform the healthcare sector. This interest has taken the form of both private and public investment in data analysis capabilities. Over the past decade, government healthcare entities have digitized and pooled existing datasets in preparation for applying these techniques, while in the private sector, health service giants like Optum and Epic have capitalized on the technology to profit from providing personalized care. The term coined for this vast network of information is “big data,” and in order to understand the implications of big data in healthcare and its responsible use, the technology itself must first be understood.

What is Big Data?

[Figure: Big data statistics. 2.5 quintillion bytes of data created daily; 90% of the world's data created in the last 2 years; 6 exabytes of new data stored by consumers globally in 2010; the world's per capita storage doubles every 2 months; 80% of the world's data is unstructured; 300 hours of new video uploaded to YouTube per minute.]

Big data can generally be understood as data that is difficult to collect, store, or process using conventional methods. The difficulty comes not only from the sheer amount of information, but also from the complexity of the data, that is, the number of dimensions recorded per sample. From here, the three Vs of big data can be identified as the following:

  • Volume: There exists a vast amount of information to be organized.
    • Example: A Boeing 737 generates 240 terabytes of flight data in a cross-US flight.
  • Velocity: Large amounts of new data are collected and must be processed at a high rate.
    • Example: There are billions of Internet users across the world, and the Internet exchange data of those users is collected in real time.
  • Variety: Many different types of data exist (text, video, audio, etc.).
    • Example: Relevant data for a Boeing 737 flight log includes not only all of the flight sensor data, but also black box audio recordings.

To address big data problems, machine learning algorithms are often employed to process this data at a very large scale. These algorithms are essentially trainable models that can be used to predict trends and classify datasets, and that continue to evolve as the datasets grow. Because so many sectors deal with big data, these machine learning algorithms are employed in a wide range of situations.

How is Big Data Used in Health Care?

[Figure: Applications for big data in healthcare: diagnostics, preventative medicine, precision medicine, medical research, reduction of adverse medication events, cost reduction, and population health.]

In the medical profession, keeping patients healthy and preventing disease is the number one priority, and big data is of greatest interest where it serves that goal. This can be observed in everyday life as more and more people adopt wearable devices that send biometric data to be stored in the cloud. While this may initially seem to be a low-impact use case, the technology may have a much higher impact in the near future by expanding medical diagnostic services. For example, Stanford has announced that it is teaming up with Apple to evaluate whether the Apple Watch heart rate sensor can provide a preliminary sign of atrial fibrillation. In addition to improving diagnostic ability, big data techniques are often used to streamline and improve healthcare systems by finding trends that reduce operating room turnover time and by mitigating readmission through preventative medicine. On the surface, it seems that the healthcare field stands to gain from big data with relatively few drawbacks. However, big data techniques carry hidden risks, and it is precisely these risks that can become problematic.

When Have Applications of Big Data Techniques Become Problematic?

The pitfalls of big data techniques are primarily a result of bias. To understand what this means, it is useful to split big data techniques into three separate categories so that these issues are more easily addressable:

  • The data used to represent or train the model
  • The algorithms used to process the data
  • The method by which the algorithms are applied

These are best understood through case studies.

When the data itself is biased, it is usually because more affluent patients have more complete records. Patients of a lower socioeconomic status may move between hospitals because they cannot afford to stay with a single one, which leaves their records fragmented. They may also not have received care or attention of the same quality as more affluent patients, and the data record suffers as a result. If such data were used, the results would be skewed toward the more complete parts of the data, that is, toward affluent groups of people.

Biased algorithms are the easiest to spot, and are often not the problem in the healthcare field. An example is an algorithm that blatantly gives American citizens priority for care over non-American citizens. This kind of bias is the most easily avoided, and it is the most easily prosecuted when it infringes on protected classes defined by law.

The final case, in which the method of applying the algorithm is flawed, is illustrated by a rather famous example (the full research article is provided here). A healthcare algorithm developed by healthcare giant Optum prioritized predictive care for white patients over black patients. This was not because the algorithm itself was biased; in fact, the algorithm did not even include race as a feature of the data. Instead, the algorithm relied on medical costs as a stand-in for medical need, and black patients, while having the same number of chronic conditions as white patients, incurred around $2000 less in medical bills in one year. As a result, the algorithm scored white patients as higher risk than equally sick black patients. Over time, this trend became statistically significant enough to be noticed. From here, it is easy to see that applying these algorithms in a socially responsible manner requires more than just data scientists and computer scientists; an accurate picture of the social landscape is also necessary.
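The mechanism in this case study can be illustrated with a small sketch. The cohort below is entirely synthetic, and every name in it (Patient, make_patients, select_for_program, the groups "A" and "B", and the cost formula) is a hypothetical choice made for illustration; it is not Optum's algorithm or data. It assumes two groups with identical illness burdens, where group B incurs $2000 less per year at every burden level, and compares ranking patients by cost with ranking them by number of chronic conditions.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    group: str               # hypothetical group label; never used for scoring
    chronic_conditions: int  # stand-in for true health need
    annual_cost: float       # dollars billed in one year


def make_patients():
    """Build a synthetic cohort in which both groups have the same illness
    burdens, but group B incurs $2000 less per year at every burden level."""
    patients = []
    for conditions in range(10):
        base_cost = 1000.0 * conditions + 3000.0  # assumed cost formula
        patients.append(Patient("A", conditions, base_cost))
        patients.append(Patient("B", conditions, base_cost - 2000.0))
    return patients


def select_for_program(patients, k, score):
    """Return the k patients with the highest score."""
    return sorted(patients, key=score, reverse=True)[:k]


def share_of_group(selected, group):
    """Fraction of the selected patients that belong to the given group."""
    return sum(p.group == group for p in selected) / len(selected)


patients = make_patients()

# Ranking by cost: race or group is not a feature, yet group B is penalized.
by_cost = select_for_program(patients, 6, lambda p: p.annual_cost)

# Ranking by illness burden: both groups are treated alike.
by_need = select_for_program(patients, 6, lambda p: p.chronic_conditions)

print("Share of group B when ranking by cost:", share_of_group(by_cost, "B"))
print("Share of group B when ranking by need:", share_of_group(by_need, "B"))
```

Under these assumptions, ranking by cost gives group B two of the six program slots, while ranking by number of chronic conditions gives it three, even though the group label is never passed to the scoring function.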

Preventing Data Bias, Algorithm Bias, and Biased Use

The first step in preventing these problems is to be aware that faulty data can be a problem. Social trends must be examined to determine which at-risk groups are most likely to be affected by incomplete data. These groups and their corresponding faulty data must then be taken into account when developing the models.
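One way to make this awareness concrete is to measure how complete the records are for each group before any model is trained. The following is a minimal sketch under stated assumptions: the function completeness_by_group, the field names, the group labels, and the sample records are all hypothetical, and a missing value is assumed to be stored as None.

```python
def completeness_by_group(records, fields, group_key="group"):
    """Return, for each group, the fraction of the given fields that are
    filled in (not None) across all of that group's records."""
    filled = {}
    possible = {}
    for record in records:
        group = record[group_key]
        possible[group] = possible.get(group, 0) + len(fields)
        filled[group] = filled.get(group, 0) + sum(
            record.get(field) is not None for field in fields
        )
    return {group: filled[group] / possible[group] for group in possible}


# Hypothetical records; None marks a value that was never recorded.
records = [
    {"group": "higher_income", "blood_pressure": 120, "hba1c": 5.6, "last_visit": "2019-03-01"},
    {"group": "higher_income", "blood_pressure": 130, "hba1c": None, "last_visit": "2019-05-12"},
    {"group": "lower_income", "blood_pressure": None, "hba1c": None, "last_visit": "2018-11-20"},
    {"group": "lower_income", "blood_pressure": 140, "hba1c": None, "last_visit": None},
]
fields = ["blood_pressure", "hba1c", "last_visit"]

rates = completeness_by_group(records, fields)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of fields recorded")
```

A large gap between groups in a report like this suggests that a model trained on the data will reflect the better-documented group more faithfully, which is the kind of skew described in the case studies.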

Additionally, a system must be put in place to examine whether or not an algorithm is discriminatory. Algorithms must be written so that inherent systemic inequalities are taken into account in the model. For example, in the previous case where white patients were prioritized for predictive care over black patients, cost was found to be the problematic data feature, so the average costs incurred by each ethnic group can be balanced to ensure equality in the outcome of care. This regulatory system must also have sufficient analytical power to determine, based on the outcome of the model, whether or not protected classes are being discriminated against.
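An outcome-based check of this kind can be sketched as follows. This is a minimal illustration, not a legal or regulatory standard: the functions audit_by_group and selection_rate_ratio, the threshold, the group labels, and the sample records are all hypothetical. The audit compares how often each group is selected and how sick the selected patients in each group are.

```python
from collections import defaultdict


def audit_by_group(records, threshold):
    """records: iterable of (group, risk_score, chronic_conditions).
    Returns each group's selection rate and the mean number of chronic
    conditions among the patients selected from that group."""
    totals = defaultdict(int)
    selected = defaultdict(list)
    for group, score, conditions in records:
        totals[group] += 1
        if score >= threshold:
            selected[group].append(conditions)
    report = {}
    for group, count in totals.items():
        chosen = selected[group]
        report[group] = {
            "selection_rate": len(chosen) / count,
            "mean_conditions_selected": sum(chosen) / len(chosen) if chosen else None,
        }
    return report


def selection_rate_ratio(report, group, reference):
    """Selection rate of `group` divided by that of `reference`."""
    reference_rate = report[reference]["selection_rate"]
    if reference_rate == 0:
        return None
    return report[group]["selection_rate"] / reference_rate


# Hypothetical model output: (group, risk score, chronic conditions).
records = [
    ("A", 0.9, 5), ("A", 0.8, 4), ("A", 0.7, 3), ("A", 0.4, 1),
    ("B", 0.9, 6), ("B", 0.6, 5), ("B", 0.5, 4), ("B", 0.3, 1),
]

report = audit_by_group(records, threshold=0.7)
ratio = selection_rate_ratio(report, "B", "A")
print(report)
print("Selection rate ratio (B relative to A):", ratio)
```

In this made-up data, group B is selected one third as often as group A, and the one selected group B patient has more chronic conditions than the group A average. Either signal alone would be a reason to review which data features drive the score.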

The analytical power of the regulatory system also falls within the third category, responsible algorithm use. This monitoring system must be not only reactive but also proactive, watching for social trends and disparities between groups as they emerge, and adjusting the algorithm and the choice of data to make sure these problems are not exacerbated.

This is the main challenge of the information age: with the vast amount and variety of data, these data analysis methods can become black boxes whose problematic outcomes may only become apparent once they are very serious. As a result, while big data techniques promise to revolutionize the way we approach difficult modern problems, they also carry the risk of worsening societal tensions. It is our hope that with the measures above, these harmful effects can be mitigated and that society can profit from the powerful new tools available to us for solving the problems of today and tomorrow.