AI and healthcare: Overcoming common challenges
The intersection of AI and healthcare
It’s a safe bet that you’ve interacted with artificial intelligence and machine learning (AI/ML) in some way today, whether you’ve seen a personalized ad or asked your smart assistant for the weather forecast. But while AI is all around us, perhaps the most impactful and human-centric industry being revolutionized by it is healthcare and life sciences (HCLS).
The underlying AI/ML tools, technologies, and methodologies used by HCLS and other industries are generally the same. While ecommerce has personalized ads, HCLS has personalized medicine. While the finance industry uses anomaly detection to identify fraud, physicians leverage similar methods to diagnose early-stage cancer.
The commonalities shared across these applications begin with the mathematical models that learn the relationships in the data used to train them, continue with the computational architecture (i.e., “the cloud”) that powers the models, and end with the software written to bridge the gap between the two. This consistency offers myriad advantages, but perhaps most importantly, it paves the way for novel AI/ML applications outside of research institutions and giant technology corporations.
One shining example is Riya Bhatia’s research on real-time detection of fibromyalgia. Riya, a high school senior and participant in AI4ALL’s Changemakers in AI program, uses machine learning and computer vision techniques to identify facial expressions while concurrently measuring the stiffness of a patient’s skin via a special glove to predict a fibromyalgia diagnosis. This novel approach to non-invasively diagnosing a pain disorder—a particularly difficult task given the complexity of measuring the intensity of pain and its subjective nature—highlights the opportunities for AI/ML to enact real, positive change.
Regulatory barriers
The HCLS industry faces unique challenges compared to other industries, especially the many regulations that surround sensitive healthcare data. Inspiring examples like Riya’s tend to obscure these issues.
Accessing data, a prerequisite for building AI/ML applications, is often the most difficult and time-consuming step of the development process, and every system that interacts with the data must maintain heightened levels of security. While healthcare data regulations are necessary and serve as a reliable safeguard against the misuse of sensitive data, adhering to them takes time, effort, and money. Even more work is required to sufficiently anonymize data so it can be made publicly available for the benefit of the entire HCLS community.
HCLS data complexity
Taking the regulatory aspects out of the equation, healthcare data also differs in complexity compared to other industries. You’ve likely had to perform the somewhat-annoying exercise presented by CAPTCHAs to verify that you’re a human (aside: a testament to how advanced AI solutions have become). Identifying “which picture contains a bridge” is a simple task for anyone with relatively good eyesight, but if the activity were changed to something healthcare-related, like identifying “which MRI depicts a malignant tumor,” those of us without medical degrees and years of experience specializing in cancer diagnosis would fail the test and be wrongfully flagged as bots.
This example illustrates how much more complex healthcare data tends to be than data in other industries. Self-driving cars can reliably draw on millions of accurately labeled bridge images to enhance the AI systems that support them; HCLS, in stark contrast, relies on a handful of subject matter experts to generate high-quality data labels, such as marking MRI or CT scans as “malignant” or “benign.” These labels are particularly important because they dictate what the machine learning model will predict. After all, an AI’s intelligence stems from the data it’s trained on.
Creative solutions to data scarcity
The effects of strict regulations and the inability to generate accurate data labels at scale manifest as “data scarcity”: a deficit of high-quality (i.e., accurately and reliably labeled) data suitable for AI/ML purposes. Downstream, this translates into additional hurdles for researchers and developers like Riya. Ideally, facial expressions of fibromyalgia pain, coupled with skin stiffness measurements and patient-reported pain scores, would form the data foundation for her solution. Unfortunately, an accurately labeled dataset with these components simply doesn’t exist (at least not publicly), so she took matters into her own hands: she combined components from various datasets, collected anonymous data, and, with a little creativity and engineering, trained her ML model.
Similarly, researchers and industry leaders in HCLS tend to rely on open-source, public data that somewhat—but almost never perfectly—aligns with their existing data to act as a supplementary source of information for their AI/ML applications. In fact, there are several initiatives and organizations, such as the Cancer Genome Atlas, that are committed to fighting data scarcity in this space by publicly hosting anonymized medical data.
Representation for all
Unfortunately, historically marginalized communities tend to bear the brunt of the “data scarcity” phenomenon. They’re routinely underrepresented in datasets, which dampens the effectiveness of AI/ML solutions for their communities and, in some cases, translates into biased AI systems, through no fault of the ML models powering them. In practice, this looks like lopsided breast cancer survival rates for women of color, inaccuracies in identifying genomic substitutions or deletions across different DNA sequences, and models that fail to detect certain body types or skin tones, to name a few examples. For her fibromyalgia AI/ML solution, Riya went to great lengths to create data herself to supplement the underrepresented skin tones in her publicly sourced data.
This issue is prevalent not only in HCLS but across industries, datasets, and institutions, sometimes with devastating consequences. It’s important to note that even when an AI solution doesn’t intervene in life-and-death scenarios, it can still both directly and indirectly affect people’s health and well-being, especially over time.
Regardless of industry, organization, or use case, bolstering the representation of these communities in datasets used to power AI can and should be at the top of every AI practitioner’s mind. AI truly is a technology for all, and our data should reflect that.
Challenges in data size
Setting the “data scarcity” issue aside for a moment, let’s imagine Riya had all the CT scans, digitized biopsies, MRIs, and electronic medical records she could hope for, all of it accurately labeled. The individual data “points” themselves still present a problem that tends to be unique to HCLS: their size.
Consider a digital image of a tumor biopsy. Effective analysis demands a high level of detail: in the physical world, a physician would study the biopsy under a powerful microscope, and in the digital world, that detail translates into massive file sizes, typically tens of gigabytes per image. To put that in perspective, one biopsy image is roughly three times larger than an hour-long TV show in HD.
When you consider the number of data points typically required to train a reliable AI/ML model, the sheer storage requirement becomes staggering. As if that weren’t enough, transporting and processing such large files further compounds the computing power, time, and effort required to build these AI/ML solutions.
Accounting for massive file sizes
AI/ML practitioners combat this through creative approaches to data preprocessing, modeling, and computing architecture. For example, large images are routinely broken into numerous smaller images in a process called “tiling.” This reduces the computational strain downstream while providing the opportunity to filter out non-informative regions of the image, as the sketch below illustrates.
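Here is a minimal illustration of tiling in Python. It assumes the biopsy image has already been decoded into a NumPy array (in practice, whole-slide images are usually read region by region with specialized libraries rather than loaded into memory whole), and the function name, tile size, and brightness threshold are placeholder choices for illustration, not details from Riya’s project.

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 512, min_tissue_fraction: float = 0.1):
    """Split a large RGB image of shape (height, width, 3) into square tiles,
    keeping only tiles that appear to contain tissue rather than blank background."""
    tiles = []
    height, width = image.shape[:2]
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tile = image[top:top + tile_size, left:left + tile_size]
            # Slide backgrounds are close to white, so treat darker pixels as tissue.
            tissue_fraction = np.mean(tile.mean(axis=-1) < 220)
            if tissue_fraction >= min_tissue_fraction:
                tiles.append(((top, left), tile))
    return tiles
```

Keeping the (top, left) coordinates alongside each tile makes it possible to map a model’s tile-level predictions back onto the original image later.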
On the modeling side of things, semi-supervised learning, a machine learning paradigm that works with “incomplete” data labels (as opposed to standard supervised learning, where each data point has its own label), pairs well with the tiling strategy and also helps counteract the data scarcity phenomenon. On the engineering side, distributed computing with graphics processing units (GPUs) shortens training time by parceling out data to additional machines that run the training procedure in parallel and then communicate what they “learned” back to the machine that orchestrates the process and, at the end of the day, generates predictions. One simple semi-supervised technique, pseudo-labeling, is sketched below.
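To make the semi-supervised idea concrete, here is a pseudo-labeling sketch in PyTorch: the current model scores unlabeled tiles, and only its high-confidence predictions are kept as extra training examples for the next round. The model, data loader, and 0.95 confidence threshold are placeholders for illustration, not details of any particular system.

```python
import torch
import torch.nn.functional as F

def pseudo_label(model, unlabeled_loader, confidence=0.95, device="cpu"):
    """Collect (tile, predicted_label) pairs the model is confident about,
    to be mixed into the labeled training set on the next training pass."""
    model.eval()
    pseudo_examples = []
    with torch.no_grad():
        for tiles in unlabeled_loader:              # each batch has shape (N, C, H, W)
            probs = F.softmax(model(tiles.to(device)), dim=1)
            scores, labels = probs.max(dim=1)
            keep = scores >= confidence             # discard uncertain predictions
            pseudo_examples.extend(zip(tiles[keep].cpu(), labels[keep].cpu()))
    return pseudo_examples
```

In a distributed setup, the same training loop simply runs on several GPU-equipped machines at once, each on its own slice of the tiles, with their updates synchronized as training proceeds.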
Creating a healthier tomorrow
By leveraging open-source data; ensuring accurate, inclusive representation in training data; utilizing novel data-manipulation strategies; and harnessing the cloud’s power of parallelization on GPUs, future leaders like Riya can continue building solutions to critical, complex problems and help create a healthier tomorrow.
This blog post was originally published here.