HEAL: A framework for health equity assessment of machine learning performance

Posted by Mike Schaekermann, Research Scientist, Google Research, and Ivor Horn, Chief Health Equity Officer & Director, Google Core

Health equity is a major societal concern worldwide, with disparities having many causes. These sources include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in the diagnostic technology. In dermatology, for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While there is great promise in recent advances in machine learning (ML) and artificial intelligence (AI) to help improve healthcare, this transition from research to bedside must be accompanied by a careful understanding of whether and how they impact health equity.

Health equity is defined by public health organizations as fairness of opportunity for everyone to be as healthy as possible. Importantly, equity may be different from equality. For example, people with greater barriers to improving their health may require more or different effort to experience this fair opportunity. Similarly, equity is not fairness as defined in the AI for healthcare literature. Whereas AI fairness often strives for equal performance of the AI technology across different patient populations, this does not center the goal of prioritizing performance with respect to pre-existing health disparities.

Health equity considerations. An intervention (e.g., an ML-based tool, indicated in dark blue) promotes health equity if it helps reduce existing disparities in health outcomes (indicated in lighter blue).

In “Health Equity Assessment of machine Learning performance (HEAL): a framework and dermatology AI model case study”, published in The Lancet eClinicalMedicine, we propose a methodology to quantitatively assess whether ML-based health technologies perform equitably. In other words, does the ML model perform well for those with the worst health outcomes for the condition(s) the model is meant to address? This goal anchors on the principle that health equity should prioritize and measure model performance with respect to disparate health outcomes, which may be due to a number of factors that include structural inequities (e.g., demographic, social, cultural, political, economic, environmental and geographic).

The health equity framework (HEAL)

The HEAL framework proposes a 4-step process to estimate the likelihood that an ML-based health technology performs equitably:

1. Identify factors associated with health inequities and define tool performance metrics,
2. Identify and quantify pre-existing health disparities,
3. Measure the performance of the tool for each subpopulation,
4. Measure the likelihood that the tool prioritizes performance with respect to health disparities.

The final step’s output is termed the HEAL metric, which quantifies how anticorrelated the ML model’s performance is with health disparities. In other words, does the model perform better with populations that have the worse health outcomes?
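The post does not spell out how this likelihood is computed, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that cases are bootstrap-resampled within each subpopulation and that a Spearman rank correlation is used to check whether performance rises with health burden; both choices, the function names (`ranks`, `spearman`, `heal_metric_sketch`) and all numbers are hypothetical, and the procedure in the paper may differ.

```python
import random


def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation; returns 0.0 if either input is constant."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def heal_metric_sketch(burden, hits, n_boot=2000, seed=0):
    """Fraction of bootstrap resamples in which performance rises with burden.

    burden: {subgroup: health burden, higher = worse outcomes} (assumed input)
    hits:   {subgroup: list of 0/1 per-case results, 1 = model judged correct}
    """
    rng = random.Random(seed)
    groups = sorted(burden)
    burdens = [burden[g] for g in groups]
    favourable = 0
    for _ in range(n_boot):
        # Resample cases with replacement inside each subgroup.
        perf = []
        for g in groups:
            sample = rng.choices(hits[g], k=len(hits[g]))
            perf.append(sum(sample) / len(sample))
        # Performance increasing with burden means it is anticorrelated
        # with how good the pre-existing health outcomes are.
        if spearman(burdens, perf) > 0:
            favourable += 1
    return favourable / n_boot


# Made-up illustration data: three subgroups, 200 cases each.
example_burden = {"group_1": 300.0, "group_2": 200.0, "group_3": 100.0}
example_hits = {
    "group_1": [1] * 180 + [0] * 20,
    "group_2": [1] * 140 + [0] * 60,
    "group_3": [1] * 100 + [0] * 100,
}
print(heal_metric_sketch(example_burden, example_hits))
```

In this reading, a value near 1 means that nearly every resample shows better performance for subgroups with worse outcomes, and a value near 0 means the opposite.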

This 4-step process is designed to inform improvements for making ML model performance more equitable, and is meant to be iterative and re-evaluated on a regular basis. For example, the availability of health outcomes data in step (2) can inform the choice of demographic factors and brackets in step (1), and the framework can be applied again with new datasets, models and populations.

Framework for Health Equity Assessment of machine Learning performance (HEAL). Our guiding principle is to avoid exacerbating health inequities, and these steps help us identify disparities and assess for inequitable model performance to move towards better outcomes for all.

With this work, we take a step towards encouraging explicit assessment of the health equity considerations of AI technologies, and encourage prioritization of efforts during model development to reduce health inequities for subpopulations exposed to structural inequities that can precipitate disparate outcomes. We should note that the present framework does not model causal relationships and, therefore, cannot quantify the actual impact a new technology will have on reducing health outcome disparities. However, the HEAL metric may help identify opportunities for improvement, where the current performance is not prioritized with respect to pre-existing health disparities.

Case study on a dermatology model

As an illustrative case study, we applied the framework to a dermatology model, which utilizes a convolutional neural network similar to that described in prior work. This example dermatology model was trained to classify 288 skin conditions using a development dataset of 29k cases. The input to the model consists of three photos of a skin concern along with demographic information and a brief structured medical history. The output consists of a ranked list of possible matching skin conditions.

Using the HEAL framework, we evaluated this model by assessing whether it prioritized performance with respect to pre-existing health outcomes. The model was designed to predict possible dermatologic conditions (from a list of hundreds) based on photos of a skin concern and patient metadata. Evaluation of the model is done using a top-3 agreement metric, which quantifies how often the top 3 output conditions match the most likely condition as suggested by a dermatologist panel. The HEAL metric is computed via the anticorrelation of this top-3 agreement with health outcome rankings.
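The top-3 agreement metric described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the evaluation code used in the study: the function name `top_k_agreement`, the input layout (one ranked list of condition names per case, plus one reference condition per case) and the example cases are all hypothetical.

```python
def top_k_agreement(ranked_predictions, reference_labels, k=3):
    """Fraction of cases whose reference condition appears in the top-k outputs.

    ranked_predictions: one list per case, most likely condition first
    reference_labels:   one reference condition per case (in the study, the
                        most likely condition according to a dermatologist panel)
    """
    if len(ranked_predictions) != len(reference_labels):
        raise ValueError("need exactly one reference label per case")
    if not reference_labels:
        raise ValueError("need at least one case")
    hits = sum(
        ref in preds[:k]
        for preds, ref in zip(ranked_predictions, reference_labels)
    )
    return hits / len(reference_labels)


# Made-up cases for illustration only.
predictions = [
    ["eczema", "psoriasis", "tinea"],
    ["acne", "rosacea", "folliculitis", "eczema"],
    ["melanoma", "nevus", "seborrheic keratosis"],
]
references = ["psoriasis", "eczema", "melanoma"]

# The second case misses: its reference condition is ranked fourth.
print(top_k_agreement(predictions, references))
```

Computing this value separately for each subpopulation gives the per-group performance figures that are then ranked against health outcomes.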

We used a dataset of 5,420 teledermatology cases, enriched for diversity in age, sex and race/ethnicity, to retrospectively evaluate the model’s HEAL metric. The dataset consisted of “store-and-forward” cases from patients of 20 years or older from primary care providers in the USA and skin cancer clinics in Australia. Based on a review of the literature, we decided to explore race/ethnicity, sex and age as potential factors of inequity, and used sampling techniques to ensure that our evaluation dataset had sufficient representation of all race/ethnicity, sex and age groups. To quantify pre-existing health outcomes for each subgroup we relied on measurements from public databases endorsed by the World Health Organization, such as Years of Life Lost (YLLs) and Disability-Adjusted Life Years (DALYs; years of life lost plus years lived with disability).

HEAL metric for all dermatologic conditions across race/ethnicity subpopulations, including health outcomes (YLLs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance.
(* Higher is better; measures the likelihood the model performs equitably with respect to the axes in this table.)

HEAL metric for all dermatologic conditions across sexes, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)

Our analysis estimated that the model was 80.5% likely to perform equitably across race/ethnicity subgroups and 92.1% likely to perform equitably across sexes.

However, while the model was likely to perform equitably across age groups for cancer conditions specifically, we discovered that it had room for improvement across age groups for non-cancer conditions. For example, those 70+ have the poorest health outcomes related to non-cancer skin conditions, yet the model didn't prioritize performance for this subgroup.

HEAL metrics for all cancer and non-cancer dermatologic conditions across age groups, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)

Putting things in context

For holistic evaluation, the HEAL metric cannot be employed in isolation. Instead, this metric should be contextualized alongside many other factors ranging from computational efficiency and data privacy to ethical values, and aspects that may influence the results (e.g., selection bias or differences in representativeness of the evaluation data across demographic groups).

As an adversarial example, the HEAL metric can be artificially improved by deliberately reducing model performance for the most advantaged subpopulation until performance for that subpopulation is worse than all others. For illustrative purposes, given subpopulations A and B where A has worse health outcomes than B, consider the choice between two models: Model 1 (M1) performs 5% better for subpopulation A than for subpopulation B. Model 2 (M2) performs 5% worse on subpopulation A than B. The HEAL metric would be higher for M1 because it prioritizes performance on a subpopulation with worse outcomes. However, M1 may have absolute performances of just 75% and 70% for subpopulations A and B respectively, while M2 has absolute performances of 75% and 80% for subpopulations A and B respectively. Choosing M1 over M2 would lead to worse overall performance for all subpopulations because some subpopulations are worse-off while no subpopulation is better-off.

Accordingly, the HEAL metric should be used alongside a Pareto condition (discussed further in the paper), which restricts model changes so that outcomes for each subpopulation are either unchanged or improved compared to the status quo, and performance does not worsen for any subpopulation.
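A Pareto condition of this kind can be checked mechanically. The sketch below applies it to the M1 and M2 figures from the adversarial example above; the function names `satisfies_pareto` and `performance_gap` and the `tolerance` parameter are hypothetical, and the paper may formalize the condition differently.

```python
def satisfies_pareto(status_quo, candidate, tolerance=0.0):
    """True if the candidate is no worse than the status quo for every subpopulation.

    Both arguments map subpopulation name -> performance (e.g. top-3 agreement).
    `tolerance` is an assumed allowance for measurement noise.
    """
    if status_quo.keys() != candidate.keys():
        raise ValueError("both models must report the same subpopulations")
    return all(candidate[g] >= status_quo[g] - tolerance for g in status_quo)


def performance_gap(perf, worse_off, better_off):
    """Performance for the group with worse outcomes minus the other group."""
    return perf[worse_off] - perf[better_off]


# Absolute performances from the example: A has worse health outcomes than B.
m1 = {"A": 0.75, "B": 0.70}
m2 = {"A": 0.75, "B": 0.80}

# M1 favors A (positive gap), so an anticorrelation-based metric prefers it.
print(performance_gap(m1, "A", "B"), performance_gap(m2, "A", "B"))

# Moving from M2 to M1 lowers performance for B, so it fails the condition.
print(satisfies_pareto(status_quo=m2, candidate=m1))  # False
# Moving from M1 to M2 leaves A unchanged and improves B, so it passes.
print(satisfies_pareto(status_quo=m1, candidate=m2))  # True
```

Used this way, the check acts as a gate: a model change is considered only if it passes, and the HEAL metric is then compared among the changes that remain.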

The HEAL framework, in its current form, assesses the likelihood that an ML-based model prioritizes performance for subpopulations with respect to pre-existing health disparities for specific subpopulations. This differs from the goal of understanding whether ML will reduce disparities in outcomes across subpopulations in reality. Specifically, modeling improvements in outcomes requires a causal understanding of steps in the care journey that happen both before and after use of any given model. Future research is needed to address this gap.

Conclusion

The HEAL framework enables a quantitative assessment of the likelihood that health AI technologies prioritize performance with respect to health disparities. The case study demonstrates how to apply the framework in the dermatological domain, indicating a high likelihood that model performance is prioritized with respect to health disparities across sex and race/ethnicity, but also revealing the potential for improvements for non-cancer conditions across age. The case study also illustrates limitations in the ability to apply all recommended aspects of the framework (e.g., mapping societal context, availability of data), thus highlighting the complexity of health equity considerations of ML-based tools.

This work is a proposed approach to address a grand challenge for AI and health equity, and may provide a useful evaluation framework not only during model development, but during pre-implementation and real-world monitoring stages, e.g., in the form of health equity dashboards. We hold that the strength of the HEAL framework is in its future application to various AI tools and use cases and its refinement in the process. Finally, we acknowledge that a successful approach towards understanding the impact of AI technologies on health equity needs to be more than a set of metrics. It will require a set of goals agreed upon by a community that represents those who will be most impacted by a model.

Acknowledgements

The research described here is joint work across many teams at Google. We are grateful to all our co-authors: Terry Spitz, Malcolm Pyles, Heather Cole-Lewis, Ellery Wulczyn, Stephen R. Pfohl, Donald Martin, Jr., Ronnachai Jaroensri, Geoff Keeling, Yuan Liu, Stephanie Farquhar, Qinghan Xue, Jenna Lester, Cían Hughes, Patricia Strachan, Fraser Tan, Peggy Bui, Craig H. Mermel, Lily H. Peng, Yossi Matias, Greg S. Corrado, Dale R. Webster, Sunny Virmani, Christopher Semturs, Yun Liu, and Po-Hsuan Cameron Chen. We also thank Lauren Winer, Sami Lachgar, Ting-An Lin, Aaron Loh, Morgan Du, Jenny Rizk, Renee Wong, Ashley Carrick, Preeti Singh, Annisah Um'rani, Jessica Schrouff, Alexander Brown, and Anna Iurchenko for their support of this project.