SCIN: A new resource for representative dermatology images


Health datasets play a crucial role in research and medical education, but it can be challenging to create a dataset that represents the real world. For example, dermatology conditions are diverse in their appearance and severity and manifest differently across skin tones. Yet existing dermatology image datasets often lack representation of everyday conditions (like rashes, allergies and infections) and skew towards lighter skin tones. Furthermore, race and ethnicity information is often missing, hindering our ability to assess disparities or create solutions.

To address these limitations, we are releasing the Skin Condition Image Network (SCIN) dataset in collaboration with physicians at Stanford Medicine. We designed SCIN to reflect the broad range of concerns that people search for online, supplementing the types of conditions typically found in clinical datasets. It contains images across various skin tones and body parts, helping to ensure that future AI tools work effectively for all. We've made the SCIN dataset freely available as an open-access resource for researchers, educators, and developers, and have taken careful steps to protect contributor privacy.

Example set of images and metadata from the SCIN dataset.

Dataset composition

The SCIN dataset currently contains over 10,000 images of skin, nail, or hair conditions, directly contributed by individuals experiencing them. All contributions were made voluntarily with informed consent by individuals in the US, under a study approved by an institutional review board. To provide context for retrospective dermatologist labeling, contributors were asked to take images both close-up and from slightly further away. They were given the option to self-report demographic information and tanning propensity (self-reported Fitzpatrick Skin Type, i.e., sFST), and to describe the texture, duration and symptoms related to their concern.

One to three dermatologists labeled each contribution with up to five dermatology conditions, along with a confidence score for each label. The SCIN dataset contains these individual labels, as well as an aggregated and weighted differential diagnosis derived from them that could be useful for model testing or training. These labels were assigned retrospectively and are not equivalent to a clinical diagnosis, but they allow us to compare the distribution of dermatology conditions in the SCIN dataset with existing datasets.
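As a rough illustration of how per-dermatologist labels might be combined into a single weighted differential, the sketch below normalizes each dermatologist's confidence scores so they sum to 1 and then averages across dermatologists. This weighting scheme is an assumption made for illustration only; the text above does not specify how the released aggregate is computed, and the function name, input layout, and example conditions are all hypothetical.

```python
from collections import defaultdict


def aggregate_differential(labels_per_dermatologist):
    """Combine per-dermatologist labels into one weighted differential.

    labels_per_dermatologist: list of dicts, one per dermatologist, mapping
    a condition name to that dermatologist's confidence score (assumed 1-5).
    Returns a dict mapping condition -> weight, ordered by descending weight.

    NOTE: hypothetical scheme for illustration, not the dataset's own method.
    """
    totals = defaultdict(float)
    raters = 0
    for labels in labels_per_dermatologist:
        confidence_sum = sum(labels.values())
        if confidence_sum <= 0:
            continue  # skip dermatologists who gave no usable label
        raters += 1
        # Each dermatologist contributes a total weight of 1, split across
        # their conditions in proportion to the stated confidence.
        for condition, confidence in labels.items():
            totals[condition] += confidence / confidence_sum
    if raters == 0:
        return {}
    averaged = {condition: weight / raters for condition, weight in totals.items()}
    return dict(sorted(averaged.items(), key=lambda item: item[1], reverse=True))


# Made-up labels from three dermatologists for one contribution.
example = [
    {"Eczema": 4, "Contact dermatitis": 2},
    {"Eczema": 5},
    {"Psoriasis": 3, "Eczema": 3},
]
for condition, weight in aggregate_differential(example).items():
    print(f"{condition}: {weight:.2f}")
```

Because every dermatologist contributes the same total weight, a contribution labeled by three dermatologists is not dominated by whoever listed the most conditions. Other reasonable choices, such as rank-based weights, would give different numbers.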

The SCIN dataset contains largely allergic, inflammatory and infectious conditions, while datasets from clinical sources focus on benign and malignant neoplasms.

While many existing dermatology datasets focus on malignant and benign tumors and are intended to assist with skin cancer diagnosis, the SCIN dataset consists largely of common allergic, inflammatory, and infectious conditions. The majority of images in the SCIN dataset show early-stage concerns: more than half arose less than a week before the photo, and 30% arose less than a day before the image was taken. Conditions within this time window are seldom seen within the health system and are therefore underrepresented in existing dermatology datasets.

We also obtained dermatologist estimates of Fitzpatrick Skin Type (estimated FST or eFST) and layperson labeler estimates of Monk Skin Tone (eMST) for the images. This allowed comparison of the skin condition and skin type distributions to those in existing dermatology datasets. Although we did not selectively target any skin types or skin tones, the SCIN dataset has a balanced Fitzpatrick skin type distribution (with more of Types 3, 4, 5, and 6) compared to similar datasets from clinical sources.

Self-reported and dermatologist-estimated Fitzpatrick Skin Type distribution in the SCIN dataset compared with existing un-enriched dermatology datasets (Fitzpatrick17k, PH², SKINL2, and PAD-UFES-20).

The Fitzpatrick Skin Type scale was originally developed as a photo-typing scale to measure the response of skin types to UV radiation, and it is widely used in dermatology research. The Monk Skin Tone scale is a newer 10-shade scale that measures skin tone rather than skin phototype, capturing more nuanced differences between the darker skin tones. While neither scale was intended for retrospective estimation using images, the inclusion of these labels is intended to enable future research into skin type and tone representation in dermatology. For example, the SCIN dataset provides an initial benchmark for the distribution of these skin types and tones in the US population.

The SCIN dataset has a high representation of women and younger individuals, likely reflecting a combination of factors. These could include differences in skin condition incidence, propensity to seek health information online, and variations in willingness to contribute to research across demographics.


Crowdsourcing method

To create the SCIN dataset, we used a novel crowdsourcing method, which we describe in the accompanying research paper co-authored with investigators at Stanford Medicine. This approach empowers individuals to play an active role in healthcare research. It allows us to reach people at earlier stages of their health concerns, potentially before they seek formal care. Crucially, this method uses advertisements on web search result pages, the starting point for many people's health journey, to connect with participants.

Our results show that crowdsourcing can yield a high-quality dataset with a low spam rate. Over 97.5% of contributions were genuine images of skin conditions. After performing further filtering steps to exclude images that were out of scope for the SCIN dataset and to remove duplicates, we were able to release nearly 90% of the contributions received over the 8-month study period. Most images were sharp and well-exposed. Approximately half of the contributions include self-reported demographics, and 80% contain self-reported information relating to the skin condition, such as texture, duration, or other symptoms. We found that dermatologists' ability to retrospectively assign a differential diagnosis depended more on the availability of self-reported information than on image quality.

Dermatologist confidence in their labels (scale from 1-5) depended on the availability of self-reported demographic and symptom information.

While perfect image de-identification can never be guaranteed, protecting the privacy of individuals who contributed their images was a top priority when creating the SCIN dataset. Through informed consent, contributors were made aware of potential re-identification risks and advised to avoid uploading images with identifying features. Post-submission privacy protection measures included manual redaction or cropping to exclude potentially identifying areas, reverse image searches to exclude publicly available copies, and metadata removal or aggregation. The SCIN Data Use License prohibits attempts to re-identify contributors.

We hope the SCIN dataset will be a helpful resource for those working to advance inclusive dermatology research, education, and AI tool development. By demonstrating an alternative to traditional dataset creation methods, SCIN paves the way for more representative datasets in areas where self-reported data or retrospective labeling is feasible.


Acknowledgements

We are grateful to all our co-authors Abbi Ward, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, Pradeep Kumar S, Tiya Tiyasirisokchai, Sunny Virmani, Renee Wong, Yossi Matias, Greg S. Corrado, Dale R. Webster, Dawn Siegel (Stanford Medicine), Steven Lin (Stanford Medicine), Justin Ko (Stanford Medicine), Alan Karthikesalingam and Christopher Semturs. We also thank Yetunde Ibitoye, Sami Lachgar, Lisa Lehmann, Javier Perez, Margaret Ann Smith (Stanford Medicine), Rachelle Sico, Amit Talreja, Annisah Um’rani and Wayne Westerlind for their essential contributions to this work. Finally, we are grateful to Heather Cole-Lewis, Naama Hammel, Ivor Horn, Michael Howell, Yun Liu, and Eric Teasley for their insightful comments on the study design and manuscript.
