Melanoma is the deadliest form of skin cancer, and late-stage detection is strongly associated with fatal outcomes. Early identification is critical for reducing mortality and avoiding unnecessary invasive procedures such as surgical biopsy. This systematic review by Patel et al. (2023) evaluates whether AI-based techniques, combined with non-invasive imaging modalities, can improve the diagnostic accuracy of melanoma detection compared to current clinical standards.
Non-invasive imaging modalities: The review focuses on three key technologies. Reflectance confocal microscopy (RCM) uses a diode laser to produce high-resolution horizontal images at the cellular level, reaching as deep as the papillary dermis. Optical coherence tomography (OCT) uses near-infrared light to capture microscopic images up to 2 mm below the skin surface with resolution between 3 and 15 micrometers. Dermoscopy uses a dermatoscope with polarized or non-polarized light to visualize patterns and microstructures in the epidermis and superficial dermis.
The core problem: While these imaging modalities have shown significant efficacy, they require substantial training and expertise to interpret, leading to variable diagnostic accuracy across practitioners. Image quality also affects interpretation and diagnostic time. AI-based techniques aim to automate this process, providing objectivity, consistency, and speed. However, training data imbalances pose a serious concern: datasets are often composed primarily of fairer skin tones, potentially resulting in less accurate diagnoses for patients with darker skin.
Scope of the review: The authors included 40 studies published between 2018 and 2022 that applied AI-based algorithms to melanoma detection using dermoscopy, RCM, or OCT. The vast majority (37 of 40) focused on dermoscopic images, with 2 studies on RCM and only 1 on OCT, reflecting the relative maturity and public dataset availability for dermoscopy-based AI.
Database search: The authors conducted a systematic literature search across PubMed/Medline, Embase, and Cochrane for publications from 2018 to 2023. Search terms included "melanoma", "neural network", "machine or deep learning", "artificial intelligence", "dermoscopy", "reflectance confocal microscopy", "optical coherence tomography", and related clinical terminology. The review adhered to the 2020 PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, and the protocol was registered in the OSF database (registration number: osf-registrations-z8tve-v1).
Screening and selection: Two independent authors (R.H.P. and E.F.) screened and manually assessed all search results, with a third reviewer (J.L.) resolving any discrepancies. From an initial pool of 287 articles assessed for eligibility, 40 were ultimately included. Inclusion criteria required original, peer-reviewed English-language research that directly compared AI-based evaluation with human experts or histopathology for melanoma detection using dermoscopy, RCM, or OCT. Studies involving only lesion segmentation without classification, commentary or editorials, or those not reporting diagnostic accuracy, AUC, or sensitivity/specificity were excluded.
Performance metrics: The three primary metrics extracted were accuracy, sensitivity, and specificity. Accuracy measures the ratio of correctly classified lesions to total examined lesions, though the authors note it can be misleading when class distributions are imbalanced. Sensitivity captures the proportion of true melanoma cases correctly identified, while specificity captures the proportion of true non-melanoma cases correctly identified. Additional metrics such as AUC, positive predictive value (PPV), and negative predictive value (NPV) were also reported when available.
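The metric definitions above can be made concrete with a small sketch. The following pure-Python snippet (not from the review; the counts are hypothetical) computes the extracted metrics from confusion-matrix counts and illustrates the authors' caveat that accuracy can look strong even when sensitivity is poor on an imbalanced lesion set:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic-accuracy metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # correctly classified / all lesions
        "sensitivity": tp / (tp + fn),  # true melanomas correctly identified
        "specificity": tn / (tn + fp),  # true non-melanomas correctly identified
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Imbalance caveat: with only 10 melanomas among 1,000 lesions, a model that
# misses half of them can still report 97.5% overall accuracy.
m = diagnostic_metrics(tp=5, fp=20, tn=970, fn=5)
print(round(m["accuracy"], 3), round(m["sensitivity"], 2))  # → 0.975 0.5
```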
Limitations of the methodology: The authors were unable to perform a meta-analysis due to variability in performance metrics across studies. The QUADAS-2 tool for analyzing diagnostic accuracy studies could not be applied because of incomplete reporting in several reviewed papers. Since studies used different datasets, algorithms, and preprocessing pipelines, direct head-to-head comparisons between methods were not feasible.
The vast majority of the reviewed studies (37 of 40) applied deep learning to dermoscopic images. Most trained their algorithms on publicly available datasets, including the International Skin Imaging Collaboration (ISIC) series (ranging from 1,279 images in ISIC 2016 to 44,108 images in ISIC 2020), the PH2 dataset (200 images), and HAM10000 (10,015 images). Ground truth labels were confirmed by histopathology, follow-up examination, expert consensus, or in vivo confocal microscopy.
Top-performing algorithms: Foahom Gouabou et al. achieved an AUROC of 0.93 for melanoma detection using a deep learning ensemble method on 1,113 dermoscopic images from ISIC 2018, outperforming a trained dermatologist on challenging pigmented lesions (a score of 0.90 for the melanoma class). Xin et al. proposed SkinTrans, a vision transformer network achieving 94.1% accuracy on clinical dermoscopic images. Singh et al. applied a segmentation model across four datasets (PH2, ISIC 2017, ISIC 2018, ISIC 2019), yielding accuracy scores of 99.50%, 99.33%, 98.56%, and 98.04%, respectively. Sayed et al. achieved 98.37% accuracy, 100% sensitivity, 96.47% specificity, and an AUC of 0.99 on ISIC 2020 using a hybrid CNN with bald eagle search optimization.
AI vs. dermatologists: Several landmark studies directly compared algorithm performance to dermatologist panels. Marchetti et al. (2016) showed the top fusion algorithm achieved a ROC AUC of 0.86, significantly outperforming eight experienced dermatologists whose mean ROC AUC was 0.71 (p = 0.001). In a follow-up study (2020), the best algorithm outperformed eight dermatologists and nine trainees (p < 0.001), with a ROC AUC of 0.87 vs. 0.74 and 0.66, respectively. Pham et al. trained on 17,302 images and outperformed all 157 dermatologists across 12 German university hospitals, achieving an AUC of 94.4%, sensitivity of 85.0%, and specificity of 95.0%. Haenssle et al. showed their CNN achieved a ROC AUC of 0.86 versus a mean of 0.79 for 58 dermatologists (p < 0.01). Brinker et al. demonstrated their algorithm outperformed 136 of 157 dermatologists in melanoma classification.
Specialized architectures: Naeem et al. developed SCDNet for multiclass skin cancer classification, achieving 92.18% accuracy and AUC of 0.9833 for melanoma. Lee et al. introduced Cancer-Net SCa with sensitivity of 92.8%, PPV of 78.5%, and NPV of 91.2%. Hagerty et al. used a fusion approach combining handcrafted image processing with ResNet50, achieving classification accuracy of 0.94 compared to 0.87 for deep learning alone. Nawaz et al. combined faster region-based CNNs (RCNN) with fuzzy k-means clustering (FKM), achieving average accuracy of 95.40%, 93.1%, and 95.6% across ISBI 2016, ISIC 2017, and PH2 datasets, and showing robustness to image artifacts like hair, blood vessels, and lighting variations.
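Hagerty et al.'s fusion approach pairs handcrafted descriptors with deep features. The simplest version of this design is early fusion: concatenating the two feature vectors into one input for a downstream classifier. A toy sketch with hypothetical features (the review does not specify the descriptors or embedding dimensions used):

```python
def fuse_features(handcrafted, cnn_features):
    """Early fusion: concatenate handcrafted descriptors with a CNN embedding
    into one vector for a downstream classifier."""
    return list(handcrafted) + list(cnn_features)

# Hypothetical per-lesion inputs: ABC-rule-style scores (asymmetry, border,
# colour) alongside a truncated stand-in for a ResNet50 embedding.
handcrafted = [0.31, 0.74, 0.12]
embedding = [0.05, -0.88, 0.43, 0.90]
fused = fuse_features(handcrafted, embedding)
print(len(fused))  # → 7
```

The appeal of this design, reflected in Hagerty et al.'s results (0.94 vs. 0.87 for deep learning alone), is that domain-specific features can encode clinical criteria the network might otherwise have to rediscover from limited data.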
Reflectance confocal microscopy (RCM) provides real-time in vivo visualization from the epidermis to the papillary dermis at a resolution comparable to histology. By enabling a "virtual biopsy" of skin lesions, RCM can reduce unnecessary invasive procedures and guide surgical excision margins. However, this method is inherently subjective and depends heavily on user interpretation, which creates variability in diagnostic outcomes. This makes RCM a natural candidate for AI augmentation.
Wodzinski et al. (2019): This study proposed a convolutional neural network (CNN) to classify skin lesions using RCM mosaics. The dataset consisted of 429 RCM mosaics divided into three classes: melanoma, basal cell carcinoma, and benign naevi. The test set classification accuracy reached 87%, which was higher than the accuracy achieved by medical confocal users. This system demonstrates the potential for early, non-invasive melanoma detection through automated RCM interpretation.
D'Alonzo et al. (2021): This study developed a weakly supervised deep neural network model for semantic segmentation of RCM mosaics, separating images into "benign" and "aspecific" (suspicious) regions. Working with 157 RCM mosaics, the model achieved an average AUC of 0.969 and a Dice coefficient of 0.778. This approach makes the diagnostic decision model more interpretable to clinicians by spatially localizing suspicious regions within the image, rather than providing a single classification label.
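The Dice coefficient reported by D'Alonzo et al. measures spatial overlap between the predicted and reference regions: twice the intersection divided by the summed sizes of the two masks. A minimal sketch on flattened binary masks (the toy masks below are illustrative, not from the study):

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences:
    2 * |intersection| / (|pred| + |truth|). Returns 1.0 for two empty masks."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    size = sum(pred) + sum(truth)
    return 2 * intersection / size if size else 1.0

# Toy 1-D masks standing in for flattened "suspicious region" segmentations:
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
print(dice_coefficient(pred, truth))  # → 0.75
```

A score of 1.0 means perfect overlap; the study's 0.778 indicates substantial but imperfect agreement with the reference annotations.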
Despite these promising results, the number of studies applying AI to RCM remains very small (only 2 of 40 in this review). The limited availability of large, publicly accessible RCM datasets is a primary bottleneck. Unlike dermoscopy, where datasets like ISIC contain tens of thousands of images, RCM datasets are typically collected at individual clinical centers and remain small in scale.
Optical coherence tomography (OCT) is an interferometric imaging method that provides real-time views of the superficial skin layers using infrared broadband light, reaching depths of 1 to 2 mm with resolution between 3 and 15 micrometers. While OCT has proven useful for diagnosing basal cell carcinoma and other keratinocyte carcinomas, its application to malignant melanoma remains in early stages. Initial studies have shown OCT can detect diagnostic features of melanoma such as epidermal psoriasiform hyperplasia, melanocytic nests, and vertical icicle-shaped structures, but its sensitivity and specificity as a standalone tool are not yet convincing.
Vibrational OCT (VOCT): Traditional OCT has not yet been applied to AI-assisted melanoma diagnosis. However, a variant called vibrational OCT was described in one paper. VOCT combines traditional OCT with a mechanical vibration stimulus to measure the vibrational properties of tissue structures, providing additional information about tissue composition and function. This technique can differentiate between tissue features such as the presence of collagen or elastin, which can help distinguish melanoma from benign moles.
Silver et al. (2022): This study explored the use of VOCT with machine learning to differentiate between normal skin and different skin cancers non-invasively, using a dataset of 80 images. The machine learning algorithm, combined with the height and location of VOCT mechanovibrational peaks, achieved sensitivity of 83.3% and specificity of 77.8% in differentiating between normal skin and cancerous lesions, including melanoma. While these results are promising, the dataset was very small and the technique has yet to be replicated at scale.
Angiographic OCT: The review also notes that angiographic OCT shows potential for melanoma diagnosis and staging, as it can detect early changes in vessel morphology during the transition from dysplastic nevi to melanoma. However, this approach has not yet been combined with AI-based analysis.
All 40 studies included in the review demonstrated robust performance of AI-based algorithms in melanoma identification. In studies directly comparing AI to dermatologists on dermoscopic images, AI-based algorithms consistently achieved ROC AUC values above 0.80, with a mean algorithm sensitivity of 83.01% and a mean algorithm specificity of 85.58%. For context, Phillips et al. conducted a meta-analysis showing that primary care physicians achieved an AUC of 0.83 with sensitivity of 79.9% and specificity of 70.9%, while dermatologists achieved an AUC of 0.91 with sensitivity of 87.5% and specificity of 81.4%. On these benchmarks, AI algorithms are approaching, and in some studies exceeding, dermatologist-level performance.
Dataset bias: A critical issue identified across the literature is the composition of training data. Most studies used the ISIC publicly available datasets, which primarily encompass lesion data from light-skinned patients in the United States, Europe, and Australia, with reduced representation of skin lesions from Asian or darker-skinned populations. This creates a fundamental bias: algorithms trained predominantly on fair-skinned patient data may produce less accurate diagnoses for patients with darker skin tones. The review emphasizes that future dataset development must expand to include images from all skin tones and ethnicities.
Patient perceptions: One cited study found that patients appeared receptive to AI use in skin cancer screening, provided the physician-patient relationship was preserved. This suggests that the barrier to clinical adoption may be less about patient acceptance and more about achieving sufficient algorithm validation, regulatory clearance, and workflow integration. The integration of AI into clinical platforms could particularly benefit rural communities, which already experience disparities in melanoma incidence and higher mortality in parts of the United States.
Lack of standardization: Because the 40 reviewed studies used different datasets, different deep learning architectures, and different image preprocessing pipelines, direct comparison of algorithm efficacy was not feasible. There are no standardized reporting or evaluation frameworks in place for AI-based melanoma detection. The QUADAS-2 tool, commonly used to assess diagnostic accuracy study quality, could not be applied due to incomplete reporting in several papers. This lack of standardization makes it difficult to determine which approaches are truly superior.
Skin tone and generalizability: The majority of studies focused on patients with fairer skin types. Translating these findings to darker-skinned populations remains a significant challenge. Artifacts such as the presence of hair in images, skin texture differences, and variations in image acquisition conditions (zoom level, focus, lighting, surgical ink markings) can all introduce bias and degrade algorithm accuracy. An imbalanced dataset during neural network training leads to uneven performance when the model encounters real-world clinical diversity.
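One standard mitigation for the training imbalance described above is to weight each class's contribution to the loss by the inverse of its frequency, so rare classes (e.g. melanoma, or underrepresented skin tones) are not drowned out by the majority class. A sketch of the weight computation, with hypothetical class counts:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / class frequency,
    normalized so that the weights average to 1 across classes."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical dermoscopy training set: 900 benign naevi vs. 100 melanomas.
labels = ["benign"] * 900 + ["melanoma"] * 100
weights = inverse_frequency_weights(labels)
print(weights["melanoma"], round(weights["benign"], 3))  # → 5.0 0.556
```

Weights like these are typically passed to the training loss (e.g. a class-weighted cross-entropy), which is only a partial fix: it rebalances the optimization but cannot substitute for genuinely diverse training data.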
Publication bias: The review may be subject to publication bias, as studies reporting favorable AI results are more likely to be published. Studies with lower accuracy rates or higher false positive and false negative rates are less likely to appear in the literature, potentially inflating the perceived performance of AI in melanoma detection.
Limited modality coverage: Studies directly comparing AI-based algorithms to dermatologists were relatively few, and those applying AI to OCT and RCM were severely limited (1 and 2 studies, respectively). The small sample sizes in the OCT study (80 images) and RCM studies (429 and 157 mosaics) mean that these results, while encouraging, cannot be considered definitive. The scarcity of public datasets for OCT and RCM remains a bottleneck for AI development in these modalities.
Ethical implications: The integration of AI into dermatology raises important ethical, legal, and privacy concerns. AI models trained mainly on European or East Asian populations carry a built-in diagnostic bias against darker-skinned patients. Imbalanced datasets may output incorrect results, with potentially serious consequences if treatment or surgery is undertaken based on flawed AI recommendations. The review stresses that AI should serve as a supplement or diagnostic aid, not as a replacement for board-certified dermatologists. Patients should be informed and educated about AI involvement in their diagnosis.
Expanding dataset diversity: Future development of public datasets must prioritize the inclusion of dermoscopic, RCM, and OCT images from all skin tones and ethnicities. This is essential not only for algorithmic fairness but also for realizing the full potential of AI in countries that may lack easy access to dermatological care. Achieving diverse, representative datasets will require multi-center, international collaboration and standardized image acquisition protocols. Sharing of code, models, and image datasets will also improve reproducibility.
Standardization and FDA guidance: The authors call for standardized reporting and evaluation methods across AI melanoma detection studies. This includes consistent performance metric reporting, transparent model descriptions, and uniform preprocessing standards. Food and Drug Administration (FDA) guidance will be necessary to ensure safe and effective clinical implementation. The multitude of AI techniques used to analyze dermatological images currently complicates clinical decision-making without clear guidelines on when and how to use specific algorithms.
Bridging the access gap: AI integration into clinical platforms could help reach underserved populations such as rural communities, which already suffer from melanoma incidence disparities and higher mortality. Combining AI with dermoscopy and visual inspection can make diagnosis more efficient and accessible in areas without ready access to dermatologists. Smartphone-based applications, such as the one evaluated by Phillips et al., which achieved a ROC AUC of 91.8% on smartphone-captured images, represent a promising pathway toward democratized screening.