Artificial Intelligence and New Technologies in Melanoma Diagnosis: A Narrative Review

Cancers, 2025

Plain-English Explanations
Pages 1-5
What This Review Covers and How It Was Conducted

Scope: This narrative review surveys how artificial intelligence and advanced imaging technologies have transformed melanoma diagnosis during the critical 2020 to 2025 translational period. The authors synthesize progress across four domains: algorithmic evolution (from CNNs to Vision Transformers and foundation models), integration with non-invasive imaging (reflectance confocal microscopy, high-frequency ultrasound, optical coherence tomography, and 3D total body photography), regulatory developments (the EU AI Act and FDA guidance on adaptive systems), and translational barriers related to dataset bias and clinical workflow alignment.

Search strategy: A structured literature search was conducted from November 2024 to October 2025 across PubMed/MEDLINE, Scopus, ScienceDirect, IEEE Xplore, Nature Portfolio, and the arXiv preprint server. Regulatory documents were retrieved from the U.S. FDA and European Commission portals. From 1,246 initial records, 713 titles and abstracts were assessed after deduplication, 162 full texts were reviewed, and 98 studies met all inclusion criteria for the final synthesis.

Clinical context: Melanoma accounts for only 1% of skin cancers but causes most skin cancer deaths. In 2022, GLOBOCAN reported 331,722 new cases and 58,667 deaths worldwide. Localized melanoma has a five-year survival rate above 99%, but survival drops to around 30% in metastatic disease. Early detection remains the most important factor for survival, making AI-assisted diagnosis a high-impact clinical priority.

TL;DR: This narrative review of 98 studies (2020-2025) examines how AI and advanced imaging are reshaping melanoma diagnosis, covering algorithm evolution, imaging integration, regulatory frameworks, and translational challenges.
Pages 6-7
Convolutional Neural Networks: The Foundation of Melanoma AI

Architecture baselines: In the early 2020s, Convolutional Neural Networks (CNNs) dominated melanoma image classification. Standard architectures such as ResNet-50, EfficientNet-B4, Inception-v3, DenseNet, and MobileNet were benchmarked on reference datasets like HAM10000 (10,015 dermoscopic images) and ISIC challenge sets. Many of these models achieved diagnostic accuracies exceeding 95% and AUC values of 0.94 to 0.96 on dermoscopic image classification tasks.

How CNNs work: CNNs use layers of small mathematical filters (convolutional kernels) that slide across an image to detect patterns like edges, textures, and color gradients. Each layer builds on the previous one, moving from simple features to complex structures. For melanoma, this means the network learns to recognize features clinicians use, such as border irregularity, pigment patterns, and vascular structures visible in dermoscopy.
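To make this concrete, here is a minimal sketch of such a filter hierarchy, assuming PyTorch is available; the layer sizes are illustrative and not any published melanoma architecture:

```python
import torch
import torch.nn as nn

class TinyLesionCNN(nn.Module):
    """Illustrative CNN: stacked 3x3 filters build from edges up to lesion-level patterns."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, color gradients
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid layer: textures, pigment patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # deeper layer: composite structures
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(h)

# One dermoscopic image batch: 224x224 RGB -> benign/malignant logits
logits = TinyLesionCNN()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 2])
```

Note that each 3x3 kernel covers only a tiny patch of the 224x224 image, which is exactly the locality constraint discussed next.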

Inherent limitations: Despite strong benchmark results, CNNs are constrained by their local receptive fields. Each filter only "sees" a small patch of the image at a time, which limits the model's ability to capture long-range spatial relationships. In dermatology, clinical judgment depends on holistic features like overall lesion symmetry, color uniformity across the entire lesion, and contextual comparison with neighboring moles. These limitations drove the research community toward attention-based architectures capable of modeling global image context.

TL;DR: CNNs like ResNet-50 and EfficientNet-B4 achieved AUC 0.94-0.96 on melanoma detection benchmarks, but their inability to capture whole-lesion context pushed the field toward transformer architectures.
Pages 7-8
Vision Transformers and Hybrid Models: Capturing the Full Picture

The ViT revolution: Vision Transformers (ViTs) brought the self-attention mechanism from natural language processing into medical imaging. Unlike CNNs, ViTs divide an image into patches and compute attention scores between every pair of patches, enabling the model to capture long-range dependencies across the entire image. Specialized architectures like DermViT and EViT-Dens169 used hierarchical attention and multi-scale feature pyramids to achieve accuracies up to 97% on ISIC datasets, with AUC values of 0.96 to 0.98, while reducing parameter count by nearly 40% compared to equivalent CNNs.
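A minimal sketch of the core mechanism, assuming PyTorch; the patch size and dimensions are illustrative, not the DermViT or EViT-Dens169 configurations:

```python
import torch
import torch.nn as nn

patch, dim = 16, 128
img = torch.randn(1, 3, 224, 224)                       # one dermoscopic image

# Patchify: a strided conv is the standard trick for 16x16 patch embedding
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(img).flatten(2).transpose(1, 2)     # (1, 196, 128): a 14x14 patch grid

# Global self-attention: every patch attends to all 195 others, so lesion-wide
# properties like overall symmetry and color variegation are modeled directly
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)  # (1, 196, 128) and (1, 196, 196) patch-to-patch scores
```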

Hybrid CNN-ViT models: Architectures such as ConvNeXt and SkinSwinViT combine the strengths of both approaches. They use convolutional layers for efficient local feature extraction (edges, textures) and transformer layers for global contextual reasoning (symmetry, overall lesion shape). These hybrids achieved AUC values of 0.97 to 0.98 on datasets like BCN_20000 (18,946 images) and MILK10k, offering the best balance between accuracy and computational efficiency for real-time or resource-limited clinical environments.
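The shared hybrid pattern, sketched minimally in PyTorch; this illustrates the conv-stem-plus-transformer idea, not the published ConvNeXt or SkinSwinViT designs:

```python
import torch
import torch.nn as nn

stem = nn.Sequential(                                   # local: edges, textures
    nn.Conv2d(3, 64, kernel_size=4, stride=4),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),
)
encoder = nn.TransformerEncoderLayer(                   # global: symmetry, overall shape
    d_model=128, nhead=4, batch_first=True
)
head = nn.Linear(128, 2)

x = torch.randn(1, 3, 224, 224)
feat = stem(x).flatten(2).transpose(1, 2)               # (1, 784, 128) token grid from conv features
logits = head(encoder(feat).mean(dim=1))                # pool tokens -> class logits
print(logits.shape)  # torch.Size([1, 2])
```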

Why ViTs outperform CNNs: The self-attention mechanism provides a global receptive field, allowing the model to capture symmetry, border irregularity, and color variegation, the features central to the ABCDE diagnostic criteria. ViTs are also more robust to variable lighting, sensor differences, and imaging artifacts because of their patch-based representation. Large-scale pre-training on non-dermatologic datasets improves generalization when fine-tuning on the relatively small datasets common in dermatology.

Meta-analysis confirmation: Meta-analyses from 2023 to 2025 confirmed the maturity of these architectures, reporting average AUC values between 0.96 and 0.98 in large-scale evaluations. These studies consistently found that modern AI systems perform comparably to or better than experienced dermatologists in melanoma detection.

TL;DR: Vision Transformers (DermViT, EViT-Dens169) achieved AUC 0.96-0.98 with 40% fewer parameters than CNNs. Hybrid CNN-ViT models like SkinSwinViT combine local and global reasoning for the best accuracy-efficiency balance.
Pages 8-9
AI Versus Dermatologists: Head-to-Head Performance

Pooled benchmarks: Systematic reviews from 2023 to 2025 demonstrate that modern AI systems achieve pooled sensitivity of approximately 86.3% and specificity of 78.4% for melanoma detection. By comparison, generalist clinicians (non-dermatologists) showed markedly lower performance with sensitivity of 64.6% and specificity of 72.8%. This gap underscores the primary clinical value of AI as an augmentation tool for non-specialists in primary care and teledermatology, rather than a replacement for expert dermatologists.
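For readers unfamiliar with these metrics, here is a short sketch showing how sensitivity and specificity come out of confusion-matrix counts; the counts below are hypothetical, chosen only to reproduce the pooled figures:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: melanomas correctly flagged."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: benign lesions correctly cleared."""
    return tn / (tn + fp)

# Hypothetical evaluation set: 190 melanomas, 810 benign lesions
tp, fn, tn, fp = 164, 26, 635, 175
print(f"sensitivity = {sensitivity(tp, fn):.1%}")   # 86.3%
print(f"specificity = {specificity(tn, fp):.1%}")   # 78.4%
```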

Landmark prospective trial: A multicenter prospective study by Heinlein et al. (2024) demonstrated that an AI classifier achieved higher sensitivity (92.1%) than expert dermatologists (73.4%) but at the expense of lower specificity (67.3% vs. 82.8%). The critical finding was that combined human-AI evaluation achieved the best overall diagnostic balance, with specificity reaching 90.7%. This confirms that collaborative human-AI decision-making yields the highest diagnostic reliability and reproducibility.

AI-assisted improvement: Prospective studies show that AI-assisted decision support can increase non-specialist diagnostic performance by 10 to 15 percentage points and reduce mismanagement of malignant lesions from nearly 60% to less than 5%. These findings position AI as a powerful triage and decision-support tool, especially in settings where access to dermatologists is limited.

TL;DR: AI achieves 92.1% sensitivity vs. 73.4% for expert dermatologists alone, but combined human-AI review reaches 90.7% specificity. AI boosts non-specialist accuracy by 10-15 percentage points.
Pages 9-11
Training Datasets and the Problem of Algorithmic Bias

Core benchmark datasets: Most melanoma AI research relies on a handful of public datasets. HAM10000 contains 10,015 dermoscopic images with over 50% biopsy confirmation but suffers from class imbalance and light-skin bias. BCN_20000 has 18,946 dermoscopic images with 100% biopsy confirmation for malignancies but limited Fitzpatrick skin type diversity. The ISIC challenge datasets (2018-2020) aggregate over 157,000 dermoscopic images but have heterogeneous annotation quality. The newer SLICE-3D (ISIC 2024) introduces 400,000+ 3D total body photography crops but has limited clinical validation so far.
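The class imbalance noted above is easy to see directly; a quick sketch assuming the HAM10000 metadata CSV as commonly distributed (the filename and the 'dx' diagnosis column are assumptions about that distribution):

```python
import pandas as pd

meta = pd.read_csv("HAM10000_metadata.csv")  # assumed local copy of the metadata file
counts = meta["dx"].value_counts()           # diagnosis labels, e.g. 'nv', 'mel', 'bcc'
print(counts)
print((counts / counts.sum()).round(3))      # melanocytic nevi dominate; melanoma is a minority class
```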

The skin-of-color gap: Most public dermatology datasets remain dominated by lighter Fitzpatrick skin types (I-III), leading to systematic performance disparities. The Fitzpatrick_17k dataset (16,577 clinical photos) was created to quantify this imbalance, while the DDI dataset (656 images) directly prioritized phototype diversity. Reviews confirm that AI diagnostic accuracy decreases significantly for darker skin tones. In the United States, five-year melanoma survival is 94% among White patients but only 70% among Black patients, a disparity that biased AI could worsen.

Regulatory response: This issue has reached regulatory attention. In the FDA's 2024 De Novo authorization of DermaSensor, the agency explicitly required demographic subgroup analysis and issued cautionary guidance for Fitzpatrick IV-VI populations due to insufficient sensitivity data. The EU AI Act mandates bias monitoring, representative training datasets, and transparency regarding demographic performance variation.

Mitigation strategies: Addressing bias requires more than simply collecting more diverse images. Effective approaches include domain adaptation, rebalancing techniques, synthetic augmentation of underrepresented lesion subtypes (acral and amelanotic melanoma), fairness constraints during model optimization, and federated learning across institutions with heterogeneous patient populations.
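As one concrete example of rebalancing, here is a minimal PyTorch sketch that oversamples the minority class during training; the same weighting idea extends to Fitzpatrick subgroups when phototype labels are available:

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

labels = torch.tensor([0] * 900 + [1] * 100)            # e.g. 900 benign, 100 melanoma
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]                    # rare class gets 9x the sampling weight

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
dataset = TensorDataset(torch.randn(1000, 8), labels)   # stand-in features for illustration
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
# Batches drawn from this loader are roughly class-balanced, so the model
# cannot minimize its loss by simply predicting the majority class.
```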

TL;DR: Most AI training data overrepresents lighter skin tones, causing worse performance on darker skin. The FDA and EU now mandate demographic fairness analysis, and mitigation requires domain adaptation, synthetic augmentation, and federated learning.
Pages 11-13
Beyond Dermoscopy: AI Meets Advanced Imaging Technologies

Reflectance confocal microscopy (RCM): RCM provides near-histologic, in vivo resolution of skin architecture at the cellular level. AI-assisted RCM algorithms can automatically delineate the dermal-epidermal junction and classify cellular morphology, reducing unnecessary biopsies by over 50%. These systems routinely achieve AUC values near 0.97, positioning AI-enhanced RCM as a realistic path toward "digital biopsy" for clinically equivocal lesions that would otherwise require surgical excision.

Optical coherence tomography (OCT) and high-frequency ultrasound (HFUS): AI-enhanced OCT and HFUS imaging allow non-invasive measurement of tumor depth and margin visualization. These modalities correlate strongly with histopathologic Breslow thickness (correlation coefficient r = 0.88-0.94), the key prognostic measurement for melanoma staging. This extends AI applications from purely diagnostic classification to preoperative staging and risk assessment, helping surgeons plan appropriate excision margins without invasive procedures.
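To illustrate what the reported correlation means, a minimal sketch computing Pearson's r between hypothetical AI depth estimates and histopathologic ground truth (all values invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

histology_mm = np.array([0.4, 0.8, 1.2, 1.9, 2.6, 3.4])    # ground-truth Breslow depth
ai_estimate_mm = np.array([0.5, 0.7, 1.3, 1.7, 2.8, 3.2])  # hypothetical OCT/HFUS readout

r, p_value = pearsonr(histology_mm, ai_estimate_mm)
print(f"r = {r:.2f} (p = {p_value:.4f})")
# r ~ 0.99 on these toy values; the review reports r = 0.88-0.94 on real cohorts
```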

3D total body photography (3D TBP): AI-supported serial 3D TBP enables automated detection of new or changing lesions in high-risk patients undergoing longitudinal surveillance. Early studies show algorithmic monitoring improves early melanoma detection rates by approximately 10% compared to manual clinical review. This establishes a paradigm for population-level surveillance where AI tracks subtle changes across hundreds of moles over time.

Hyperspectral imaging (HSI): HSI captures reflectance spectra across tens to hundreds of narrow wavelength bands, revealing biochemical and microstructural skin properties invisible to standard dermoscopy or RGB photography. The Spectrum-Aided Vision Enhancer (SAVE) system demonstrated reliable detection of acral lentiginous melanoma, melanoma in situ, nodular melanoma, and superficial spreading melanoma. HSI is especially promising for acral and amelanotic melanomas, two subtypes frequently missed by both clinicians and standard AI systems.

TL;DR: AI combined with RCM achieves AUC 0.97 and cuts unnecessary biopsies by 50%. HFUS/OCT correlates with Breslow thickness (r=0.88-0.94) for staging. 3D body photography and hyperspectral imaging add longitudinal monitoring and biochemical detection capabilities.
Pages 13-14
Foundation Models and Multimodal AI: The New Frontier

Foundation models: The most recent phase of AI development (2024-2025) is defined by large-scale, general-purpose foundation models. Frameworks such as PanDerm and DermINO were pre-trained on millions of unlabeled dermatologic images using self-supervised and hybrid learning paradigms. When fine-tuned with limited labeled data, these models outperform both conventional CNNs and transformer-based systems, achieving AUC values of 0.97 to 0.99 with the highest generalizability across institutions and patient populations of any architecture tested.
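A minimal sketch of the "fine-tune with limited labels" pattern, assuming PyTorch; torchvision's ImageNet-pretrained ResNet-50 stands in for a dermatology foundation model, since PanDerm/DermINO weights are not assumed here:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()                      # expose 2048-d embeddings
for p in backbone.parameters():
    p.requires_grad = False                      # frozen: pre-trained knowledge kept intact

head = nn.Linear(2048, 2)                        # the only trainable part
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(8, 3, 224, 224)             # a small labeled dermoscopy batch
labels = torch.randint(0, 2, (8,))
with torch.no_grad():
    emb = backbone(images)                       # features from the frozen backbone
loss = nn.functional.cross_entropy(head(emb), labels)
loss.backward()
optimizer.step()
```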

Multimodal integration: Multimodal fusion networks combine dermoscopic images, clinical photographs, histopathologic data, and patient metadata (age, sex, lesion location) into a single diagnostic framework. Datasets like MRA-MIDAS and MILK10k (5,240 training + 479 test images with clinical, dermoscopic, and metadata inputs) have enabled integrative modeling that links imaging features with molecular and clinical information. These multimodal systems achieve AUC values of 0.95 to 0.98 with improved robustness and explainability compared to single-modality models.
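A minimal late-fusion sketch, assuming PyTorch; the dimensions and metadata encoding are illustrative rather than the published fusion networks:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Concatenate an image embedding with encoded patient metadata before classifying."""
    def __init__(self, img_dim: int = 512, meta_dim: int = 3, num_classes: int = 2):
        super().__init__()
        self.meta_encoder = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU())
        self.classifier = nn.Linear(img_dim + 32, num_classes)

    def forward(self, img_emb: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_emb, self.meta_encoder(meta)], dim=1)
        return self.classifier(fused)

img_emb = torch.randn(4, 512)                        # from any dermoscopy backbone
meta = torch.tensor([[63., 1., 4.],                  # age, sex (encoded), body-site code
                     [48., 0., 7.],
                     [71., 1., 2.],
                     [35., 0., 5.]])
print(FusionNet()(img_emb, meta).shape)              # torch.Size([4, 2])
```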

Vision-language models: CLIP-inspired architectures align visual features with dermatologic terminology, improving both interpretability and transparency. For example, these models can explain their predictions using clinical language rather than abstract feature maps, making it easier for dermatologists to understand and trust AI outputs. This represents a paradigm shift toward explainable, federated, and clinically scalable AI systems.
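A minimal sketch of the CLIP-style zero-shot pattern, assuming the Hugging Face transformers library and the generic openai/clip-vit-base-patch32 checkpoint; a dermatology-tuned model would substitute domain weights and clinically validated prompt wording:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a dermoscopic image of a benign melanocytic nevus",
    "a dermoscopic image of a malignant melanoma with irregular borders",
]
image = Image.open("lesion.jpg")                     # hypothetical lesion photograph

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(prompts, probs[0].tolist())))         # prediction expressed in clinical language
```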

TL;DR: Foundation models like PanDerm and DermINO achieve the highest AUC (0.97-0.99) with best cross-institution generalization. Multimodal systems fusing dermoscopy, clinical data, and metadata outperform single-image approaches in robustness.
Pages 14-17
Regulatory Landscape: FDA Approvals and the EU AI Act

FDA-authorized devices: DermaSensor became the first AI-enabled dermatologic device cleared for use by primary care physicians (De Novo classification, 2024). The device couples elastic scattering spectroscopy with machine learning; in the DERM-ASSESS III trial it showed sensitivity of 95.5% and negative predictive value of 98.1%, but specificity was very low (20.7-32.5%), meaning that as many as four out of five benign lesions assessed with the device still receive a "refer" result. SkinVision, a smartphone-based CNN for self-screening, is CE-marked as a class IIa device in the EU with sensitivity of 92.1% and specificity of 80.1%.

FDA Predetermined Change Control Plan (PCCP): Finalized in late 2024, the PCCP allows manufacturers of AI-enabled Software as a Medical Device (SaMD) to predefine algorithmic modifications, such as retraining on new data or recalibrating thresholds, without requiring a new regulatory submission for each update. This framework marks a pivotal shift from static regulation to dynamic oversight of adaptive AI systems, provided that manufacturers specify verification steps, demographic subgroup analyses, and drift monitoring mechanisms.

EU Artificial Intelligence Act (2024): The EU AI Act classifies AI-based medical diagnostic systems as "high-risk," requiring developers to implement comprehensive quality management systems, algorithmic transparency, data governance standards, logging requirements, continuous monitoring, and mechanisms for mandatory human oversight. These requirements pose significant operational challenges, particularly for smaller developers lacking resources for continuous monitoring and cybersecurity audits.

Unresolved liability questions: As AI systems approach or exceed dermatologist-level performance, responsibility for diagnostic errors becomes increasingly ambiguous. Liability may be distributed across clinicians, institutions, manufacturers, and model maintainers. Clear legal frameworks defining performance thresholds and responsibility allocation are still lacking and represent a major barrier to widespread deployment.

TL;DR: DermaSensor (FDA 2024) achieves 95.5% sensitivity but only 20.7-32.5% specificity. The FDA PCCP enables adaptive AI updates without full resubmission. The EU AI Act classifies medical AI as high-risk with strict transparency and oversight requirements.
Pages 17-22
Challenges, Emerging Technologies, and the Path Forward

Explainable AI (XAI): As algorithmic complexity increases, explainability frameworks such as LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), and Grad-CAM (Gradient-weighted Class Activation Mapping) have become central to clinical acceptance. These tools visualize which image features contributed most to a model's prediction, helping clinicians distinguish genuine diagnostic cues from spurious correlations. Explainability is now a prerequisite for regulatory approval under both the FDA SaMD framework and the EU AI Act.
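A minimal Grad-CAM sketch, assuming PyTorch and a torchvision ResNet-50 as a stand-in classifier; real deployments use dedicated libraries, but the mechanism fits in a few lines:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
acts, grads = {}, {}

def capture(module, inputs, output):
    acts["v"] = output                                  # feature maps of the last conv stage
    output.register_hook(lambda g: grads.update(v=g))   # their gradients during backprop

model.layer4.register_forward_hook(capture)

img = torch.randn(1, 3, 224, 224)                      # stand-in for a dermoscopic image
logits = model(img)
logits[0, logits.argmax()].backward()                  # backprop the top-class score

weights = grads["v"].mean(dim=(2, 3), keepdim=True)    # pooled gradients weight each channel
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear")
print(cam.shape)  # (1, 1, 224, 224): overlay on the lesion to see what drove the call
```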

Federated learning: Federated learning allows distributed model training across multiple institutions without direct data exchange, ensuring compliance with GDPR and HIPAA privacy regulations. Empirical evidence suggests federated models often outperform centrally trained counterparts on external datasets because they are exposed to more heterogeneous data distributions. This simultaneously improves robustness, mitigates algorithmic bias, and enables global collaboration without compromising patient privacy.
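A minimal sketch of the federated averaging (FedAvg) step, assuming PyTorch; in practice each "client" is a hospital training on its own private images, and only weights cross institutional boundaries:

```python
import copy
import torch
import torch.nn as nn

def federated_average(client_states: list[dict]) -> dict:
    """Average parameter tensors key-by-key across client model states."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg

global_model = nn.Linear(128, 2)                           # stand-in lesion classifier head
clients = [copy.deepcopy(global_model) for _ in range(3)]  # three hospitals

for client in clients:                                     # each trains on its own private data
    opt = torch.optim.SGD(client.parameters(), lr=0.1)
    x, y = torch.randn(16, 128), torch.randint(0, 2, (16,))
    nn.functional.cross_entropy(client(x), y).backward()
    opt.step()

# Only weights, never patient images, leave each site
global_model.load_state_dict(federated_average([c.state_dict() for c in clients]))
```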

Remaining structural barriers: Most published AI studies remain retrospective, single-center, and based on homogeneous populations, limiting external validity. Seamless integration into Electronic Health Record (EHR) systems via DICOM and FHIR standards remains a persistent technical barrier. Even highly accurate models are underutilized without proper workflow integration, clinician training on AI limitations, and understanding of uncertainty estimates.
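A minimal sketch of the DICOM side of that integration, assuming pydicom and a hypothetical dermoscopy export file:

```python
import pydicom

ds = pydicom.dcmread("dermoscopy_lesion.dcm")   # hypothetical DICOM export
print(ds.PatientID, ds.StudyDate, ds.Modality)  # identifiers EHR/FHIR workflows link on
pixels = ds.pixel_array                         # the image array handed to the AI model
print(pixels.shape)
```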

Emerging frontiers: Speculative but promising directions include diffusion models for rare lesion augmentation (improving performance on underrepresented melanoma subtypes), wearable microneedle biosensors for continuous AI-driven monitoring, smartphone-based spectroscopy for high-risk populations, and large multimodal foundation models integrating clinical notes, dermoscopy, histopathology, and genomics for holistic patient-level reasoning. The goal is transforming melanoma management from episodic screening to proactive, individualized real-time surveillance.

TL;DR: Key priorities include explainable AI for clinical trust, federated learning for privacy-preserving global collaboration, EHR integration via DICOM/FHIR standards, and emerging technologies like wearable biosensors and genomics-integrated foundation models.