Melanoma accounts for only about 4% of all skin cancers, yet it is responsible for roughly 75% of all skin cancer-related deaths, according to the American Cancer Society. Survival depends strongly on the stage at diagnosis: localized melanoma is highly curable, whereas metastatic disease carries a far poorer prognosis, making early detection critical. Clinicians traditionally rely on the ABCDE rule (asymmetry, border irregularity, color variation, diameter, and evolution) to evaluate suspicious moles, but this approach depends heavily on individual expertise and is subject to interpretive variability.
The dermoscopy gap: Before dermoscopy images became available, even experienced dermatologists achieved only about 60% diagnostic accuracy for skin cancer. Dermoscopy raised that rate to between 75% and 84%, but manual interpretation remains time-consuming and error-prone due to the complexity of skin lesion patterns. Pathology via biopsy provides more definitive answers, but it is invasive, costly, and slow. This has created strong demand for automated diagnostic systems that can deliver objective, efficient, and accurate readings of dermoscopy images.
Enter machine learning and deep learning: Convolutional neural networks (CNNs) can learn from complex visual data, identifying patterns that human observers may miss. Techniques such as transfer learning and federated learning have been increasingly applied to help dermatologists classify melanoma lesions. Computer-aided diagnostic (CAD) tools built on these methods save both time and effort compared with traditional clinical approaches.
This systematic review examines 34 studies published between 2016 and 2024 that apply machine learning (ML) and deep learning (DL) to melanoma diagnosis and prognosis from dermoscopy images. The review synthesizes findings across model architectures, datasets, and evaluation metrics, and identifies challenges related to data diversity, model interpretability, and computational demands.
The authors followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. They searched six databases (PubMed, Science Direct, Springer, Frontiers, IEEE, and MDPI) for English-language articles published between 2016 and 2024. The search strategy combined keywords including "melanoma," "deep learning OR machine learning," "dermoscopy images," and "diagnosis OR prognosis," explicitly excluding review articles. This initial search returned 12,657 articles.
Screening and selection: After removing duplicates, 11,528 records remained. Applying inclusion criteria narrowed the pool to 170 full-text articles. Further exclusion criteria (non-English, not ISI-indexed, not focused on melanoma with dermoscopy images, not using ML or DL) reduced the final set to 34 studies. The distribution across databases was notable: MDPI contributed 12 selected studies (35%), Springer 10 (29%), Science Direct 5 (15%), IEEE 5 (15%), PubMed 1 (3%), and Frontiers 1 (3%).
Temporal distribution: Publication of the selected studies peaked in 2020 and 2022, reflecting growing research attention over time. To frame the analysis, the authors defined six structured research questions covering the ML/DL techniques and architectures employed, preprocessing and feature-engineering approaches, dataset curation, study type (classification, diagnosis, analysis, prediction, identification), evaluation metrics, and whether melanoma prognosis was addressed. This structured approach ensures systematic coverage across all 34 included studies.
ISIC (International Skin Imaging Collaboration): The most widely used benchmark family. The ISIC datasets grew substantially over the years: ISIC 2016 contained 900 images (173 melanoma), ISIC 2017 had 2,000 images (374 melanoma), ISIC 2018 expanded to 12,970 images (2,594 melanoma), ISIC 2019 reached 25,331 images (4,522 melanoma), and ISIC 2020 contained 33,126 images (584 melanoma). These datasets include high-resolution dermoscopy images along with metadata such as lesion type, patient age, and lesion location.
HAM10000: This dataset contains 10,015 dermoscopy images (1,113 melanoma) captured under controlled conditions using a dermatoscope. Images are labeled with diagnostic information including melanoma, nevi, seborrheic keratosis, and basal cell carcinoma. The images vary in resolution and quality, reflecting real-world clinical conditions with artifacts, blurriness, and uneven illumination. HAM10000 and ISIC are identified as the two most widely used benchmarks across the reviewed studies.
Smaller specialized datasets: PH2 contains 200 dermoscopy images (40 melanoma) at 768 × 560 pixel resolution captured with a DermLite II Pro dermoscope at 10x magnification. MedNode consists of 170 images (70 melanoma, 100 nevus) from the University Medical Center Groningen. DermIS contains 1,000 images (500 benign, 500 malignant) at 600 × 450 pixels. DermQuest provides labeled images spanning skin cancer, eczema, and psoriasis. DermPK contains 157 images from the Multan Institute of Nuclear Medicine and Radiotherapy in Pakistan.
The review found that most studies achieving above 90% accuracy used datasets with fewer than 1,000 images, and none of these high-accuracy studies trained on more than 5,000 images. The authors flag this as a serious concern, noting that high accuracy on small datasets likely reflects overfitting rather than genuine generalization, particularly for clinical decision-making about melanoma diagnosis and prognosis.
Machine learning algorithms process dermoscopy images by extracting handcrafted features such as asymmetry, border irregularity, color variation, and texture patterns, often aligned with the clinical ABCDE rule. Image processing techniques including edge detection and segmentation are used to enhance quality and highlight diagnostic features before classification.
Key algorithms: The reviewed studies employed Support Vector Machines (SVM), Random Forests, Logistic Regression (LR), Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNN), Decision Tree Classifiers (CART), and Gaussian Naive Bayes. Logistic Regression is particularly suited for binary classification (melanoma vs. benign) in biomedical settings, calculating class membership probabilities. LDA produces linear decision boundaries and is commonly used for supervised pattern classification. KNN relies on proximity-based classification, assigning categories based on closeness to similar data points.
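The handcrafted-feature step that feeds these classifiers can be made concrete. Below is a minimal numpy sketch of toy ABCD-style descriptors computed from a binary lesion mask; the function name and exact formulas are illustrative assumptions, not the reviewed studies' actual pipelines, but they show the kind of asymmetry, border, and color features an SVM or logistic regression would consume.

```python
import numpy as np

def abcd_features(mask, rgb=None):
    """Toy ABCD-style descriptors from a boolean lesion mask (hypothetical
    helper, not any reviewed study's exact feature set)."""
    ys, xs = np.nonzero(mask)
    cx = xs.mean()
    # A: asymmetry -- fraction of lesion pixels that fail to overlap the
    # mask's mirror image about a vertical axis through its centroid.
    flipped = np.fliplr(mask)
    shift = int(round(2 * cx)) - (mask.shape[1] - 1)  # align flip to centroid
    flipped = np.roll(flipped, shift, axis=1)
    asymmetry = np.logical_xor(mask, flipped).sum() / mask.sum()
    # B: border irregularity as compactness P^2 / (4*pi*A); ~1 for a circle.
    # Crude perimeter: lesion pixels with at least one background 4-neighbour.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~interior).sum()
    compactness = perimeter ** 2 / (4 * np.pi * mask.sum())
    # C: colour variation -- mean per-channel std inside the lesion, if given.
    colour_std = rgb[mask].std(axis=0).mean() if rgb is not None else 0.0
    return asymmetry, compactness, colour_std
```

On a synthetic centered disk the asymmetry score is near zero and compactness near one; a half-disk scores markedly higher asymmetry, mirroring how the clinical ABCDE cues translate into numeric inputs for the classifiers above.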
XG-Boost stood out for accuracy: Among ML methods, XG-Boost achieved 97.22% accuracy on a collected dataset, the highest among traditional ML approaches. However, the review found a critical limitation: XG-Boost had very low sensitivity (12.60%), meaning it failed to correctly identify most actual melanoma cases despite its high overall accuracy. This discrepancy highlights the danger of relying on accuracy alone when evaluating melanoma classifiers, where sensitivity (catching true positives) is clinically essential.
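The accuracy/sensitivity gap is easy to reproduce arithmetically. The toy confusion matrix below uses a class balance loosely modeled on ISIC 2020 (584 melanoma among 33,126 images); the individual cell counts are hypothetical, chosen only to show how a classifier that misses most melanomas can still post high overall accuracy.

```python
# Illustrative confusion-matrix arithmetic; counts are hypothetical,
# with prevalence loosely mirroring ISIC 2020 (584 melanoma / 33,126 images).
n_pos, n_neg = 584, 32_542

tp, fn = 74, 510        # catches only ~12.7% of melanomas
tn, fp = 32_442, 100    # but almost never flags a benign lesion

accuracy = (tp + tn) / (n_pos + n_neg)
sensitivity = tp / (tp + fn)      # recall on melanoma: the clinically vital number
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.4f} "
      f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

Despite missing 510 of 584 melanomas, this hypothetical classifier scores about 98% accuracy, because the benign majority dominates the denominator. This is exactly why the review insists on reporting sensitivity alongside accuracy.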
Beyond diagnosis, ML algorithms were also applied to predict prognosis, including melanoma recurrence risk and metastasis probability. These models integrate dermoscopy image features with clinical and histopathological data to assess disease progression and inform treatment decisions. However, the review notes that prognostic applications remain significantly underexplored compared to diagnostic tasks.
Deep Convolutional Neural Networks (DCNN): DCNNs automatically learn hierarchical representations from raw pixel data, capturing both low-level features (edges, textures) and high-level features (object shapes, patterns). They consist of convolutional layers that extract local patterns, pooling layers that down-sample while preserving important features, and fully connected layers for classification. Among all architectures reviewed, ResNet and VGGNet were the most frequently used, appearing 17 and 14 times respectively across the 34 studies.
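The two core DCNN building blocks named above, convolution and pooling, can be sketched in plain numpy. The edge-detecting kernel below is hand-set purely for illustration; in a trained CNN such kernels are learned from data.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D cross-correlation of a single-channel image x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def max_pool(x, s=2):
    """Non-overlapping s-by-s max pooling (down-samples, keeps strong responses)."""
    h, w = x.shape[0] // s, x.shape[1] // s
    return x[:h * s, :w * s].reshape(h, s, w, s).max(axis=(1, 3))

# A vertical-edge kernel, standing in for a learned low-level filter
img = np.zeros((8, 8))
img[:, 4:] = 1.0                             # bright right half
edge = conv2d(img, np.array([[-1.0, 1.0]]))  # responds where intensity jumps
pooled = max_pool(np.maximum(edge, 0))       # ReLU, then 2x2 max pooling
```

The convolution fires only along the intensity boundary, and pooling halves the spatial resolution while preserving that response, which is the hierarchy-of-features idea behind the deeper architectures discussed next.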
DenseNet and DenseNet-II: DenseNet connects each layer to every subsequent layer within a dense block, so each layer receives the feature maps of all preceding layers as input; transition layers between dense blocks handle down-sampling, and the DenseNet201 variant totals 201 layers. DenseNet-II, an improved variant, achieved 96.27% accuracy with 97.1% sensitivity on the HAM10000 dataset. DenseNet201 paired with transfer learning reached 95% accuracy on both ISIC 2017 and PH2 datasets, with 97% sensitivity on ISIC 2017 and 93% on PH2.
Specialized architectures: Several purpose-built networks showed strong results. DSCC_Net (Deep Learning-Based Skin Cancer Classification Network) identified four skin cancer types at fixed 150 × 150 pixel resolution and achieved 94.17% accuracy. SNC_Net integrated handcrafted and deep learning features with SMOTE-Tomek balancing, reaching 97.81% accuracy and 97.89% sensitivity on ISIC 2019. SCDNet achieved 96.91% accuracy with 92.18% sensitivity on ISIC 2019. Skin-Net used multilevel feature extraction with cross-channel correlation, reaching 99.29% accuracy on MED-NODE, 99.15% on DermIS/DermQuest, and 98.14% on ISIC 2017.
Segmentation-focused networks: U-Net, with its symmetric contracting and expanding paths joined by skip connections, was applied for pixel-wise lesion segmentation. FCRN (Fully Convolutional Residual Networks) incorporated multi-scale background integration to improve segmentation under limited training data. FrCN (Full Resolution Convolutional Networks) preserved full spatial resolution without subsampling, producing finely segmented lesion contours directly from raw input without preprocessing.
The review found that combining multiple methods, such as AlexNet with transfer learning or GoogleNet with transfer learning, consistently yielded higher accuracy and sensitivity than single-architecture approaches. Transfer learning was particularly effective at leveraging pre-trained features from large image datasets to compensate for limited dermoscopy training data.
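The core transfer-learning idea, reuse a feature extractor trained elsewhere and train only a small classifier head, can be sketched without a deep-learning framework. In the minimal numpy sketch below, a fixed random projection stands in for the frozen pretrained backbone (purely illustrative; the reviewed studies used real backbones such as AlexNet or GoogleNet), and only a logistic-regression head is trained on synthetic two-class data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained feature extractor: a fixed random
# projection (illustrative only; real pipelines use a CNN backbone).
W_frozen = rng.normal(size=(64, 16))

def extract(x):
    # "Frozen": these weights are never updated during fine-tuning.
    return np.tanh(x @ W_frozen)

# Tiny synthetic two-class task standing in for melanoma vs. benign
X = np.vstack([rng.normal(-1, 1, (100, 64)), rng.normal(1, 1, (100, 64))])
y = np.array([0] * 100 + [1] * 100)

# Train only the new classification head (logistic regression) on frozen features
F = extract(X)
w, b = np.zeros(16), 0.0
for _ in range(500):                        # plain batch gradient descent
    p = 1 / (1 + np.exp(-(F @ w + b)))      # sigmoid predictions
    g = p - y                               # gradient of log-loss
    w -= 0.1 * F.T @ g / len(y)
    b -= 0.1 * g.mean()

train_acc = ((1 / (1 + np.exp(-(F @ w + b))) > 0.5) == y).mean()
```

Because only the 17 head parameters are fitted, far less labeled data is needed than for training the whole network, which is precisely why transfer learning helps with small dermoscopy datasets.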
The review compiled accuracy, sensitivity, specificity, precision, F-measure, and AUC-ROC across all 34 studies. The top-performing architectures by accuracy were DenseNet (including DenseNet201 and DenseNet-II), XG-Boost, lightweight deep learning networks, and SMTP (Similarity Measure for Text Processing), all exceeding 96% accuracy. At the lower end, FCRN and SVM achieved approximately 85% accuracy.
The sensitivity problem: Accuracy alone tells an incomplete story. XG-Boost reached 97.22% accuracy but only 12.60% sensitivity, meaning it missed the vast majority of actual melanoma cases. In contrast, DenseNet and SMTP maintained both high accuracy and high sensitivity. DenseNet-II achieved 97.1% sensitivity alongside 96.27% accuracy on HAM10000. SMTP paired with CNN reached 98.04% sensitivity with 96% accuracy on ISIC 2019. The lightweight deep learning network achieved 98.7% accuracy with 99.27% sensitivity on ISIC 2018, one of the strongest combined results.
Dataset-specific performance: GoogleNet combined with transfer learning reached 95% accuracy on both PH2 and ISIC 2019 with 92.5% sensitivity on each. AlexNet with transfer learning scored 98.6% on ISIC 2019 and 90.48% on ISIC 2018. The DCNN-based approach on ISIC datasets showed progressive improvement: 81.41% accuracy on ISIC 2016, 88.23% on ISIC 2017, and 90.42% on ISIC 2020, suggesting that larger and more recent datasets support better model generalization.
Specificity and AUC-ROC: Several models showed strong specificity scores. DenseNet-II achieved 97.3% specificity, and Skin-Net reached 99.38% on MED-NODE and 99.41% on DermIS/DermQuest. The FrCN segmentation model scored 96.69% specificity on ISBI 2017 and 95.65% on PH2. AUC-ROC values, where reported, ranged from 0.92 to 0.99, with SCDNet achieving 0.9893 and the ensemble lightweight network reaching 0.9681.
The authors emphasize that DenseNet and SMTP used larger datasets (HAM10000 and ISIC 2019 respectively), which likely contributed to their balanced performance. They recommend selecting larger datasets to achieve more reliable and clinically meaningful diagnostic accuracy.
Dataset size and overfitting: A recurring limitation across the reviewed studies is the lack of access to large, well-labeled datasets. Small sample sizes lead to overfitting and reduce model generalizability. The review found that most studies achieving above 90% accuracy used datasets with fewer than 1,000 images, and none of them trained on more than 5,000. High accuracy on small datasets often reflects memorization rather than genuine learning, which undermines clinical reliability.
Single-modality focus: Most studies relied exclusively on dermoscopy images, overlooking complementary data types such as genomic information, histopathological data, and clinical records. Genomic data could provide insights into gene activity, identify potential biomarkers for metastatic melanoma, and suggest therapeutic targets. A multi-modal approach combining dermoscopy with genomic, histopathological, and clinical data could significantly improve both diagnostic and prognostic accuracy and enable personalized treatment recommendations.
Model interpretability: Deep learning models, particularly CNNs, function as black boxes, making it difficult for clinicians to understand and trust their outputs. Different researchers use different datasets, analytical techniques, and computational resources, further complicating cross-study comparisons. The lack of standardized explainability tools limits clinical adoption. The authors specifically call out the need for techniques like LIME (Local Interpretable Model-Agnostic Explanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) to provide visual and interpretable feedback.
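The core Grad-CAM computation is compact enough to sketch in numpy: the heatmap is a gradient-weighted sum of the last convolutional layer's activation maps, followed by a ReLU. The sketch below assumes the feature maps and gradients have already been extracted from a trained CNN via a deep-learning framework; it shows only the weighting step that produces the visual explanation.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Core Grad-CAM step. Both inputs are (K, H, W) arrays from the last
    conv layer of a trained CNN (assumed obtained via a DL framework)."""
    # alpha_k: global-average-pool each gradient map to one importance weight
    alphas = gradients.mean(axis=(1, 2))                          # shape (K,)
    # Weighted sum of forward activation maps, then ReLU to keep only
    # regions that *support* the predicted class
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(axis=0), 0)
    return cam / cam.max() if cam.max() > 0 else cam              # to [0, 1]

# Synthetic check: two maps, one pushed up by gradients, one pushed down
fm = np.zeros((2, 4, 4)); fm[0, 1, 1] = 1.0; fm[1, 2, 2] = 1.0
gr = np.stack([np.ones((4, 4)), -np.ones((4, 4))])
heat = grad_cam(fm, gr)
```

Upsampled to the input resolution and overlaid on the dermoscopy image, such a map shows the clinician which lesion region drove the melanoma prediction.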
Computational constraints: Several deep learning models require substantial computational resources for training and inference, limiting deployment in resource-constrained clinical environments. Lightweight architectures such as MobileNet and efficient compression techniques could make AI-based diagnostics more accessible, but their adoption across the reviewed studies was limited.
Generative AI for data expansion: Generative Adversarial Networks (GANs) have shown significant potential for augmenting small datasets by creating synthetic yet realistic dermoscopy images. Despite this promise, their application in melanoma diagnosis remains underexplored. The authors argue that GAN-generated images could substantially improve model training and robustness, particularly for rare conditions like melanoma where obtaining large labeled datasets is inherently difficult.
Federated learning and data sharing: The review recommends developing common data-sharing platforms and federated learning (FL) techniques that allow models to be trained on distributed datasets without violating patient privacy. Federated learning enables multiple institutions to collaboratively train models while keeping patient data local, addressing both the dataset size problem and privacy regulations simultaneously. The creation of open-access annotated melanoma datasets is identified as a crucial step forward.
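The aggregation step at the heart of federated learning, federated averaging (FedAvg), is simple to state: the server combines locally trained weights, weighted by each client's dataset size, and no patient images ever leave the hospitals. A minimal numpy sketch (the three-hospital setup and all numbers are hypothetical):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: combine per-client model weights (a list of
    layer arrays per client), weighted by local dataset size. Only weights
    travel to the server -- raw patient data stays at each institution."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum((n / total) * w[i] for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

# Three hypothetical hospitals with different amounts of dermoscopy data;
# each "model" is a single two-parameter layer for illustration.
w_a = [np.array([1.0, 1.0])]
w_b = [np.array([3.0, 3.0])]
w_c = [np.array([5.0, 1.0])]
global_w = fedavg([w_a, w_b, w_c], client_sizes=[100, 100, 200])
```

In a real round, each hospital would first run a few epochs of local training, and the averaged global model would then be sent back out for the next round.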
Prognostic modeling: While melanoma diagnosis has been widely studied, the use of ML and DL models to predict melanoma prognosis, including survival rates and recurrence risks, has received far less attention. The authors call for integrating additional clinical data such as patient history, genetic markers, and treatment response data into predictive models for long-term outcomes. This could transform AI from a purely diagnostic tool into a comprehensive clinical decision-support system.
Lightweight and explainable models: To address computational barriers, future research should explore efficient architectures like MobileNet and model compression techniques suitable for deployment on standard clinical hardware. Simultaneously, integrating LIME and Grad-CAM into model pipelines would provide visual explanations of predictions, building clinician trust and facilitating regulatory approval. Explainability combined with strong quantitative performance represents the clearest path to real-world clinical integration.