Deep Learning for Melanoma Detection: A Deep Learning Approach to Differentiating Malignant from Benign Melanocytic Lesions


Plain-English Explanations
Pages 1-3
Why Melanoma Detection Remains a Critical Challenge

Melanoma is an aggressive cancer originating from melanocytes, the pigment-producing cells in the epidermis. Despite representing only a small fraction of all skin cancer diagnoses, it accounts for the majority of skin-cancer-related deaths worldwide. The disease is driven by genetic mutations in genes such as BRAF, NRAS, and CDKN2A, combined with environmental factors like ultraviolet radiation exposure. Clinically, melanoma can closely mimic benign melanocytic lesions (nevi), making early detection a significant diagnostic challenge. Delayed or inaccurate diagnoses lead to poorer outcomes, as the cancer rapidly progresses from localized cutaneous lesions to regional lymph node involvement and distant organ metastasis.

Diagnostic variability: Traditional diagnostic approaches, including visual inspection and dermoscopy, rely heavily on dermatological expertise and suffer from considerable inter-observer variability. Convolutional neural networks (CNNs) have emerged as a promising solution, with studies showing they can achieve diagnostic accuracies comparable to, or even surpassing, those of experienced dermatologists. Systematic reviews by Wu et al. (2022) and Magalhães et al. (2024) have identified key factors influencing CNN performance, including dataset quality, image resolution, and annotation accuracy.

Transfer learning advances: Work by Naeem et al. (2020) and Dildar et al. (2021) showed that transfer learning and pre-trained models significantly reduce the computational and data requirements for training CNNs in melanoma detection. However, challenges persist around dataset bias, class imbalance, and model interpretability, as highlighted by Efimenko et al. (2020) and Popescu et al. (2022). This study aims to compare four CNN architectures for binary classification of dermoscopic images to find the optimal balance of accuracy and efficiency.

TL;DR: Melanoma causes the majority of skin cancer deaths despite being a small fraction of cases. CNNs can match or exceed dermatologist-level accuracy, but challenges remain around dataset bias, interpretability, and inter-observer variability. This study benchmarks four CNN architectures on 8,825 dermoscopic images.
Pages 3-4
Dataset Preparation and Image Augmentation Pipeline

Dataset source: The authors used 8,825 dermoscopic images from the DermNet repository, a publicly available resource widely used in dermatology research. The images were categorized into two classes: benign and malignant skin lesions. All images were resized to a standardized resolution of 224 x 224 pixels, which is a standard CNN input size that balances computational efficiency with detail preservation.

Data splitting: Using the split-folders library, the dataset was divided into training (7,060 images, 80%), validation (882 images, 10%), and testing (883 images, 10%) subsets. A stratified splitting approach ensured even representation of benign and malignant classes across all subsets, preventing class imbalance from biasing model training. The test subset was strictly isolated from the training process to ensure unbiased performance assessment.
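The paper uses the split-folders library for this; the same stratified 80/10/10 logic can be sketched with the standard library alone (the filenames and class labels below are illustrative, not from the dataset):

```python
import random
from collections import defaultdict

def stratified_split(items, labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split items into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    splits = {"train": [], "val": [], "test": []}
    for group in by_class.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]
    return splits

# Illustrative usage with dummy filenames (600 benign, 400 malignant):
items = [f"img_{i}.jpg" for i in range(1000)]
labels = ["benign" if i < 600 else "malignant" for i in range(1000)]
splits = stratified_split(items, labels)
```

Splitting per class before concatenating is what keeps the benign/malignant ratio identical in all three subsets.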

Augmentation pipeline: TensorFlow's ImageDataGenerator was used to apply dynamic augmentation during training. The pipeline included pixel rescaling to [0, 1], rotation up to 30 degrees, width and height shifts of up to 20%, shearing transformations of 20%, zoom variations of 20%, horizontal and vertical flips, and brightness adjustments within a range of [0.8, 1.2]. The fill mode was set to "nearest" to fill in pixels exposed at image borders by geometric transformations. Crucially, augmentation was applied only to training data, while validation and testing datasets underwent only basic normalization to maintain evaluation integrity.
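The augmentation settings described above map directly onto ImageDataGenerator parameters; a minimal sketch (the exact argument values match the text, but the paper's full configuration may differ in details not reported):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation is applied to training data only.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # pixel values scaled to [0, 1]
    rotation_range=30,            # rotate up to 30 degrees
    width_shift_range=0.2,        # horizontal shift up to 20%
    height_shift_range=0.2,       # vertical shift up to 20%
    shear_range=0.2,              # shearing transformations of 20%
    zoom_range=0.2,               # zoom variations of 20%
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=(0.8, 1.2),  # brightness adjustment range
    fill_mode="nearest",          # fill pixels exposed by transforms
)

# Validation and test data receive normalization only.
eval_datagen = ImageDataGenerator(rescale=1.0 / 255)
```

Keeping a separate, augmentation-free generator for validation and testing is what preserves evaluation integrity.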

TL;DR: 8,825 DermNet images split 80/10/10 into training (7,060), validation (882), and test (883) sets. Images standardized to 224 x 224 pixels. Augmentation included rotation (30 degrees), shifts (20%), zoom (20%), flips, and brightness variation, applied only during training.
Pages 4-5
Four CNN Architectures: DenseNet121, ResNet50V2, NASNetMobile, and MobileNetV2

DenseNet121: Pre-trained on ImageNet, this architecture employs a dense connectivity pattern where each layer receives direct connections from all preceding layers. This promotes feature reuse and efficient gradient flow. Its model size is 27.86 MB, providing a good balance between computational requirements and model complexity.

ResNet50V2: Also pre-trained on ImageNet, ResNet50V2 implements a residual learning framework with skip connections that facilitate improved gradient flow throughout the network. At 91.93 MB, it is the largest model in the comparison, but it has demonstrated exceptional performance across various computer vision tasks.

NASNetMobile: Developed through Neural Architecture Search (NAS), an automated process that discovers optimal network topologies, this architecture was specifically optimized for mobile devices. At 17.34 MB, it represents a compromise between efficiency and performance.

MobileNetV2: The most compact architecture at just 9.89 MB, it employs depth-wise separable convolutions with inverted residuals and linear bottlenecks, making it highly efficient while maintaining competitive accuracy.

Custom classification head: All four architectures were modified with an identical custom top-layer configuration for binary classification. This included a global average pooling layer, batch normalization, a dense layer with 512 units and ReLU activation, a dropout layer (rate 0.5), another dense layer with 256 units and ReLU activation, a second dropout layer (rate 0.3), and a final output layer with a single unit and sigmoid activation. This standardized head ensured fair comparison across all models.
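The shared head translates into a short Keras definition; a sketch under two assumptions not stated in the text (the backbone is frozen during initial training, and `weights=None` stands in for the paper's `weights="imagenet"` to avoid the download):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(base):
    """Attach the shared binary-classification head to a pretrained base."""
    base.trainable = False  # assumption: backbone frozen for initial training
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.BatchNormalization(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),  # benign vs. malignant
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
    )
    return model

# Example with one of the four backbones; the paper used weights="imagenet".
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights=None)
model = build_model(base)
```

Because only the `base` argument changes, the head, optimizer, and loss stay identical across all four architectures, which is what makes the comparison fair.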

TL;DR: Four ImageNet-pretrained CNNs compared: DenseNet121 (27.86 MB), ResNet50V2 (91.93 MB), NASNetMobile (17.34 MB), and MobileNetV2 (9.89 MB). All shared the same custom classification head with two dense layers (512 and 256 units), dropout (0.5 and 0.3), and sigmoid output.
Pages 5-6
Training Configuration, Optimization, and Evaluation Protocol

Optimizer and hyperparameters: All models were trained using the Adam optimizer with an initial learning rate of 1 x 10^-4, a batch size of 32 images, and a maximum of 50 epochs. The loss function was binary cross-entropy, well suited to the binary classification task. Training was performed on Google Colab's NVIDIA GPU runtime, using TensorFlow 2.x with the Keras API and Python 3.10.

Callback mechanisms: Three callback strategies were implemented to optimize training. Early stopping monitored validation loss with a patience of 10 epochs, automatically terminating training when no improvement was observed and restoring the best weights. A learning rate reduction scheme reduced the rate by a factor of 0.2 when validation loss plateaued, with a patience of 5 epochs and a minimum threshold of 1e-6. Model checkpointing saved the best-performing model states based on validation accuracy.
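The three callbacks described above correspond to standard Keras classes; a sketch with the stated settings (the checkpoint filename is a placeholder):

```python
from tensorflow.keras.callbacks import (
    EarlyStopping, ReduceLROnPlateau, ModelCheckpoint)

callbacks = [
    # Stop when validation loss stalls for 10 epochs; keep the best weights.
    EarlyStopping(monitor="val_loss", patience=10,
                  restore_best_weights=True),
    # Multiply the learning rate by 0.2 on plateau, down to a floor of 1e-6.
    ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                      patience=5, min_lr=1e-6),
    # Save the best-performing model state by validation accuracy.
    ModelCheckpoint("best_model.keras", monitor="val_accuracy",
                    save_best_only=True),
]

# Passed to training as:
# model.fit(train_gen, validation_data=val_gen, epochs=50, callbacks=callbacks)
```

Saving in the `.keras` format preserves both architecture and weights, matching the reproducibility setup described below.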

Evaluation protocol: Final model evaluation used the held-out test dataset (883 images) and measured accuracy, AUC-ROC, inference time (averaged across multiple batches with standard deviations), and model size. McNemar's statistical test was employed to determine whether performance differences between model pairs were statistically significant, providing a rigorous foundation for comparison beyond raw accuracy numbers.
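McNemar's test compares two classifiers on the same test cases using only the discordant counts: b (model A right, model B wrong) and c (A wrong, B right). A stdlib sketch of the common continuity-corrected form (the paper does not specify which variant it used):

```python
def mcnemar_statistic(labels, pred_a, pred_b):
    """Continuity-corrected McNemar statistic for two paired classifiers."""
    b = sum(1 for y, a, p in zip(labels, pred_a, pred_b) if a == y and p != y)
    c = sum(1 for y, a, p in zip(labels, pred_a, pred_b) if a != y and p == y)
    if b + c == 0:
        return 0.0
    # Approximately chi-squared with 1 degree of freedom under H0.
    return (abs(b - c) - 1) ** 2 / (b + c)

# Illustrative: A is correct on 9 cases B misses; B is correct on 1 case A misses.
labels = [1] * 20
pred_a = [1] * 19 + [0]            # A wrong only on the last case
pred_b = [1] * 10 + [0] * 9 + [1]  # B wrong on cases 10-18
stat = mcnemar_statistic(labels, pred_a, pred_b)
```

Because the test conditions on paired disagreements rather than raw accuracies, it can detect significant differences even when two models' overall accuracy figures are close.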

Reproducibility: Fixed random seeds were used for all stochastic processes, and all hyperparameters and configuration settings were documented. Model checkpoints were saved in Keras format, preserving both architecture and weights for reloading and deployment.

TL;DR: All models trained with Adam optimizer (lr = 1e-4), batch size 32, max 50 epochs, binary cross-entropy loss. Early stopping (patience 10), learning rate reduction (factor 0.2, patience 5), and model checkpointing were used. McNemar's test validated statistical significance of performance differences.
Pages 6-8
Classification Accuracy, AUC-ROC, and Statistical Significance

Accuracy rankings: DenseNet121 achieved the highest classification accuracy at 92.30%, followed closely by MobileNetV2 at 92.19%, ResNet50V2 at 91.85%, and NASNetMobile at 90.94%. All models maintained high precision while achieving strong recall rates, with DenseNet121 showing the most balanced performance across all evaluation criteria. The precision-recall trade-off analysis confirmed that no model sacrificed one metric heavily in favor of the other.

AUC-ROC analysis: ResNet50V2 achieved the highest AUC score of 0.957, demonstrating superior discriminative ability across various classification thresholds. DenseNet121 followed with an AUC of 0.951, while MobileNetV2 and NASNetMobile achieved 0.943 and 0.935, respectively. The ROC curves showed particularly strong performance in the critical low false-positive rate region (0.1 to 0.3), where ResNet50V2 and DenseNet121 demonstrated marginally better discrimination. This is clinically significant because minimizing false positives while maintaining high sensitivity reduces unnecessary biopsies and patient anxiety.
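AUC-ROC has a useful probabilistic reading: it is the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one (the Mann-Whitney formulation). A stdlib sketch with illustrative scores:

```python
def auc_roc(labels, scores):
    """AUC via the Mann-Whitney U statistic; ties count as half a win.

    O(P*N) pairwise comparison -- fine for a sketch, not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores: higher = more likely malignant (label 1).
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = auc_roc(labels, scores)
```

This threshold-free view is why a model can lead on AUC (ResNet50V2 here) while another leads on accuracy at the single operating threshold used for classification (DenseNet121).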

Statistical significance: McNemar's test confirmed significant performance differences among all model pairs (p < 0.0001). The strongest statistical differences were observed between NASNetMobile and DenseNet121 (test statistic: 10,076.50), followed by NASNetMobile and MobileNetV2 (test statistic: 5,950.10). The more moderate difference between ResNet50V2 and MobileNetV2 (test statistic: 132.67) indicates closer alignment in their learning strategies. All training processes demonstrated stable convergence, and early stopping effectively prevented overfitting.

TL;DR: DenseNet121 led in accuracy (92.30%), ResNet50V2 led in AUC (0.957). MobileNetV2 was close behind at 92.19% accuracy and 0.943 AUC. All pairwise differences were statistically significant (p < 0.0001). The largest gap was between NASNetMobile and DenseNet121 (McNemar statistic: 10,076.50).
Pages 7-9
Model Size, Inference Time, and Deployment Implications

Inference time comparison: MobileNetV2 was the fastest at 23.46 ms per image, followed by ResNet50V2 at 26.55 ms, DenseNet121 at 57.89 ms, and NASNetMobile at 108.67 ms. The nearly 5x difference between MobileNetV2 and NASNetMobile is striking and has direct implications for real-time clinical applications. MobileNetV2's speed, combined with its competitive accuracy (92.19%), makes it particularly well suited for point-of-care devices and mobile diagnostics where real-time processing is essential.
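Per-image latency figures like these are typically obtained by averaging wall-clock time over repeated batches after a few warm-up runs; a stdlib sketch (the predict function below is a stand-in, not the paper's model):

```python
import statistics
import time

def time_per_image(predict, batch, n_runs=20, warmup=3):
    """Mean and stdev of per-image latency in milliseconds."""
    for _ in range(warmup):        # warm-up runs are excluded from timing
        predict(batch)
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(batch)
        times.append((time.perf_counter() - start) * 1000 / len(batch))
    return statistics.mean(times), statistics.stdev(times)

# Stand-in predictor returning a constant score per image.
mean_ms, std_ms = time_per_image(lambda b: [0.5] * len(b), list(range(32)))
```

Warm-up runs matter in practice because the first few GPU inferences include kernel compilation and memory allocation overhead that would inflate the average.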

Model size trade-offs: MobileNetV2 also had the smallest footprint at 9.89 MB, compared to NASNetMobile (17.34 MB), DenseNet121 (27.86 MB), and ResNet50V2 (91.93 MB). Notably, NASNetMobile's compact size (17.34 MB) did not translate to fast inference, as its 108.67 ms processing time was the slowest by a wide margin. This demonstrates that architectural complexity does not always correlate with model size or speed. ResNet50V2, despite being the largest model at 91.93 MB, achieved the second-fastest inference time (26.55 ms), suggesting that its residual learning framework is computationally efficient at runtime.

Deployment considerations: For primary care or screening settings where throughput matters, MobileNetV2 offers the best combination of accuracy, speed, and small footprint. For applications where maximizing discriminative power is the priority, such as specialist referral workflows, ResNet50V2's highest AUC (0.957) may justify its larger storage requirement. DenseNet121 represents a middle ground, delivering the best accuracy (92.30%) at a moderate model size (27.86 MB), though its inference time (57.89 ms) is roughly 2.5x slower than MobileNetV2.

TL;DR: MobileNetV2 was fastest (23.46 ms) and smallest (9.89 MB) while maintaining 92.19% accuracy. ResNet50V2 was second-fastest (26.55 ms) despite being the largest (91.93 MB). NASNetMobile was the slowest (108.67 ms) despite being relatively compact (17.34 MB), showing that model size alone does not predict inference speed.
Pages 9-10
Key Limitations: Resolution, Scope, and Missing Clinical Context

Fixed image resolution: All models used a static input size of 224 x 224 pixels, which may not be optimal for capturing all relevant diagnostic features in skin lesions of varying sizes. Higher-resolution images could preserve critical diagnostic details, particularly for distinguishing subtle features in atypical (dysplastic) nevi and other challenging cases. Clinical dermoscopy images are typically captured at much higher resolutions, meaning downsampling could discard diagnostically important information.

Binary classification scope: The study focused exclusively on differentiating benign melanocytic nevi from melanomas, and did not address the more complex challenge of distinguishing atypical nevi or non-nevus pigmented neoplasms (such as dermatofibromas, lentigines, and seborrheic keratoses) from melanomas. In real-world clinical practice, dermatologists must differentiate among a much broader range of lesion types, so a binary classifier has limited standalone clinical utility without expansion to multi-class frameworks.

No pretest probability integration: The algorithm does not incorporate pretest probability into its diagnostic framework. Clinical decision-making is heavily influenced by factors such as personal and family history of melanoma, patient age, lesion location, and prior occurrences. Without this contextual information, the model treats all patients as having equal baseline risk, which does not reflect real clinical practice. Integrating patient metadata could improve performance by addressing the variability in pretest probabilities across individual cases.

Single dataset and environment: All experiments were conducted on a single dataset (DermNet) and evaluated in a controlled computational environment (Google Colab). The generalizability of these results to other datasets, image capture devices, skin tones, and clinical populations has not been tested. External validation on independent, multi-center datasets would be necessary before any clinical deployment.

TL;DR: Key limitations include fixed 224 x 224 pixel input (potentially losing fine details), binary-only classification (no atypical nevi or non-melanoma pigmented lesions), no integration of patient history or pretest probability, and single-dataset evaluation without external validation on diverse populations.
Pages 10-12
Ensemble Models, Vision Transformers, and Clinical Integration

Ensemble and hybrid approaches: The authors propose combining multiple architectures to leverage their complementary strengths. For example, integrating DenseNet121's robust feature extraction with MobileNetV2's efficiency could improve both diagnostic performance and deployment feasibility. Ghosh et al. (2024) demonstrated that majority voting ensembles of diverse deep learning models effectively reduce variance and bias, achieving balanced sensitivity and specificity. Hybrid models incorporating traditional handcrafted features (lesion texture, color histograms) alongside CNN-derived embeddings have reached accuracies up to 99% in some studies.

Vision transformers and alternative architectures: Zhang et al. (2023) showed that Vision Transformers (ViTs), which use self-attention mechanisms, can outperform traditional CNNs in multi-class skin lesion classification by capturing global contextual relationships across entire images. EfficientNet's compound scaling approach, as described by Tan and Le (2019), offers another avenue for increasing accuracy while maintaining computational efficiency. Few-shot learning techniques could enable rapid adaptation to rare skin conditions underrepresented in existing datasets.

Higher resolution and metadata integration: Future work should explore higher-resolution inputs (such as 1000 x 1000 pixels) that better reflect real-world clinical imaging. Combining high-resolution inputs with adaptive preprocessing pipelines could optimize image quality without sacrificing computational feasibility. Integrating patient metadata, including age, lesion location, clinical history, and family history of melanoma, would enable more personalized and context-aware predictions, particularly for high-risk individuals.

Interpretability and clinical deployment: Advanced visualization techniques such as heatmaps and saliency maps could highlight regions of interest within images, giving clinicians insight into model reasoning. This transparency is essential for building trust in AI-driven diagnostics. The authors emphasize that these systems should function as assistive tools, not replacements for clinical expertise, and that robust validation, clinician-focused training, and explainability mechanisms are prerequisites for successful integration into clinical workflows.

TL;DR: Future directions include ensemble methods combining DenseNet121 and MobileNetV2 strengths, Vision Transformers for global context, higher-resolution inputs (1000 x 1000 pixels), patient metadata integration, and saliency maps for interpretability. Hybrid models have achieved up to 99% accuracy in related work.
Citation: Kreouzi M, Theodorakis N, Feretzakis G, et al. Open Access, 2024. Available at: PMC11718884. DOI: 10.3390/cancers17010028. License: CC BY.