Skin cancer is one of the most common cancers worldwide, and melanoma remains the most lethal subtype. Deep learning models built on Convolutional Neural Networks (CNNs) have shown remarkable potential for classifying skin lesions from dermoscopy images. Architectures like VGG16, ResNet, Inception, and DenseNet have individually achieved dermatologist-level accuracy on curated datasets. However, two persistent problems block clinical translation: limited cross-dataset generalization due to domain shift (differences in imaging devices, clinical centers, and patient populations), and the lack of interpretability that clinicians need to trust and adopt AI tools.
Most existing approaches rely on a single CNN architecture or a simple configuration of pretrained networks. While these can score well on a specific test set, they often struggle when confronted with variations in lesion morphology, image quality, and acquisition settings. Ensemble learning, which combines multiple models to exploit each one's strengths, is a natural solution. Yet ensemble strategies had not been thoroughly explored for skin cancer detection at the time of this study.
The EnsembleSkinNet proposal: The authors introduce an ensemble deep learning framework that fuses four pretrained CNN architectures (Modified VGG16, ResNet50, Inception V3, and DenseNet201) through a softmax-weighted averaging mechanism. The framework also integrates Gradient-weighted Class Activation Mapping (Grad-CAM) for visual explainability and uses Bayesian hyperparameter optimization for tuning. The goal is to deliver a system that generalizes across acquisition conditions, offers interpretable rationales for its predictions, and remains practical for clinical or teledermatology use.
Key results at a glance: On the HAM10000 dataset with five-fold cross-validation, EnsembleSkinNet achieved an accuracy of 98.32 +/- 0.41%, precision of 98.20 +/- 0.35%, recall of 98.10 +/- 0.38%, and F1-score of 98.15 +/- 0.37%. External validation on the ISIC 2020 dataset without any retraining yielded 96.84 +/- 0.42% accuracy and an AUC of 0.983. Grad-CAM explainability analysis achieved a mean Explainability Accuracy of 93.6% with a Cohen's kappa of 0.87 against dermatologist annotations.
Dataset: The HAM10000 dataset contains 10,015 dermoscopic images spanning seven lesion classes: melanocytic nevi (NV, 6705 images), melanoma (MEL, 1113), benign keratosis (BKL, 1099), basal cell carcinoma (BCC, 514), actinic keratosis (AKIEC, 327), vascular lesions (VASC, 142), and dermatofibroma (DF, 115). The severe class imbalance (NV dominates at 67% of the dataset) posed a significant challenge for training.
Preprocessing pipeline: Before any augmentation, patient-level and lesion-level stratified splitting was applied using the metadata fields (lesion_id, image_id) to ensure all images from a given lesion or patient went entirely into one subset, preventing data leakage. Near-duplicate images were then filtered using SSIM (Structural Similarity Index) with a threshold of 0.95. Images were resized to 224 x 224 pixels and normalized to the [0, 1] range. The dataset was split into 70% training, 15% validation, and 15% testing subsets.
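The leakage-free splitting step can be sketched as follows. This is a minimal illustration, not the authors' code: `lesion_level_split` is a hypothetical helper, it groups images by the HAM10000 `lesion_id` metadata field so no lesion straddles two subsets, and it omits the class-stratification and SSIM deduplication steps for brevity.

```python
import random
from collections import defaultdict

def lesion_level_split(records, train=0.70, val=0.15, seed=42):
    """Split image records so all images of one lesion land in one subset.

    `records` is a list of dicts with 'lesion_id' and 'image_id' keys
    (the HAM10000 metadata fields). Splitting on lesion_id prevents
    near-identical images of the same lesion from leaking across the
    train/validation/test boundaries.
    """
    by_lesion = defaultdict(list)
    for r in records:
        by_lesion[r["lesion_id"]].append(r["image_id"])
    lesions = sorted(by_lesion)
    random.Random(seed).shuffle(lesions)
    n = len(lesions)
    n_train, n_val = int(n * train), int(n * val)
    groups = {
        "train": lesions[:n_train],
        "val":   lesions[n_train:n_train + n_val],
        "test":  lesions[n_train + n_val:],
    }
    return {split: [img for lid in ls for img in by_lesion[lid]]
            for split, ls in groups.items()}
```

Splitting on lesion identity rather than image identity is what makes the subsequent 70/15/15 proportions approximate: subset sizes are exact in lesions, not in images.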
Data augmentation: Targeted augmentation (random rotations of +/- 25 degrees, horizontal and vertical flips, zoom of 0.9-1.1x, and brightness adjustments of +/- 10%) was applied only to the training set after splitting, specifically to minority classes to achieve approximate balance at roughly 2,000 samples per class. This ensured validation and test sets remained untouched.
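A minimal NumPy sketch of the minority-class oversampling loop is shown below. It implements only the flips and brightness jitter from the recipe above (rotation and zoom are omitted for brevity), and the function names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """One random augmentation pass over a [0, 1] float image (H, W, C)."""
    if rng.random() < 0.5:                 # horizontal flip
        img = img[:, ::-1, :]
    if rng.random() < 0.5:                 # vertical flip
        img = img[::-1, :, :]
    img = img * rng.uniform(0.9, 1.1)      # brightness +/- 10%
    return np.clip(img, 0.0, 1.0)

def oversample(images, target=2000):
    """Augment a minority class up to roughly `target` training samples."""
    out = list(images)
    while len(out) < target:
        out.append(augment(images[rng.integers(len(images))]))
    return out
```

Because `oversample` is applied only to training-set images after the split, the validation and test distributions stay untouched, as the paper requires.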
Architecture design: Each of the four pretrained models (M-VGG16, ResNet50, Inception V3, DenseNet201) was initialized with ImageNet weights. The initial convolutional layers were frozen to preserve general feature extraction capabilities. A custom classifier head was appended to each model, consisting of a Global Average Pooling (GAP) layer, a dense layer with ReLU activation, a dropout layer (rate = 0.5) for regularization, and a final softmax output layer for the seven-class classification task. The Modified VGG16 added batch normalization after each convolutional block and replaced fully connected layers with GAP, totaling approximately 18.9M parameters (4.1 GFLOPs per image).
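The inference-time computation of the custom head (GAP, then dense + ReLU, then softmax; dropout is a no-op at inference) can be written out in plain NumPy. The weight matrices here are stand-ins for trained parameters, and the function name is illustrative:

```python
import numpy as np

def head_forward(feature_maps, W1, b1, W2, b2):
    """Inference pass through the custom classifier head.

    feature_maps: (H, W, K) backbone output. GAP collapses the spatial
    dimensions, a dense ReLU layer follows, and softmax yields the
    seven-class probabilities. Dropout (rate 0.5) is active only during
    training, so it does not appear here.
    """
    gap = feature_maps.mean(axis=(0, 1))      # Global Average Pooling -> (K,)
    h = np.maximum(0.0, gap @ W1 + b1)        # dense layer with ReLU
    logits = h @ W2 + b2                      # (7,) class scores
    z = logits - logits.max()                 # numerically stable softmax
    p = np.exp(z)
    return p / p.sum()
```

Replacing VGG16's fully connected layers with GAP is also what keeps the Modified VGG16 at roughly 18.9M parameters: GAP has no weights, whereas the original FC layers account for most of VGG16's ~138M.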
Transfer learning and fine-tuning: The training strategy used a two-phase approach. In the first phase, the early convolutional layers were frozen to retain general feature extraction capabilities (edge detection, texture recognition) while training only the custom classifier head. In the second phase, the deeper layers were progressively unfrozen to allow the models to adapt to the specific patterns in dermoscopic images. A two-stage learning rate schedule was used (10^-4 then 10^-5) with the Adam optimizer (beta1 = 0.9, beta2 = 0.999), batch size of 32, and early stopping on validation loss for up to 50 epochs.
Ensemble integration: The core novelty lies in the softmax-weighted averaging mechanism for combining predictions from all four models. Each model's contribution to the final prediction was weighted by its validation accuracy, normalized through a softmax function, so that better-performing models had proportionally more influence on the final classification decision. The final ensemble prediction was computed as P(y|x_i) = sum_{m=1}^{4} w_m * P_m(y|x_i), where w_m is the softmax-normalized weight assigned to model m and P_m(y|x_i) is model m's predicted probability for class y.
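The fusion rule is a few lines of NumPy. This sketch assumes, per the description above, that the raw validation accuracies are fed directly into the softmax; the paper does not state any rescaling:

```python
import numpy as np

def softmax(v):
    z = np.exp(v - np.max(v))
    return z / z.sum()

def ensemble_predict(member_probs, val_accuracies):
    """Softmax-weighted averaging of member predictions.

    member_probs: (M, C) per-model class probabilities for one image.
    val_accuracies: (M,) validation accuracy of each model, passed
    through a softmax so stronger models get proportionally larger
    weights w_m, with sum(w_m) = 1.
    """
    w = softmax(np.asarray(val_accuracies))      # (M,) fusion weights
    return w @ np.asarray(member_probs)          # (C,) ensemble P(y|x)
```

Note one consequence of softmaxing accuracies that differ by only a few percent: the weights stay close to uniform, so the mechanism behaves like a gently tilted average rather than a hard model selection.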
Loss and rebalancing strategy: To further address class imbalance, the framework combined three mechanisms. First, class-weighted sampling assigned each lesion category a weight inversely proportional to its frequency, w_i = N / (C * n_i), where N is the total number of samples, C the number of classes, and n_i the sample count of class i. Second, focal loss (with gamma = 2.0) suppressed the influence of well-classified majority samples and increased gradient flow for difficult or rare lesions. Third, these were combined within the categorical cross-entropy framework as the overall training objective.
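A minimal sketch of the combined objective, assuming the class weights multiply the focal term per sample (the exact composition is not spelled out in the paper):

```python
import numpy as np

def class_weights(counts):
    """w_i = N / (C * n_i): inverse-frequency weight for each class."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_focal_loss(probs, labels, weights, gamma=2.0, eps=1e-12):
    """Class-weighted focal loss averaged over a batch.

    probs: (B, C) softmax outputs; labels: (B,) integer class indices.
    The (1 - p_t)^gamma factor down-weights easy, well-classified
    samples, while the class weight rescales rare lesion categories.
    """
    p_t = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    w = weights[labels]
    return float(np.mean(-w * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With gamma = 0 and uniform weights this reduces to ordinary categorical cross-entropy, which is why the paper can describe focal loss as operating "within the categorical cross-entropy framework".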
Hyperparameter tuning: Bayesian optimization was used instead of grid or random search for tuning learning rate, batch size, dropout rate, and the number of trainable layers during fine-tuning. This approach modeled the performance function probabilistically and refined the search space based on prior evaluations, finding optimal configurations more efficiently. A five-fold cross-validation strategy was embedded within the optimization to ensure hyperparameters generalized across the full dataset.
Overall performance: EnsembleSkinNet achieved an accuracy of 98.32 +/- 0.41%, macro recall of 97.52 +/- 0.39%, macro F1 of 97.80 +/- 0.37%, and balanced accuracy of 97.67 +/- 0.40% on HAM10000. These metrics were computed as mean +/- standard deviation over five-fold cross-validation repeated with five independent random seeds (25 observations per model), ensuring statistical stability.
Per-class performance: After targeted augmentation, per-class precision/recall/F1 values were highly uniform. Melanocytic nevi achieved 98.5/98.1/98.3%, melanoma reached 97.9/97.2/97.5%, benign keratosis hit 98.1/97.8/97.9%, basal cell carcinoma scored 98.8/98.4/98.6%, actinic keratosis achieved 97.4/96.8/97.1%, dermatofibroma reached 96.7/96.3/96.5%, and vascular lesions scored 98.6/98.0/98.3%. The macro averages were 98.0 +/- 0.5% precision, 97.5 +/- 0.6% recall, and 97.8 +/- 0.5% F1-score.
Fusion strategy ablation: Three ensemble fusion approaches were tested. Softmax-weighted averaging achieved 98.32% accuracy with the lowest variance. A stacking meta-learner (shallow MLP trained on validation folds) reached 98.10% accuracy and slightly improved minority-class recall but added complexity. Logit-level blending (ridge-regularized linear regression over model logits) scored 98.05% and offered no advantage over weighted averaging.
Individual model contributions: Removing each backbone one at a time from the full ensemble confirmed that every component contributed. Removing DenseNet201 caused the largest drop (-0.74% accuracy), followed by ResNet50 (-0.48%), Inception V3 (-0.37%), and M-VGG16 (-0.31%). Ensemble size sensitivity tests showed that a 2-model ensemble (DenseNet201 + ResNet50) scored 97.45%, a 3-model ensemble added Inception V3 for 97.95%, and the full 4-model ensemble achieved 98.32%.
Individual pretrained model comparison: When trained independently on HAM10000, the four backbone models achieved the following accuracies: DenseNet201 at 97.05%, ResNet50 at 96.12%, Inception V3 at 95.34%, and M-VGG16 at 94.78%. EnsembleSkinNet's 98.32% accuracy represented a 1.27% improvement over the best single model (DenseNet201) and a 3.54% improvement over the weakest (M-VGG16). Precision, recall, and F1-score followed the same pattern, with EnsembleSkinNet leading across all metrics.
Modern baseline comparison: To ensure fair evaluation, three contemporary architectures were trained with the same data splits, augmentations, learning rates, and optimizer settings: ConvNeXt-Tiny (2022) achieved 96.92 +/- 0.50% accuracy, EfficientNet-V2-S (2021) reached 96.62 +/- 0.53%, and ViT-B/16 (2021) scored 96.02 +/- 0.60%. EnsembleSkinNet outperformed all three by 1.4 to 2.3% in absolute accuracy, highlighting the benefit of cross-architecture ensembling for complex dermatological tasks.
Comparison with published methods: EnsembleSkinNet surpassed numerous recent methods in the literature. Zhao et al. achieved 94.20% accuracy, Sharma et al. reached 93.10%, Tlaisun et al. scored 92.30%, Imran et al. hit 92.10%, Qureshi and Roos obtained 91.90%, Saba et al. reached 92.40%, and Demir et al. scored 91.60%. EnsembleSkinNet's 98.32% accuracy exceeded the best published baseline by over 4 percentage points.
Statistical significance: The Friedman test across all models yielded chi-squared = 19.42 with p = 0.0006, rejecting the null hypothesis that all models perform equally. Post-hoc Wilcoxon signed-rank tests confirmed that EnsembleSkinNet significantly outperformed each baseline individually (all p-values below 0.001). Bootstrap 95% confidence intervals for EnsembleSkinNet were narrow: accuracy CI [97.88, 98.74] and F1-score CI [97.74, 98.56].
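The omnibus-then-post-hoc protocol can be reproduced with SciPy. The per-fold accuracy values below are synthetic placeholders (generated around the reported means) purely to show the mechanics; they are not the paper's measurements:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Synthetic per-fold accuracies: 25 observations per model
# (5 folds x 5 seeds), centered on the reported means.
rng = np.random.default_rng(0)
ensemble = 0.983 + rng.normal(0, 0.004, 25)
densenet = 0.970 + rng.normal(0, 0.005, 25)
resnet   = 0.961 + rng.normal(0, 0.005, 25)

# Omnibus Friedman test of the null "all models perform equally".
stat, p = friedmanchisquare(ensemble, densenet, resnet)
if p < 0.05:
    # Post-hoc paired Wilcoxon signed-rank tests against each baseline.
    for name, base in [("DenseNet201", densenet), ("ResNet50", resnet)]:
        _, p_pair = wilcoxon(ensemble, base)
        print(f"ensemble vs {name}: p = {p_pair:.4g}")
```

In practice the post-hoc p-values would also be corrected for multiple comparisons (e.g. Holm or Bonferroni), a detail the summary does not mention.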
Grad-CAM explainability: To address the "black box" problem of deep learning, the framework integrated Gradient-weighted Class Activation Mapping (Grad-CAM). This technique generates heatmaps highlighting the image regions most influential for the model's prediction by computing importance scores from the gradients flowing into the final convolutional layers. EnsembleSkinNet achieved the highest explainability accuracy at 96.5%, meaning that in 96.5% of analyzed cases the Grad-CAM heatmaps correctly highlighted the lesion regions. Only 3.5% of heatmaps were misfocused on irrelevant areas.
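The heatmap computation itself is compact. Given the final-layer activations and the gradients of the class score with respect to them (which a framework like TensorFlow or PyTorch would supply), Grad-CAM reduces to a weighted sum and a ReLU; this NumPy sketch shows just that step:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from the final convolutional layer.

    activations: (H, W, K) feature maps A_k.
    gradients:   (H, W, K) d(class score)/dA_k.
    The channel importance alpha_k is the spatial mean of the
    gradients; the map is ReLU(sum_k alpha_k * A_k), rescaled to [0, 1].
    """
    alpha = gradients.mean(axis=(0, 1))          # (K,) importance per channel
    cam = np.maximum(0.0, activations @ alpha)   # (H, W), ReLU keeps positive evidence
    if cam.max() > 0:
        cam /= cam.max()                         # normalize to [0, 1]
    return cam
```

The resulting low-resolution map is then upsampled to the 224 x 224 input size and overlaid on the dermoscopy image for inspection.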
Individual model explainability: For comparison, DenseNet201 achieved 92.3% explainability accuracy, ResNet50 reached 89.5%, Inception V3 scored 87.8%, and M-VGG16 had 86.2%. The ensemble's superiority in explainability stems from fusing diverse architectural perspectives, producing more precise localization of diagnostically relevant regions. A quantitative Explainability Accuracy metric was defined as the fraction of samples where Grad-CAM activations overlapped with dermatologist-annotated lesion masks by at least 70% (IoU >= 0.7), yielding a mean of 93.6% with inter-rater reliability of kappa = 0.87.
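The Explainability Accuracy metric defined above can be sketched directly. One detail is an assumption here: the threshold used to binarize the continuous heatmap before computing IoU (`act_thresh` below) is not stated in the paper.

```python
import numpy as np

def explainability_accuracy(heatmaps, masks, iou_thresh=0.7, act_thresh=0.5):
    """Fraction of samples whose binarized Grad-CAM heatmap overlaps the
    dermatologist-annotated lesion mask with IoU >= iou_thresh.

    heatmaps: iterable of (H, W) float maps in [0, 1].
    masks:    iterable of (H, W) boolean lesion masks.
    act_thresh is the heatmap binarization cutoff (assumed, not from
    the paper).
    """
    hits = 0
    for hm, mask in zip(heatmaps, masks):
        pred = hm >= act_thresh
        inter = np.logical_and(pred, mask).sum()
        union = np.logical_or(pred, mask).sum()
        iou = inter / union if union else 0.0
        hits += iou >= iou_thresh
    return hits / len(masks)
```

Averaging this hit rate over the annotated evaluation set yields the reported 93.6%, and agreement between the annotating dermatologists is what the kappa = 0.87 quantifies.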
External validation on ISIC 2020: The ISIC 2020 Challenge dataset (2,357 dermoscopic images from different institutions and imaging devices) was used for external validation. Crucially, no retraining or fine-tuning was performed. EnsembleSkinNet achieved 96.84 +/- 0.42% accuracy and AUC of 0.983 +/- 0.005. The best single-model baselines were DenseNet201 (94.75 +/- 0.58% accuracy, AUC 0.964) and ResNet50 (93.92 +/- 0.63% accuracy, AUC 0.958). The roughly 1.5% accuracy drop from HAM10000 was deemed minor given the inherent domain shift.
The cross-dataset results demonstrate that the softmax-weighted fusion of diverse architectures extracts domain-invariant lesion representations. The ensemble compensated for distributional shifts in illumination, resolution, and lesion demographics more effectively than any single backbone. This cross-institutional performance is a critical prerequisite for regulatory acceptance and real-world teledermatology deployment.
Computational complexity: The full EnsembleSkinNet ensemble totals 186.2M parameters with approximately 16.4 GFLOPs total (4.1 GFLOPs per backbone per image). Training time was 182.4 seconds per epoch, and inference latency was roughly 64.8 ms per image on an NVIDIA RTX 3090 GPU. By comparison, individual models ranged from 17.8M to 24.0M parameters with inference latencies of 62.5 to 68.4 ms per image. While the ensemble's throughput is sufficient for offline or asynchronous clinical workflows, it is too heavy for real-time mobile or edge deployment.
Knowledge distillation: To address this, the authors performed teacher-student distillation using MobileNet-V3 Small as the student model. The distillation loss combined hard cross-entropy supervision with soft label alignment from the ensemble teacher, using a temperature of T = 3 and a blending coefficient of alpha = 0.7. The distilled student achieved 96.97 +/- 0.44% accuracy, 96.88 +/- 0.40% F1-score, and AUC of 0.974, with only 11.9M parameters and 0.95 GFLOPs.
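The distillation objective with the stated T = 3 and alpha = 0.7 can be written out as follows. This is a standard Hinton-style formulation; which term alpha weights, and the T^2 scaling of the soft term, are conventional assumptions rather than details confirmed by the summary:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      T=3.0, alpha=0.7, eps=1e-12):
    """KD objective: alpha * soft-label KL (at temperature T)
    + (1 - alpha) * hard cross-entropy on the true labels.

    The T^2 factor keeps soft-target gradients on the same scale as
    the hard loss as T grows (standard in distillation).
    """
    p_t = softmax(teacher_logits / T)            # softened teacher targets
    p_s = softmax(student_logits / T)            # softened student predictions
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + eps)
    return float(np.mean(alpha * (T * T) * kl + (1 - alpha) * ce))
```

The softened teacher distribution is what carries the ensemble's "dark knowledge" about inter-class similarity (e.g. melanoma vs. benign keratosis confusions) into the much smaller MobileNet-V3 student.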
Practical deployment gains: The MobileNet-V3 student delivered 4x lower latency (15.6 ms per image versus 64.8 ms), 15x smaller parameter count, and retained approximately 97% of the teacher's accuracy. This makes deployment feasible on edge GPUs, embedded AI systems (such as NVIDIA Jetson and Coral TPU), and standard hospital workstations. The accuracy-latency tradeoff is modest: a 1.35% accuracy reduction yields a dramatic reduction in computational requirements.
Future compression plans include INT8 and FP16 model quantization, edge AI optimization via TensorRT and ONNX Runtime, and federated deployment pipelines for multi-center learning. These additional steps would further reduce the gap between research-grade and clinically deployable models.
Demographic and dataset bias: The HAM10000 dataset primarily contains images from individuals with fair skin, making it unrepresentative of diverse skin tones and ethnicities. This demographic bias may undermine generalization to underrepresented populations. While data augmentation helps expand training diversity synthetically, it cannot substitute for genuinely diverse, real-world clinical data. The cross-dataset validation (HAM10000 to ISIC 2020) is an initial step, but broader multi-institutional and multi-device testing remains necessary.
Clinical deployment barriers: Translating the framework into clinical practice will require addressing variability across dermatoscopic devices, illumination conditions, and regulatory approvals. The study was conducted in a research environment, and real-world clinical validation has not yet been performed. Additionally, the computational demands of the full ensemble may limit adoption in resource-constrained clinical settings, though the distilled MobileNet-V3 student partially addresses this.
Explainability gaps: Although Grad-CAM visualizations provided compelling qualitative and quantitative evidence of model attention alignment (93.6% Explainability Accuracy, kappa = 0.87), the study did not employ standardized quantitative interpretability benchmarks commonly used in the literature. Formal human-AI evaluation protocols and established interpretability assessment standards were not applied, which limits comparison with other explainability methods.
Future directions: The authors plan to incorporate datasets from wider demographic and geographic contexts, use bias-aware learning methods to equalize performance across subpopulations, and integrate clinical metadata (patient age, lesion site, medical history) for richer diagnostic context. They also intend to expand the ensemble strategy with attention-based or transformer-based fusion mechanisms, extend the framework to federated learning for privacy-preserving multi-center collaboration, and explore domain-adapted pretraining on large-scale dermatology datasets like ISIC and Derm7pt. Comparison with classical ML baselines (SVM, Random Forest) using CNN-extracted features is also planned for future work.