Two-Stage Deep Neural Network via Ensemble Learning for Melanoma Classification

Frontiers in Bioengineering and Biotechnology, 2022

Plain-English Explanations
Pages 1-2
Why Automated Melanoma Classification Matters

Skin cancer is a massive public health burden, with over 5 million new cases diagnosed annually in the United States alone. Melanoma, while less common than other skin cancers, is the deadliest form and the one whose incidence is rising fastest. Crucially, melanoma progresses relatively slowly in its early stages, so early diagnosis and prompt treatment can significantly improve patient survival rates. The challenge lies in making that early diagnosis reliably and consistently.

Dermoscopy imaging: Dermoscopy is a non-invasive imaging technique that magnifies and illuminates the skin while eliminating surface reflections, revealing subsurface structures that are invisible to the naked eye. Dermatologists traditionally evaluate dermoscopy images using the "ABCD" rule, which assesses asymmetry, boundary irregularities, color variations, and structural features of lesions. However, this process is time-consuming, subjective, and heavily dependent on the clinician's experience level. Inexperienced dermatologists frequently make diagnostic errors.

Core challenges in automation: Automated melanoma recognition faces three major obstacles. First, skin lesions exhibit high inter-class similarity and intra-class variation in color, shape, and texture, meaning different disease categories can look nearly identical while the same disease can appear quite different across patients. Second, lesion areas vary greatly in size, and boundaries between diseased and healthy skin are often blurred. Third, artifacts such as hair, rulers, and skin texture in dermoscopy images add noise that confuses classification algorithms.

This study, published in Frontiers in Bioengineering and Biotechnology in January 2022, proposes a two-stage deep neural network framework that uses ensemble learning to address these challenges. The authors combine U-Net-based lesion segmentation with five state-of-the-art classification networks, all integrated through a novel locally connected ensemble network. They evaluate on the ISIC 2017 challenge dataset and report an accuracy of 0.909 and an AUC of 0.911.

TL;DR: Melanoma is the deadliest skin cancer but is treatable when caught early. This paper proposes a two-stage pipeline combining U-Net segmentation with five deep learning classifiers and a novel ensemble network, achieving 0.909 accuracy and 0.911 AUC on the ISIC 2017 dataset.
Pages 2-3
From Handcrafted Features to Deep Learning Ensembles

Traditional methods: Early melanoma classification relied on manually extracted features such as shape, color, and texture descriptors. Barata et al. (2013) compared global and local feature approaches using a bag-of-features (BoF) classifier. Ganster et al. (2001) used hand-designed shape, boundary, and radiometric features with K-Nearest Neighbor (KNN) classification. Celebi et al. (2007) combined shape, color, and texture descriptors with non-linear support vector machines. These shallow models captured low-level patterns but lacked the high-level representations and generalization capability needed for reliable clinical use.

Deep CNN approaches: The rise of deep convolutional neural networks (DCNNs) brought significant improvements. Yu et al. (2016) combined DCNN with residual learning for joint segmentation and melanoma identification. Gonzalez-Diaz (2019) developed DermaKNet, a four-part CAD system using a Lesion Segmentation Network and a ResNet50-based classifier. Xie et al. (2020) proposed MB-DCNN, which performed simultaneous segmentation and classification with a novel rank loss to address class imbalance. Gessert et al. (2020) introduced patch-based attention for high-resolution dermoscopic images with a new weighting loss.

Attention mechanisms and data augmentation: Datta et al. (2021) explored Soft-Attention mechanisms across VGG, ResNet, Inception-ResNet-v2, and DenseNet architectures, demonstrating that attention improved baseline classification performance. Zunair and Hamza (2020) addressed class imbalance through conditional image synthesis, learning inter-class mappings to generate samples of underrepresented classes via unpaired image-to-image translation. Bdair et al. (2021) proposed FedPerl, a semi-supervised federated learning approach with peer anonymization for collaborative melanoma classification.

The common limitation across prior deep learning approaches is their reliance on a single network architecture. A single model can only extract features at specific scales and types, constraining overall classification performance. This motivates the authors' ensemble approach, which combines multiple complementary architectures to capture a richer set of discriminative features.

TL;DR: Previous work evolved from handcrafted features (KNN, SVM) to deep CNNs (ResNet50, DermaKNet, MB-DCNN) and attention mechanisms. Single-network approaches remain limited in feature diversity, motivating the ensemble strategy proposed in this paper.
Pages 3-5
U-Net Segmentation and Lesion-Focused Preprocessing

Data augmentation: Medical image datasets are typically small, which risks overfitting deep models. The ISIC 2017 training set contains only 2,000 images. To mitigate this, the authors apply multiple augmentation strategies: 180-degree rotation, horizontal and vertical flipping, and shifting the image height and width by 10%. Each original image generates five new samples, effectively expanding the training data six-fold.
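The five-per-image recipe can be sketched with plain NumPy. This is a minimal illustration; the paper does not specify implementation details such as the fill mode used for shifts (wrap-around below is a stand-in, where real pipelines typically pad or reflect):

```python
import numpy as np

def augment(img, shift_frac=0.10):
    """Return five augmented copies of one (H, W, C) image, following the
    recipe described above: 180-degree rotation, horizontal and vertical
    flips, and 10% shifts along height and width."""
    h_shift = int(round(img.shape[0] * shift_frac))
    w_shift = int(round(img.shape[1] * shift_frac))
    return [
        np.rot90(img, k=2),              # 180-degree rotation
        img[:, ::-1],                    # horizontal flip
        img[::-1, :],                    # vertical flip
        np.roll(img, h_shift, axis=0),   # 10% height shift (wrap-around)
        np.roll(img, w_shift, axis=1),   # 10% width shift (wrap-around)
    ]
```

Each original image thus yields five new samples, for six-fold training data overall.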

U-Net architecture: For lesion segmentation, the authors employ U-Net, an end-to-end deep convolutional neural network with no fully connected layers. U-Net's encoder consists of four blocks, each containing two 3x3 convolution layers with ReLU activation followed by a max-pooling layer with stride 2. The decoder mirrors this with four blocks, each containing a deconvolution layer that doubles the feature map size, followed by two 3x3 convolution layers. Critically, U-Net uses skip connections that concatenate encoder feature maps with their corresponding decoder outputs, enabling the network to combine shallow spatial information with deep semantic information. The final prediction mask is generated through a 1x1 convolution layer.
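A quick way to see how the encoder and decoder mirror each other is to trace feature-map sizes through the four blocks. The sketch below assumes the standard U-Net convention of doubling channels at each encoder block (an assumption; the summary states only the spatial behavior):

```python
def unet_shapes(size=224, base_ch=64, depth=4):
    """Trace (spatial size, channels) through a U-Net encoder and decoder,
    assuming 'same'-padded 3x3 convs so only pooling/deconv change size."""
    enc, s, c = [], size, base_ch
    for _ in range(depth):
        enc.append((s, c))       # two 3x3 convs keep spatial size
        s, c = s // 2, c * 2     # stride-2 max-pool halves size, channels double
    bottleneck = (s, c)
    dec = []
    for _ in range(depth):
        s, c = s * 2, c // 2     # deconv doubles size; skip connection concatenates
        dec.append((s, c))       # the same-size encoder map here
    return enc, bottleneck, dec
```

Running this with the defaults shows decoder block sizes exactly mirroring the encoder, which is what makes the skip-connection concatenation dimensionally valid.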

Resize step: In most dermoscopy images, the lesion area occupies only a small portion of the frame, with the majority being non-lesion background that can interfere with classification. After generating segmentation masks, the authors crop the original images to the lesion bounding box and resize them to a fixed 224x224 resolution. This forces the classifier to focus on lesion-specific features rather than background noise. Experiments confirm the value of this step: with segmentation, Inception-v3 achieved 0.791 accuracy and 0.883 AUC, compared to only 0.698 accuracy and 0.781 AUC without segmentation. That is a 9.3 percentage point improvement in accuracy and a 10.2 percentage point improvement in AUC.

TL;DR: Stage 1 uses U-Net to segment lesions from dermoscopy images, then crops and resizes them to 224x224 to eliminate background noise. This preprocessing alone boosted Inception-v3 accuracy from 0.698 to 0.791 (+9.3 points) and AUC from 0.781 to 0.883 (+10.2 points).
Pages 4-6
Five Classification Networks Enhanced with Squeeze-and-Excitation Blocks

Squeeze-and-Excitation (SE) blocks: The authors augment each of the five classification networks with SE blocks, which act as a channel-wise attention mechanism. The "squeeze" operation applies global average pooling to compress each feature channel into a single scalar. The "excitation" operation passes these scalars through two fully connected layers with ReLU and sigmoid activations, producing a weight vector that maps each channel to a value between 0 and 1. These weights are then multiplied back with the original feature maps, amplifying informative channels and suppressing less useful ones. In all five networks, the SE block is placed after feature extraction and before the final classification layer.
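A forward-pass sketch of the squeeze-excitation-recalibrate sequence, with weight matrices passed in explicitly (shapes are illustrative, using the usual bottleneck reduction):

```python
import numpy as np

def se_block(feature_maps, W1, b1, W2, b2):
    """SE forward pass on one (H, W, C) feature map.
    W1: (C, C//r) and W2: (C//r, C) form the two-layer excitation MLP."""
    # squeeze: global average pooling -> one scalar per channel
    z = feature_maps.mean(axis=(0, 1))                # (C,)
    # excitation: ReLU bottleneck, then sigmoid to get per-channel weights
    h = np.maximum(0.0, z @ W1 + b1)                  # (C//r,)
    w = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))          # (C,), each in (0, 1)
    # recalibrate: scale every channel by its learned weight
    return feature_maps * w
```

Because each weight lies strictly between 0 and 1, the block can only attenuate channels, never amplify them beyond their original magnitude.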

Inception-v3: Uses multi-scale convolution (1x1, 3x3, and factorized 5x5 kernels) to capture features at different spatial resolutions simultaneously. It replaces large 5x5 convolutions with two stacked 3x3 layers, and decomposes n x n kernels into asymmetric 1 x n and n x 1 pairs, reducing parameters while increasing non-linear representational capacity.

ResNet-50: Addresses the vanishing/exploding gradient problem using residual blocks with identity shortcut connections. Each block adds its input directly to its output (H_l = H_{l-1} + F(H_{l-1})), enabling effective training at 50 layers of depth.

DenseNet-169: Takes the residual-connection idea further by concatenating every layer's output with those of all previous layers within a dense block, maximizing feature reuse while keeping the parameter count manageable (13.22M parameters vs. ResNet-50's 24.32M).
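The identity shortcut is simple enough to show in two lines, with a toy callable standing in for the block's convolutional path F:

```python
import numpy as np

def residual_block(x, f):
    """Identity-shortcut residual block: output = x + F(x).
    `f` is a stand-in for the block's conv layers (illustrative only)."""
    return x + f(x)

# Because the identity term passes gradients through unchanged, very deep
# stacks of these blocks remain trainable.
out = residual_block(np.array([1.0, 2.0]), lambda v: 0.5 * v)  # F(x) = 0.5x
```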

Inception-ResNet-v2: Combines Inception modules with residual learning, adding identity shortcuts to the different Inception block types for faster convergence. It is the largest model in the ensemble at 54.87M parameters.

Xception: Replaces standard Inception convolutions with depthwise separable convolutions that decouple channel and spatial correlations, achieving strong performance with 21.59M parameters.

Each of the five networks is pre-trained on ImageNet and fine-tuned with a 128-dimensional fully connected layer followed by a softmax classifier for three-class prediction.

TL;DR: Five ImageNet-pretrained networks (Inception-v3, ResNet-50, DenseNet-169, Inception-ResNet-v2, Xception) are enhanced with SE attention blocks. Parameters range from 13.22M (DenseNet-169) to 54.87M (Inception-ResNet-v2). Each outputs a three-class softmax prediction.
Pages 6-7
Locally Connected Ensemble Network

The traditional approach to ensemble learning is simple averaging, where each network's output probabilities are averaged with equal weight. The authors argue this is suboptimal because different networks perform better on different classes, and averaging dilutes the strengths of the best-performing model for each class. Voting-based strategies also fail to capture class-specific specialization.

Local connection design: Instead, the authors construct a small neural network that takes the concatenated three-class probability outputs from all five classifiers (a 15-dimensional input vector) and learns to optimally combine them. The key architectural choice is using locally connected layers rather than fully connected layers. In a locally connected layer, each output neuron connects to only a subset of inputs, meaning each class prediction is influenced only by the five models' predictions for that specific class, not by predictions for unrelated classes. This prevents cross-class interference and lets the network learn class-specific weighting of the base models.

The ensemble network consists of just two locally connected layers followed by a softmax output layer. With only 423 trainable parameters, it is orders of magnitude smaller than any individual classifier. Training this ensemble network takes approximately 20 seconds, compared to 1,900 to 3,200 seconds for the base classifiers (at 100 epochs). Since all five base networks are independent, they can be trained in parallel, further reducing total training time.
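A forward-pass sketch of the local-connection idea is below. The hidden width and weight shapes are hypothetical (the exact layer sizes behind the 423-parameter count are not reproduced here); what matters is that class c's score reads only the five models' probabilities for class c:

```python
import numpy as np

def locally_connected_ensemble(probs, W1, b1, W2, b2):
    """probs: (n_models, n_classes) stacked softmax outputs of the base
    classifiers. Each class has its own small weight stack; there are no
    connections between columns, so no cross-class interference."""
    n_classes = probs.shape[1]
    scores = np.empty(n_classes)
    for c in range(n_classes):
        h = np.maximum(0.0, probs[:, c] @ W1[c] + b1[c])  # per-class hidden layer
        scores[c] = h @ W2[c] + b2[c]                     # per-class output unit
    e = np.exp(scores - scores.max())                     # softmax over classes
    return e / e.sum()
```

Contrast this with a fully connected combiner, where class c's score would also depend on the models' predictions for the other classes.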

The design reflects a principled understanding of the ensemble problem: rather than treating all models equally or allowing arbitrary interactions between class predictions, the locally connected architecture encodes the inductive bias that each class should be determined independently by the ensemble of models' predictions for that class alone.

TL;DR: A lightweight ensemble network with only 423 parameters uses locally connected layers to combine five classifiers' outputs. Unlike averaging, it learns class-specific weighting, preventing cross-class interference. Training takes just 20 seconds.
Pages 7-8
Multi-Class Classification Performance

Dataset: The ISIC 2017 challenge dataset contains 2,750 dermoscopy images split into 2,000 training, 150 validation, and 600 test images. The three classes are highly imbalanced: benign nevi (BN) dominates with 1,372 training images, while melanoma (MM) has 374 and seborrheic keratosis (SK) has only 254 training images. Due to this imbalance, accuracy alone is unreliable, so AUC serves as the primary evaluation metric.
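AUC suits this imbalanced setting because it is rank-based: it measures the probability that a randomly chosen positive is scored above a randomly chosen negative, independent of the class prior. A minimal implementation (equivalent to the Mann-Whitney U statistic, for illustration):

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of positive/negative pairs where the
    positive outranks the negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that always predicts the majority class can still post high accuracy here, but its AUC collapses toward 0.5, which is why the authors lean on AUC.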

Individual network performance: Among the five base classifiers, Xception performed best with 0.810 accuracy and 0.896 AUC. DenseNet-169 and Inception-ResNet-v2 tied at 0.800 accuracy. ResNet-50 was weakest at 0.762 accuracy and 0.864 AUC. The simple averaging ensemble produced only 0.793 accuracy and 0.880 AUC, below Xception's individual results on both metrics, demonstrating that naive averaging can actually hurt performance.

Ensemble improvement: The proposed locally connected ensemble achieved 0.851 accuracy, 0.769 precision, 0.741 F1-score, and 0.913 AUC. Compared to the best individual network (Xception), this represents a 4.1 percentage point improvement in accuracy and a 1.7 percentage point improvement in AUC. Compared to the averaging ensemble, the improvement is 5.8 points in accuracy and 3.3 points in AUC. The only metric where the ensemble slightly underperformed was recall, where it scored 0.715 versus Xception's 0.748, a gap of 0.033.

The implementation was built in Keras on a GeForce RTX 2080Ti GPU. All images were resized to 224x224 after segmentation. The Adam optimizer was used with an initial learning rate of 0.0001, 100 epochs maximum, and early stopping with a patience of 10 epochs to prevent overfitting.
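The patience-based stopping rule can be illustrated with a small simulation. This is a sketch of the generic rule (halt once the monitored validation loss has failed to improve for 10 consecutive epochs); the summary does not state which quantity the authors monitored:

```python
def early_stopping_train(val_losses, patience=10, max_epochs=100):
    """Return (best_epoch, stopping_epoch) under a patience-based rule:
    stop once `patience` epochs pass with no new best validation loss."""
    best, best_epoch = float('inf'), 0
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, best_epoch = loss, epoch   # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch         # patience exhausted: stop early
    return best_epoch, epoch
```

With patience 10, training can end well before the 100-epoch cap once validation loss plateaus, which is the overfitting guard the authors describe.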

TL;DR: On the ISIC 2017 three-class task, the ensemble reached 0.851 accuracy and 0.913 AUC, beating Xception alone (0.810 accuracy, 0.896 AUC) and the averaging method (0.793 accuracy, 0.880 AUC). Training used Adam optimizer, learning rate 0.0001, and early stopping at 10 epochs patience.
Pages 8-10
Binary Classification and Comparison with State-of-the-Art

Binary task performance: The ISIC 2017 challenge originally defined two binary tasks: melanoma vs. others and seborrheic keratosis vs. others. Averaging results across both tasks, the ensemble achieved 0.909 accuracy, 0.859 precision, 0.808 recall, 0.828 F1-score, and 0.911 AUC. For melanoma classification specifically, the ensemble outperformed all individual networks across every metric, with precision exceeding the second-best model (DenseNet) by more than 10 percentage points and F1-score leading by approximately 5 points.

Comparison with machine learning ensembles: The authors also compared their locally connected ensemble against traditional machine learning ensemble methods: SVC, Random Forest, Extra-Trees, KNN, and Gradient Boost Decision Tree (GBDT). While Random Forest achieved a marginally higher accuracy (0.912 vs. 0.909), the proposed method significantly outperformed all ML ensembles in precision (0.859 vs. best ML of 0.808), recall (0.808 vs. best ML of 0.664), F1-score (0.828 vs. best ML of 0.721), and AUC (0.911 vs. best ML of 0.816). The ML methods suffered particularly on recall, with scores ranging from 0.644 to 0.664, indicating they missed many positive cases.

Comparison with ISIC 2017 challenge leaders: Benchmarking against the top five ISIC 2017 challenge entries and recent published methods, the ensemble achieved the highest accuracy (0.909) and precision (0.859) among all compared methods. On AUC, it scored 0.911, which matched the challenge's first-place entry (0.911) and exceeded all other challenge participants. Notably, most challenge entrants used external training data, while this method used only the official ISIC 2017 dataset. The method by Datta et al. (2021) achieved a higher AUC of 0.959, but with lower accuracy (0.833) and no reported precision. Xie et al. (2020) reached 0.938 AUC but only 0.904 accuracy.

TL;DR: On binary tasks, the ensemble achieved 0.909 accuracy, 0.859 precision, and 0.911 AUC. It matched the ISIC 2017 challenge top-1 AUC (0.911) without external data. It outperformed ML ensemble methods (Random Forest, GBDT, etc.) by over 9 points in AUC and 14 points in recall.
Page 10
Constraints and Opportunities for Improvement

Recall weakness: The most notable limitation is the model's relatively lower recall performance. On the binary classification tasks, recall was 0.808 compared to precision of 0.859, indicating the system misses some true positive melanoma cases. In a clinical screening context, missed melanomas (false negatives) are more dangerous than false alarms, so improving sensitivity should be a priority. In the multi-class setting, recall dropped further to 0.715, below Xception's 0.748.

Dataset constraints: The ISIC 2017 dataset contains only 2,750 total images with severe class imbalance: benign nevi account for 68.6% of training data, while melanoma and seborrheic keratosis represent just 18.7% and 12.7%, respectively. This small, imbalanced dataset limits generalization. The authors did not use external training data, which could have improved results, and they did not evaluate on newer, larger ISIC datasets (2018, 2019) or multi-center clinical cohorts.

Single-dataset evaluation: All results come from a single benchmark dataset. Without validation on external clinical data from different imaging devices, patient demographics, or clinical settings, it is unclear how this method would perform in real-world deployment. The absence of cross-dataset transfer experiments is a significant gap, as dermoscopy image characteristics can vary substantially across institutions and devices.

Future directions: The authors note that preprocessing, particularly the segmentation and resize step, had the largest single impact on classification accuracy (a 9.3-point accuracy improvement). They plan to explore more effective classification methods that leverage characteristics specific to dermoscopy images and the relationships between different lesion classes. Additional opportunities include testing on larger and more diverse datasets, incorporating clinical metadata (patient age, lesion location, lesion history), exploring transformer architectures, and developing explainability methods to support clinical adoption.

TL;DR: Key limitations include lower recall (0.808 on binary, 0.715 on multi-class), a small and imbalanced single-benchmark dataset (2,750 images, 68.6% benign nevi), and no external validation. Future work targets better preprocessing, larger multi-center datasets, and clinical metadata integration.
Citation: Ding J, Song J, Li J, Tang J, Guo F. Two-Stage Deep Neural Network via Ensemble Learning for Melanoma Classification. Frontiers in Bioengineering and Biotechnology. Open Access, 2021. Available at: PMC8804371. DOI: 10.3389/fbioe.2021.758495. License: CC BY.