Skin cancer is the fifth most common form of cancer globally, and its most aggressive subtype, melanoma, affects over 132,000 new patients worldwide each year according to the World Health Organization. In the United States, skin cancers account for 4% of all malignant neoplasms and roughly 1% of all cancer-related deaths. The Centers for Disease Control and Prevention estimates that treatment for all forms of skin cancer costs at least USD 8 billion annually. Early detection dramatically improves survival, but dermatologists face a significant challenge: over 2,000 dermatological diseases produce similar-looking lesions, and conventional diagnosis relies on histopathology (biopsy), which is invasive, costly, and time-consuming. In the Medicare population alone, over 8.2 million skin biopsies are performed each year to diagnose approximately 2 million skin cancers, meaning many procedures are unnecessary.
Risk disparities: The lifetime risk of developing melanoma is 2.4% in Caucasians, 0.5% in Hispanics, and 0.1% in Black individuals. By age 65, men are twice as likely as women to develop melanoma, and by age 80, three times as likely. Globally, melanoma and non-melanoma skin cancer (NMSC) cases are estimated to exceed 1.7 million new diagnoses in 2025. Melanoma accounts for only 5% of skin cancer cases but is the primary driver of skin cancer mortality. Among NMSC subtypes, basal cell carcinoma (BCC) represents 80-85% of cases, while squamous cell carcinoma (SCC) comprises 15-20%.
The proposed solution: This study investigates a novel approach of integrating deep attention mechanisms (self-attention, soft attention, and hard attention) with the Xception transfer learning architecture for binary classification of skin lesions as benign or malignant. While deep learning and CNNs have shown strong results in skin cancer diagnostics, no prior research had explored combining attention mechanisms with Xception for this binary classification task. The hypothesis is that attention mechanisms can direct the model to focus on the most diagnostically relevant regions of dermoscopic images, rather than treating all image regions equally.
Why Xception: Developed by Chollet, Xception is a depthwise separable convolutional neural network pre-trained on the ImageNet database (over 1 million images). In comparative studies, Xception has consistently outperformed architectures like VGG-16, ResNet, and Inception V3 in traditional classification tasks. On the HAM10000 skin cancer dataset, Xception achieved 90.48% accuracy, surpassing five other transfer learning networks tested in the same study.
The authors conduct a thorough literature review of deep learning and transfer learning methods applied to skin cancer detection. CNN-based approaches have achieved strong results on the HAM10000 dataset. One study using Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) for image enhancement achieved accuracies of 98.77%, 98.36%, and 98.89% across three protocols. A lightweight CNN (LWCNN) achieved 91.05% accuracy with only 22.54 minutes of total processing time. Meanwhile, a novel SCDNet architecture based on VGG16 and CNN achieved 96.91% accuracy on 25,331 ISIC 2019 images, outperforming ResNet50 (95.21%), AlexNet (93.14%), VGG19 (94.25%), and Inception V3 (92.54%).
Transfer learning advances: A MobileNetV2-based model achieved 98.2% accuracy on 33,126 SIIM-ISIC 2020 images. Using modified EfficientNet V2-M on 58,032 dermoscopic images, researchers reached 99.23% binary classification accuracy on ISIC 2020 and 95.95% on HAM10000. An ensemble model combining VGG, CapsNet, and ResNet achieved 93.5% accuracy on 25,000 ISIC images with a training time of just 106 seconds. Six transfer learning networks tested on HAM10000 showed Xception outperforming all others at 90.48% accuracy, 89.57% recall, 88.76% precision, and 89.02% F1 score.
Attention mechanisms in skin cancer: A soft-attention-based CNN (SAB-CNN) classified HAM10000 images at 95.94% accuracy with a 95.30% Matthews correlation coefficient. Soft attention applied across five deep neural networks (ResNet34, ResNet50, Inception ResNet v2, DenseNet201, VGG16) improved performance by 4.7% over baseline, achieving 93.7% precision on HAM10000. On ISIC-2017, soft attention coupling improved sensitivity by 3.8%, reaching 91.6%. A dual-track model using modified DenseNet-169 with a coordinate attention module (CoAM) plus a custom CNN achieved 93.2% accuracy, 95.3% precision, 91.4% recall, and 93.3% F1 score on HAM10000.
The review reveals a critical gap: despite the demonstrated benefits of both Xception and attention mechanisms separately, no prior study had combined them for binary skin cancer classification. This positions the current study as novel in investigating how self-attention, soft attention, and hard attention each interact with the Xception architecture for distinguishing benign from malignant skin lesions.
The study uses the HAM10000 ("Human Against Machine") dataset, a publicly available collection of 10,015 dermoscopic images representing seven types of pigmented skin lesions: actinic keratosis (AKIEC, 327 images), basal cell carcinoma (BCC, 514), benign keratosis (BKL, 1,099), dermatofibroma (DF, 115), melanocytic nevi (NV, 6,705), melanoma (MEL, 1,113), and vascular lesions (VASC, 142). More than 50% of the lesions were verified through histopathology, while the remainder were confirmed via follow-up examination, expert consensus, or in vivo confocal microscopy. All images are in color at 450 x 600 pixels. The dataset is roughly balanced by sex (51.1% male, 48.9% female) with an age distribution showing bimodal peaks at 35-50 and 60-75 years.
Binary relabeling: For this study, the seven-class dataset was collapsed into two binary categories. Melanoma (MEL), basal cell carcinoma (BCC), and actinic keratosis (AKIEC, a premalignant lesion) were grouped as "cancer" (malignant), totaling 1,954 images (19.51% of the dataset: 1,113 + 514 + 327). The remaining four classes (DF, BKL, NV, VASC) were classified as "normal" (benign), totaling 8,061 images (80.49%). This created a heavily imbalanced dataset that required augmentation.
Data augmentation: To address the class imbalance, the cancer class was augmented using three techniques via the Keras ImageDataGenerator: random rotation within a range of 40 degrees, brightness adjustment between 1.0 and 1.3 times the original, and random horizontal and vertical flipping. Three augmented versions were generated for each original cancer image, increasing the cancer class from 1,954 to 7,816 images and the total dataset from 10,015 to 15,877 dermoscopic images.
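A minimal NumPy sketch of this augmentation step (the paper uses Keras's ImageDataGenerator; rotation within 40 degrees is omitted here for brevity, and the `augment` helper is illustrative, not the authors' code). The arithmetic at the bottom reproduces the reported dataset growth:

```python
import numpy as np

def augment(image, rng):
    """One augmented variant of a dermoscopic image (pixels in [0, 1]).

    Stand-in for the ImageDataGenerator settings described above:
    brightness scaling in [1.0, 1.3] and random horizontal/vertical flips.
    """
    out = image * rng.uniform(1.0, 1.3)          # brightness 1.0-1.3x
    out = np.clip(out, 0.0, 1.0)                 # keep pixels in range
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                    # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :, :]                    # vertical flip
    return out

# Three variants per original cancer image, as reported:
n_cancer = 1954
cancer_after = n_cancer + 3 * n_cancer           # 1954 -> 7816
total_after = 10015 + 3 * n_cancer               # 10015 -> 15877
```

Generating exactly three variants per image quadruples the minority class, bringing it close to parity with the 8,061 benign images.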
Preprocessing pipeline: All images were resized from 450 x 600 to 299 x 299 pixels to match the Xception model's default input size. Pixel values were normalized from the 0-255 range to 0-1. Data shuffling was applied to prevent bias during training by ensuring randomness in batch selection. The final dataset was split into 80% for training (12,702 images) and 20% for testing (3,175 images), with 10-fold cross-validation applied during training and evaluation.
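The normalization, shuffling, and splitting steps can be sketched as follows (resizing to 299 x 299 would be done with an image library beforehand; the `preprocess_and_split` helper is illustrative, not the authors' code):

```python
import numpy as np

def preprocess_and_split(images, labels, train_frac=0.8, seed=42):
    """Normalize 0-255 pixels to [0, 1], shuffle, and split 80/20.

    With 15,877 images this yields the reported 12,702 training
    and 3,175 test images.
    """
    x = images.astype(np.float32) / 255.0        # normalize to [0, 1]
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))              # shuffle to avoid order bias
    x, y = x[order], labels[order]
    n_train = int(round(len(x) * train_frac))
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])
```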
The core architecture starts with the pre-trained Xception model serving as a feature extractor. The Xception model was loaded with "include_top" set to false, meaning the fully connected classification layers originally trained for ImageNet were removed and replaced with custom layers tailored to the binary skin cancer task. A GlobalAveragePooling2D layer was added after the base Xception model to reduce spatial dimensions by extracting global features from the feature maps. A dropout layer followed to prevent overfitting and enhance generalization.
Self-attention (SL) layer: This mechanism transforms the input into query (Q), key (K), and value (V) vectors through linear transformations. Attention scores are computed as the dot product of the query with all keys, divided by the square root of the key dimension. These scores are normalized via Softmax to obtain attention weights for the values. This was implemented using Keras's built-in attention layer. Self-attention excels at modeling relationships between distant elements in an image, which is particularly useful for capturing context across an entire dermoscopic image.
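The scaled dot-product computation described above can be sketched in NumPy (the paper uses Keras's built-in attention layer; the projection matrices here are hypothetical stand-ins for learned weights):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    Q, K, V are linear projections of the input; scores are Q K^T / sqrt(d_k),
    normalized with a row-wise softmax and used to weight V.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights
```

Because every position attends to every other position, distant lesion regions can influence each other's representations, which is the long-range-dependency property noted above.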
Soft attention (SF) layer: This mechanism downweights irrelevant image regions by multiplying corresponding feature maps with low weights (closer to 0), allowing the model to focus on the most diagnostically relevant information. A dense layer with Softmax activation computes attention weights that sum to 1 for each feature, and these weights are applied to the feature map via a dot product operation. Unlike hard attention, soft attention is fully differentiable, meaning it can be trained end-to-end using standard backpropagation.
Hard attention (HD) layer: This mechanism forces the model to make binary decisions about which input components to attend to. It applies a binary mask (0 or 1) to attention scores between queries and keys, assigning 1 to the top-k highest-scoring elements and 0 to everything else. This compels the model to focus exclusively on the most important features while completely disregarding others. The selection process does not involve gradients, making it non-differentiable and requiring reinforcement learning or sampling-based techniques during training.
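The top-k masking step can be sketched as follows (illustrative only; in the actual model this mask is applied to query-key attention scores):

```python
import numpy as np

def hard_attention_mask(scores, k):
    """Hard attention: binary mask keeping only the top-k scores per row.

    Assigns 1 to the k highest-scoring elements and 0 to the rest, so the
    model attends exclusively to the most important features. The top-k
    selection has no gradient, which is why training this mechanism needs
    reinforcement learning or sampling-based estimators.
    """
    top_k = np.argsort(scores, axis=-1)[..., -k:]    # indices of top-k scores
    mask = np.zeros_like(scores)
    np.put_along_axis(mask, top_k, 1.0, axis=-1)
    return mask
```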
Classification head: After the attention layer, outputs were flattened and fed into a dense (fully connected) layer. A sigmoid activation function transformed the output into a binary probability (0 or 1). For models with attention mechanisms, L2 regularization (0.001) was applied to the dense layer weights. The base Xception model without attention mechanisms did not use L2 regularization or dropout, as these were deemed unnecessary for the simpler architecture.
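A sketch of this head for a single feature vector, assuming hypothetical trained weights W and b; the L2 term shown is the penalty (lambda = 0.001) that the attention models add to the loss:

```python
import numpy as np

def classification_head(features, W, b, l2_lambda=0.001):
    """Dense layer + sigmoid producing a malignancy probability in (0, 1),
    thresholded at 0.5 for the binary label."""
    logit = features @ W + b
    prob = 1.0 / (1.0 + np.exp(-logit))          # sigmoid
    label = int(prob >= 0.5)                     # default 0.5 threshold
    l2_penalty = l2_lambda * np.sum(W ** 2)      # added to the training loss
    return prob, label, l2_penalty
```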
All four models (Xception base, Xception-SL, Xception-SF, Xception-HD) were trained using the Adam optimizer with a learning rate of 0.001, a batch size of 32, and 50 epochs across all folds of the 10-fold cross-validation. The loss function was binary cross-entropy with a default probability threshold of 0.5. Models with attention mechanisms used both sigmoid and Softmax activation functions, while the base model used sigmoid only. Early stopping with a patience of five epochs was applied to prevent overfitting, and checkpointing saved the weights with the lowest validation loss and highest accuracy.
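The binary cross-entropy loss used for all four models is L = -mean(y log p + (1 - y) log(1 - p)); a minimal implementation with clipping to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy over predicted probabilities p and labels y."""
    p = np.clip(y_pred, eps, 1 - eps)            # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```

A maximally uncertain prediction (p = 0.5) costs ln 2 per example, and the loss grows without bound as confident predictions become wrong, which is what drives the sigmoid outputs toward the correct class.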
Dropout and regularization: The attention-enhanced models used a dropout rate of 0.7 and L2 regularization with a weight of 0.001. The base Xception model used neither dropout nor L2 regularization. Data shuffling was enabled for all models to prevent order-dependent learning bias.
Evaluation metrics: The study measured six metrics. Accuracy quantified overall correct classifications. Recall (sensitivity/true positive rate) measured how many actual cancer cases were correctly identified, which the authors emphasize as the most critical metric for medical applications because the goal is to minimize missed cancer cases. Precision measured how many predicted cancer cases were actually cancerous. F1 score combined precision and recall as a harmonic mean, making it particularly valuable for imbalanced datasets. Cohen's kappa assessed agreement between model predictions and true labels while accounting for chance agreement. The AUC (area under the ROC curve) measured the model's ability to distinguish between benign and malignant classes, with higher values indicating better class separation.
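All of these metrics follow directly from the confusion-matrix counts (TP, FP, FN, TN), as this small helper shows:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the study's metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)                  # sensitivity: cancers caught
    precision = tp / (tp + fp)               # predicted cancers that are real
    f1 = 2 * precision * recall / (precision + recall)
    false_alarm = fp / (fp + tn)             # benign flagged as malignant
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_e = (((tp + fp) / total) * ((tp + fn) / total)
           + ((fn + tn) / total) * ((fp + tn) / total))
    kappa = (accuracy - p_e) / (1 - p_e)
    return dict(accuracy=accuracy, recall=recall, precision=precision,
                f1=f1, false_alarm=false_alarm, kappa=kappa)
```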
The false alarm rate (false positive rate) was also computed from the confusion matrices. This metric quantifies how often the model incorrectly classifies benign lesions as malignant, which in a clinical setting would correspond to unnecessary biopsies and patient anxiety. Balancing false alarm rate against recall is one of the central trade-offs in medical AI, because lowering the false alarm rate typically comes at the cost of missing some true cancer cases.
The results clearly demonstrate that integrating attention mechanisms into Xception improves performance across every metric. The base Xception model achieved 91.05% accuracy, 91.68% recall, 90.78% precision, 91.23% F1 score, an AUC of 0.972, and a Cohen's kappa of 0.821. Self-attention (Xception-SL) delivered the best performance overall: 94.11% accuracy, 95.47% recall, 93.10% precision, 94.27% F1 score, an AUC of 0.987, and a Cohen's kappa of 0.882. Soft attention (Xception-SF) came in second: 93.29% accuracy, 95.28% recall, 91.81% precision, 93.51% F1 score, AUC of 0.983, and kappa of 0.865. Hard attention (Xception-HD) was third: 92.97% accuracy, 93.98% recall, 92.32% precision, 93.14% F1 score, AUC of 0.983, and kappa of 0.859.
Confusion matrix analysis: The Xception-SL model correctly classified 1,539 normal images and 1,449 cancerous images, with only 114 false positives and 73 false negatives. Xception-SF correctly classified 1,536 normal and 1,426 cancerous images (137 false positives, 76 false negatives). Xception-HD correctly identified 1,515 normal and 1,437 cancerous images (126 false positives, 97 false negatives). The base Xception model had the worst error distribution: 1,478 normal and 1,413 cancerous correctly identified, but 150 false positives and 134 false negatives.
False alarm rates: Xception-SL achieved the lowest false alarm rate at 6.90%. Xception-HD had a false alarm rate of 7.68%, and Xception-SF was at 8.19%. The base Xception had the highest false alarm rate at 9.21%. The self-attention model's combination of the highest recall (95.47%) and the lowest false alarm rate (6.90%) makes it particularly attractive for clinical deployment, where both missed cancers and unnecessary biopsies carry significant costs.
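These rates can be reproduced from the confusion matrices reported above, since the false alarm rate is FP / (FP + TN):

```python
# (TN, FP) pairs from the reported test-set confusion matrices
models = {
    "Xception-SL": (1539, 114),
    "Xception-SF": (1536, 137),
    "Xception-HD": (1515, 126),
    "Xception":    (1478, 150),
}

false_alarm = {name: fp / (tn + fp) for name, (tn, fp) in models.items()}
# e.g. Xception-SL: 114 / 1653 = 6.90%
```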
AUC and agreement: All three attention-enhanced models converged at AUC scores of 0.983-0.987, while the base Xception was slightly lower at 0.972. Cohen's kappa scores ranged from 0.821 (base) to 0.882 (Xception-SL), all indicating substantial agreement between predictions and ground truth. The convergence of the three attention-enhanced models suggests that regardless of the specific attention mechanism used, directing the model to focus selectively on image regions provides a consistent performance benefit over treating all regions equally.
The authors compare their results against two recent and closely related studies that also used the HAM10000 dataset with binary relabeling (malignant vs. benign) and transfer learning approaches. The first comparator used modified EfficientNet V2-M and EfficientNet-B4, achieving the highest accuracy of 95.95% but with notably lower recall (94%), precision (83%), and F1 score (88%). While the EfficientNet approach outperformed Xception-SL on accuracy (95.95% vs. 94.11%), all three proposed attention-enhanced models outperformed it on recall, precision, and F1 score: Xception-SL achieved 95.47% recall vs. 94%, 93.10% precision vs. 83%, and 94.27% F1 vs. 88%. The attention models also posted higher AUCs (0.983-0.987) than the comparator's 0.980.
The second comparator used a modified DenseNet-169 with a coordinate attention module (CoAM) combined with a customized CNN, achieving 93.2% accuracy, 95.3% precision, 91.4% recall, and 93.3% F1 score. Both Xception-SL (94.11% accuracy, 94.27% F1) and Xception-SF (93.29% accuracy, 93.51% F1) outperformed this approach in accuracy and F1 score. Critically, all four proposed models outperformed the DenseNet-169/CoAM approach in recall: Xception-SL at 95.47%, Xception-SF at 95.28%, Xception-HD at 93.98%, and even the base Xception at 91.68%, compared to the DenseNet study's 91.4%.
Clinical significance of recall: The authors emphasize that recall is arguably the most important metric in medical applications because it measures how many actual cancer cases the model correctly identifies. A model with high accuracy but low recall might miss real cancers, leading to delayed diagnoses and worse patient outcomes. In clinical terms, minimizing false negatives is essential. The fact that both Xception-SL and Xception-SF achieved recall above 95% while maintaining competitive accuracy and precision suggests these models could meaningfully reduce missed diagnoses in a clinical screening workflow.
The comparison also demonstrates that the specific type of attention mechanism matters. Self-attention's ability to model long-range dependencies across the entire image appears particularly well-suited to dermoscopic image analysis, where the spatial relationships between lesion features, borders, and surrounding skin all contribute to diagnostic accuracy.
Dataset limitations: The primary limitation is the constrained size and diversity of the HAM10000 dataset (10,015 original images), which led to overfitting. While the attention-enhanced models performed well on training data, test set performance was lower than expected, indicating generalization challenges. Although three anti-overfitting techniques were employed (L2 regularization, early stopping with patience of 5 epochs, and dropout at 0.7), overfitting remained a concern. The augmentation strategy expanded the cancer class from 1,954 to 7,816 images, but this was done using relatively simple transformations (rotation, brightness, flipping) that may not introduce sufficient visual diversity.
Data quality issues: The HAM10000 dataset contains image noise from varying lighting conditions, different capture devices, inconsistent image resolution, and variable clarity of lesion boundaries. These inconsistencies may have hindered the models' ability to learn relevant features and could reduce classification accuracy. The authors suggest that continuous development of noise-filtering techniques could improve image quality and model performance. The study is also limited to a single dataset, so it remains unclear how well the models would generalize to images from different clinical settings, populations, or imaging devices.
Computational cost: The high computational resources required for training and evaluating four models with 10-fold cross-validation represent a practical limitation, especially as the fine-tuning architecture complexity increases with the addition of attention layers. This is an important consideration for potential clinical deployment, where inference speed and hardware requirements matter.
Future directions: The authors outline several planned improvements. They intend to experiment with larger combined datasets and use generative adversarial networks (GANs) to synthesize realistic skin lesion images, which would address both the size and diversity limitations. Noise filtering techniques will be explored to improve input image quality. Additional attention mechanisms beyond the three tested will be investigated. The authors also plan to evaluate alternative transfer learning backbones, specifically EfficientNet and ResNet, and explore ensemble methods that combine multiple architectures. Weight decay strategies will be investigated as an additional regularization approach.