Hybrid AI for X-ray Object Detection
Automated X-ray screening is a vital component of public safety, providing a crucial means of detecting prohibited items. However, this process faces significant challenges, including object occlusion, cluttered scenes, high variation within object classes, and the inherent visual ambiguity of grayscale X-ray images. When screening relies solely on manual inspection, these difficulties lead to operator fatigue and errors, particularly in high-throughput environments.
In recent years, deep learning (DL) methods, and Convolutional Neural Networks (CNNs) in particular, have driven breakthroughs in generic object detection and brought notable advances to automatic object detection in X-ray images. The X-ray security imaging literature has historically been dominated by CNN-based methods. Examples include occlusion-aware extensions of popular CNN detectors, such as DOAM and LIM, and YOLO-based variants such as EM-YOLO and SC-YOLOv8, which target challenges including occlusion, low contrast, class imbalance, and real-time analysis requirements.
In parallel, Vision Transformers (ViTs) have emerged, offering stronger modeling of global visual context by attending across image patches. DETR and its variants, such as Sparse DETR and DINO, formulate object detection as a set prediction problem using transformers. More recently, hybrid CNN-transformer architectures have been introduced to combine the complementary strengths of the two families: CNNs excel at capturing local detail, while ViTs are adept at capturing long-range relationships across a visual scene. This combination has shown promise across a range of computer vision tasks. Next-ViT-S is a leading example of such a hybrid model: it alternates specialized convolution blocks with attention-based ones, is designed for efficient deployment, and outperforms comparable CNNs and ViTs on standard benchmarks.
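To make the alternating design concrete, the following is a minimal PyTorch sketch of one hybrid backbone stage that interleaves a convolutional block (local detail) with a multi-head self-attention block (global context). It is illustrative only and is not the actual Next-ViT-S implementation; the module names, block structure, and depth are assumptions made here for exposition.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Local feature extraction: a simple depthwise + pointwise convolution block (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise conv
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise conv
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

class AttentionBlock(nn.Module):
    """Global context: multi-head self-attention over flattened spatial tokens (illustrative)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)  # self-attention across all patches
        tokens = tokens + attn_out                       # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class HybridStage(nn.Module):
    """Alternates convolutional and attention blocks within one backbone stage."""
    def __init__(self, dim, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            ConvBlock(dim) if i % 2 == 0 else AttentionBlock(dim) for i in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)                   # dummy feature map
    print(HybridStage(dim=64)(feats).shape)              # torch.Size([1, 64, 32, 32])
```

In this sketch the convolutional blocks operate on local neighborhoods while the attention blocks mix information across the entire spatial grid, which is the intuition behind interleaving the two block types in a single backbone.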
Despite the success of ViTs and hybrid architectures in natural image analysis, the X-ray imaging community has remained focused primarily on CNN-based approaches. Adoption has been limited mainly because ViT components typically require large training datasets and can be computationally heavy. Lightweight adaptations and Neural Architecture Search are being explored to reduce the computational overhead. The demand for large datasets is a further obstacle, given the lack of sufficiently large X-ray datasets and the potential biases in publicly available ones; techniques such as semi-supervised learning, synthetic data augmentation, cross-domain pretraining, and few-shot learning are being investigated to reduce the reliance of hybrid architectures on large labeled datasets.
This paper investigates hybrid CNN-transformer architectures for detecting illicit objects in X-ray inspection imaging and evaluates them against a typical CNN-only detection baseline. The study forms several detector variants by combining a CNN backbone (HGNetV2) and a hybrid CNN-transformer backbone (Next-ViT-S) with a CNN detection head (YOLOv8) and a transformer detection head (RT-DETR). These combinations are compared against a common CNN-only baseline: YOLOv8 with its default CSP-DarkNet53 backbone.
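To illustrate the backbone/head pairing studied here, the sketch below shows a generic backbone-plus-head detector in PyTorch. It is a conceptual outline, not the actual YOLOv8 or RT-DETR code: the Detector class, the dummy backbone and head, and the input size are hypothetical stand-ins, and integrating HGNetV2 or Next-ViT-S backbones with YOLOv8 or RT-DETR heads in practice requires adapter code within the respective frameworks.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Illustrative backbone + detection-head composition; hypothetical, for exposition only."""
    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # feature extractor, e.g. CNN (HGNetV2-like) or hybrid (Next-ViT-like)
        self.head = head          # prediction head, e.g. CNN (YOLO-style) or transformer (DETR-style)

    def forward(self, images):
        features = self.backbone(images)  # feature maps
        return self.head(features)        # detection outputs (boxes, class scores)

if __name__ == "__main__":
    # Stand-in modules; in the study, these slots would hold HGNetV2 / Next-ViT-S backbones
    # and YOLOv8 / RT-DETR heads, each requiring its own adapter code.
    dummy_backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
    dummy_head = nn.Conv2d(16, 5, kernel_size=1)  # e.g. 4 box coordinates + 1 objectness score per location
    model = Detector(dummy_backbone, dummy_head)
    out = model(torch.randn(1, 3, 640, 640))
    print(out.shape)  # torch.Size([1, 5, 320, 320])
```

The design choice the sketch highlights is that, once backbone and head expose compatible feature interfaces, the four CNN/transformer pairings evaluated in the study are obtained simply by swapping the two components.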