Adapting AI to See What Doctors See: Zero-Shot Segmentation Gets a Boost

Author: Denis Avetisyan

A new framework enhances the Segment Anything Model’s ability to accurately identify structures in medical images without requiring task-specific training.

Existing segmentation approaches typically demand task-specific refinement, whether through full parameter updates or parameter-efficient tuning, but a novel method achieves task-agnostic adaptation at inference time via prompt injection and boundary-aware alignment-effectively persuading the model without the need for further training on labeled data.

Boundary-Aware Test-Time Adaptation leverages prompt engineering and feature alignment to improve zero-shot medical image segmentation performance.

Despite the promise of foundation models for medical imaging, adapting them to new tasks remains challenging due to limited annotated data and domain shifts. This paper introduces ‘Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation’, a novel framework, BA-TTA-SAM, designed to enhance the Segment Anything Model’s zero-shot performance on medical images. By integrating Gaussian prompt injection and cross-layer boundary-aware attention, BA-TTA-SAM aligns deep semantic features with crucial boundary cues-achieving a 12.4

The Shifting Sands of Perception: From CNNs to Vision Transformers

For years, convolutional neural networks (CNNs) dominated the field of medical image analysis, effectively identifying patterns within scans to assist in diagnosis and treatment planning. However, these networks inherently process images in localized regions, relying on small filters to detect features. This approach creates challenges when analyzing images requiring an understanding of relationships between distant parts – a phenomenon known as struggling with long-range dependencies. Consider a scan searching for a subtle anomaly spread across a large organ; a CNN might successfully identify local textures but fail to integrate these observations into a cohesive understanding of the entire structure. This limitation prompted researchers to explore architectures capable of capturing these broader contextual relationships, ultimately paving the way for the adoption of attention-based mechanisms and the rise of Vision Transformers in medical imaging.

The Vision Transformer (ViT) marked a significant departure from conventional convolutional neural networks in medical image analysis by adapting the transformer architecture – originally designed for natural language processing – to process images as sequences of patches. Rather than relying on spatially local receptive fields, ViT utilizes a self-attention mechanism, allowing each patch to attend to all other patches in the image, thereby capturing long-range dependencies crucial for understanding complex anatomical structures. This global context awareness enables the model to extract more robust and informative features, particularly in scenarios where subtle relationships across distant image regions are indicative of pathology. By focusing on relationships between image parts, the ViT effectively sidesteps the limitations of CNNs in modeling non-local interactions, opening new avenues for improved accuracy and efficiency in medical image segmentation and diagnosis.

Recent advancements in medical image segmentation increasingly utilize hybrid architectures that strategically combine convolutional neural networks (CNNs) and Vision Transformers. These models, such as TransUNet and HiFormer, aim to overcome the limitations of each individual approach; CNNs excel at capturing local features and spatial hierarchies, while Vision Transformers demonstrate superior ability in modeling long-range dependencies within images. By integrating these strengths, hybrid architectures can more effectively delineate complex anatomical structures and lesions. For instance, a typical implementation might employ a CNN to extract initial low-level features, followed by a Transformer encoder to capture global contextual information, ultimately leading to improved segmentation accuracy and robustness compared to either CNNs or Transformers alone. This synergistic approach is proving particularly valuable in applications requiring precise identification of subtle details within medical scans, offering a promising path towards more accurate diagnoses and treatment planning.

Our BA-TTA-SAM approach demonstrates improved segmentation performance across challenging medical datasets compared to SAM and existing TTA baselines.

Bridging the Gap: The Promise of Test-Time Adaptation

Performance degradation of machine learning models is commonly observed when transitioning between datasets, a phenomenon known as domain shift. This is particularly problematic in medical imaging, where variations in image acquisition protocols, patient demographics, and annotation styles across datasets like ISIC2017 (skin lesion segmentation), Kvasir-SEG (colorectal polyp segmentation), BUSI (breast ultrasound segmentation), and REFUGE (retinal fundus gland segmentation) can significantly reduce model accuracy and reliability. These discrepancies hinder the clinical translation of models trained on one dataset when applied to data from a different, previously unseen source, necessitating robust methods to address domain generalization and adaptation.

Test-Time Adaptation (TTA) addresses the problem of domain shift, where discrepancies between training and testing data distributions reduce model accuracy. Unlike traditional fine-tuning, TTA modifies model behavior during inference without altering the underlying model weights. This is achieved through techniques that leverage the characteristics of the input test data to adjust the model’s predictions, effectively adapting it to the unseen domain. The key benefit is maintaining a single, pre-trained model while still achieving improved performance across diverse datasets, avoiding the computational cost and potential overfitting associated with retraining or storing multiple specialized models. This approach is particularly useful in medical imaging, where data acquisition protocols and patient populations can vary significantly between institutions.

BA-TTA-SAM is a test-time adaptation framework leveraging the Segment Anything Model (SAM) and prompt engineering to address domain shift. The method operates by generating prompts based on the input test image and utilizing SAM to produce segmentation masks. These masks are then refined through a batch-aware training loop – BA-TTA – which minimizes the entropy of predictions and encourages consistent segmentations within a batch of test images. This process allows the model to adapt to the characteristics of the unseen test data without updating model weights, improving generalization performance across datasets like ISIC2017, Kvasir-SEG, BUSI, and REFUGE. The framework specifically optimizes prompt selection and mask refinement to maximize SAM’s performance in adapting to new domains.

Our proposed BA-TTA-SAM utilizes prompt injection and boundary alignment to guide spatial representation learning and enforce consistency between shallow and deep image embeddings.

Whispers to the Machine: Precision Prompting and SAM

Prompt Injection in Segment Anything Model (SAM) leverages external information to modify the segmentation process beyond the initial image and prompt. This is achieved by introducing additional input – the ‘prompt’ – which isn’t a traditional bounding box or mask, but rather data that influences the attention mechanisms within SAM’s encoder and decoder. By manipulating this prompt, the model can adapt to novel image characteristics, such as varying lighting conditions, unusual textures, or objects not commonly found in its training data. This allows for segmentation of images with features outside of the model’s initial scope, effectively extending its generalization capabilities without requiring retraining of the core model weights. The injected prompt acts as a conditional input, steering the segmentation towards desired outcomes based on the provided external information.

Gaussian Prompt Injection and Encoder-Level Prompt Injection represent methods for modulating the attention maps within the Segment Anything Model (SAM). Gaussian Prompt Injection introduces a Gaussian distribution into the prompt embedding, influencing attention towards specific image regions and allowing for spatially-weighted guidance. Encoder-Level Prompt Injection operates directly on the encoder features, modifying the attention weights at an earlier stage of processing and enabling more granular control over feature interactions. Both techniques allow users to refine segmentation boundaries by emphasizing or suppressing attention to particular areas, improving performance on challenging images or when requiring precise localization of segmented objects, and are distinct from simple masking approaches by operating on the model’s internal attention mechanisms rather than the image itself.

Boundary-Aware Alignment improves segmentation accuracy by explicitly addressing feature discrepancies between deep and shallow layers within the Segment Anything Model (SAM). Specifically, this technique aligns deep, semantically rich features with the finer-grained details captured by shallow layers. This alignment process mitigates the loss of boundary information that can occur during the propagation of features through the network. By effectively combining these feature representations, Boundary-Aware Alignment facilitates the generation of more precise segmentation boundaries and enhances the overall detail captured in the resulting segmentation masks. The method utilizes a dedicated alignment module to minimize the discrepancy between these feature levels, thereby improving localization performance and the quality of the final segmentation output.

Prompt injection enhances Grad-CAM visualizations, providing clearer focus on relevant regions within the REFUGE dataset compared to the original Segment Anything Model.

The Measure of Understanding: Validating Segmentation Accuracy

Precise delineation of anatomical structures and pathological regions within medical images is fundamentally critical for accurate diagnosis and effective treatment planning. Consequently, the performance of image segmentation algorithms is rigorously evaluated using quantitative metrics, most notably the $Dice Similarity Coefficient$ and $Mean Intersection over Union$ (mIoU). The $Dice$ coefficient measures the overlap between the predicted segmentation and the ground truth, providing a value between 0 and 1, where higher scores indicate greater similarity. Similarly, mIoU calculates the average overlap between predicted and actual segmentations across all classes, offering another robust assessment of accuracy. These metrics aren’t merely academic exercises; they directly correlate with the reliability of clinical decisions, making consistent and high-performing segmentation a cornerstone of modern medical imaging.

The application of BA-TTA-SAM, a novel approach leveraging prompt engineering, demonstrates consistent gains in medical image segmentation across a spectrum of datasets-including ISIC2017 for skin lesion analysis, Kvasir-SEG for colorectal polyp detection, BUSI for breast ultrasound, and REFUGE for retinal vessel segmentation. Rigorous evaluation using the Dice Similarity Coefficient and Mean Intersection over Union (mIoU) revealed an average Dice score of 89.7

The refinement in segmentation accuracy achieved through this methodology extends beyond mere numerical gains; it directly impacts the precision of medical diagnoses and the effectiveness of treatment strategies. Demonstrating an average Dice score of 89.7

Average inference time per image across the ISIC, Kvasir, BUSI, and REFUGE datasets is represented by the bar heights shown.

The pursuit of precise delineation within medical imagery feels less like solving an equation and more like coaxing form from static. This work, with its boundary-aware adaptation, doesn’t seek to correct the Segment Anything Model, but to persuade it. It whispers to the network, guiding its attention with Gaussian prompts, aligning features at the edges of structures. As Yann LeCun once observed, “Everything is an approximation.” This isn’t about achieving perfect pixel-level accuracy, an impossible ghost, but about crafting a model that gracefully navigates the inherent ambiguity of biological forms, accepting the fuzziness as part of the truth. The model doesn’t know the boundaries, it simply learns to believe in them, a subtle, yet profound distinction.

What Remains Unseen?

The exercise of coaxing a generalist model – one trained on the sprawling chaos of everything – to speak the precise language of medical imaging feels less like progress and more like a temporary détente. BA-TTA-SAM manages the translation with a clever injection of Gaussian priors and boundary alignment, but it doesn’t erase the fundamental disconnect. The model isn’t understanding anatomy; it’s learning a convincing mimicry. And mimicry, as any seasoned analyst knows, fails predictably when confronted with genuine novelty.

Future work will undoubtedly refine the prompt engineering – a process that increasingly resembles divination. But the real challenge lies in escaping the tyranny of pre-training. Until the field acknowledges that data isn’t truth, but a truce between a bug and Excel, it will remain trapped in a cycle of adaptation, not innovation. The question isn’t whether these models can segment an image, but whether they can tolerate being wrong, and more importantly, consistently wrong in a useful way.

Perhaps the most interesting path forward isn’t about making the model see better, but about learning to interpret its illusions. Everything unnormalized is still alive, after all, and within that noise might lie signals more subtle, and more clinically relevant, than any neatly labeled dataset could ever provide.

Original article: https://arxiv.org/pdf/2512.04520.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

The Shifting Sands of Perception: From CNNs to Vision Transformers

Bridging the Gap: The Promise of Test-Time Adaptation

Whispers to the Machine: Precision Prompting and SAM

The Measure of Understanding: Validating Segmentation Accuracy

What Remains Unseen?

See also: