12 min read · 1 February 2023

Deep Learning for Image Tasks: Detection vs. Segmentation

A practical map of the deep learning landscape for image understanding — object detection vs. semantic segmentation, the key architectures for each, and which metrics to use.

Two Distinct Tasks

Image understanding splits into two fundamentally different problems:

Object Detection: locate and classify distinct objects in an image. Output is a set of bounding boxes — rectangular regions — each labeled with a class and confidence score. Faster and simpler than segmentation because rectangles are a coarse approximation of object boundaries.

Semantic Segmentation: classify every pixel in the image. Output is a dense label map the same size as the input image. Slower and harder because the model must make a prediction for every pixel (often millions of them), maintaining spatial precision throughout.
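
To make the contrast concrete, here is a toy sketch of the two output formats (shapes, class ids, and values are illustrative):

```python
import numpy as np

# Detection output: a small, variable-length set of boxes.
# Each row: [x_min, y_min, x_max, y_max, class_id, confidence]
detections = np.array([
    [ 34.0,  50.0, 210.0, 180.0, 2, 0.91],
    [250.0, 310.0, 400.0, 460.0, 0, 0.77],
])

# Segmentation output: one class id per pixel, the same spatial size
# as the input image (here 480x640): dense, not a handful of boxes.
label_map = np.zeros((480, 640), dtype=np.int64)
```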

The right task depends on what the downstream use case needs: detection suffices for counting objects or finding approximate locations; segmentation is required for precise boundary delineation (medical imaging, autonomous driving scene understanding).

Object Detection Architectures

R-CNN family (Region-based): two-stage detectors that first propose candidate regions, then classify and refine each one. R-CNN ran a CNN separately on every proposed region; Fast R-CNN shares convolutional features across proposals; Faster R-CNN replaces the external proposal step with a learned Region Proposal Network, making the pipeline end-to-end trainable. Slower than single-shot methods but historically the accuracy leaders, and the foundation for Mask R-CNN.

YOLO (You Only Look Once): frames detection as a single regression problem — the image is divided into a grid, and each cell predicts bounding boxes and class probabilities simultaneously. Extremely fast (real-time), trading some precision for speed. Subsequent versions (YOLOv5, YOLOv8, etc.) close much of the accuracy gap.
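
As a quick way to try this, the ultralytics package wraps recent YOLO releases; a minimal inference sketch (the weights file and image path are placeholders):

```python
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")      # small pretrained checkpoint
results = model("street.jpg")   # one forward pass over the whole image

for box in results[0].boxes:    # one Results object per input image
    print(box.xyxy, box.cls, box.conf)  # corners, class id, confidence
```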

SSD (Single Shot Detector): similar single-pass philosophy to YOLO but uses multiple feature maps at different scales to detect objects of different sizes. Good balance of speed and accuracy.

RetinaNet: addresses the class imbalance problem inherent in dense detection (far more background anchors than object anchors). Uses Focal Loss to down-weight easy background examples and focus training on hard cases. State-of-the-art accuracy at the time of introduction.
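
Focal loss itself is only a few lines. A PyTorch sketch of the binary per-anchor form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), normalized by the number of positive anchors as in the RetinaNet paper:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss. logits/targets: float tensors of the same shape;
    targets are 1.0 for object anchors, 0.0 for background anchors."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma is near 0 for easy examples, so they contribute little.
    loss = alpha_t * (1 - p_t) ** gamma * ce
    return loss.sum() / targets.sum().clamp(min=1)     # normalize by positives
```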

Semantic Segmentation Architectures

FCN (Fully Convolutional Network): replaces the fully connected layers of classification CNNs with convolutional layers and upsamples the coarse feature maps back to the input's spatial resolution, producing a dense per-pixel prediction. The first practical end-to-end trainable segmentation architecture.

U-Net: encoder-decoder architecture with skip connections between corresponding encoder and decoder layers. The skip connections preserve fine spatial detail lost during downsampling — critical for biomedical image segmentation where precise boundary location matters. Standard baseline for medical imaging tasks.
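
A minimal two-level U-Net in PyTorch shows the skip-connection mechanics (channel widths are illustrative; real U-Nets are four or five levels deep):

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1, self.enc2 = block(3, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.mid = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)   # 128 = 64 upsampled + 64 from the skip
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):            # H, W must be divisible by 4 here
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        m = self.mid(self.pool(e2))          # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                 # per-pixel class logits
```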

Mask R-CNN: extends Faster R-CNN by adding a segmentation head that predicts a binary mask for each detected object. Produces instance segmentation (separate masks for each object instance, not just class-level pixel labels). The RoIAlign layer (replacing RoIPool) improves mask accuracy by avoiding quantization artifacts.

SegNet: encoder-decoder architecture using max-pooling indices from the encoder to guide upsampling in the decoder. More memory-efficient than U-Net because it doesn’t store full feature maps for skip connections.

Instance vs. Semantic Segmentation

Two variants of segmentation are often confused:

Semantic segmentation assigns a class label to every pixel but does not distinguish between separate instances. If two cars overlap, all their pixels get labeled “car” — you cannot tell where car 1 ends and car 2 begins.

Instance segmentation identifies separate object instances. Each car gets its own mask. Mask R-CNN is the dominant approach: it runs Faster R-CNN for detection, then adds a per-RoI binary mask prediction head applied to each detected bounding box independently.
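
torchvision ships a COCO-pretrained Mask R-CNN, which makes a convenient baseline; a minimal inference sketch (the image path and the 0.5 score cutoff are arbitrary choices):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = convert_image_dtype(read_image("street.jpg"), torch.float)  # CHW, [0, 1]
with torch.no_grad():
    out = model([img])[0]     # dict: boxes, labels, scores, masks

keep = out["scores"] > 0.5                 # drop low-confidence detections
masks = out["masks"][keep]                 # one soft [1, H, W] mask per instance
binary_masks = masks.squeeze(1) > 0.5      # threshold to binary instance masks
```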

Panoptic segmentation combines both: semantic labels for background elements (road, sky, vegetation) and instance-level masks for foreground objects (cars, pedestrians). The panoptic quality (PQ) metric combines detection and segmentation quality into a single score.
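
Given a set of matched segments (a predicted and a ground-truth segment count as a match when their IoU exceeds 0.5), PQ factors into segmentation quality times recognition quality; a small sketch:

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """matched_ious: IoUs of the true-positive segment matches (each > 0.5)."""
    tp = len(matched_ious)
    if tp + n_fp + n_fn == 0:
        return 0.0
    sq = sum(matched_ious) / max(tp, 1)          # segmentation quality
    rq = tp / (tp + 0.5 * n_fp + 0.5 * n_fn)     # recognition quality (F1-like)
    return sq * rq                               # PQ = SQ * RQ
```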

Performance Metrics

Detection metrics:

IoU (Intersection over Union): overlap between a predicted and a ground-truth box, computed as intersection area divided by union area. A detection typically counts as correct when IoU exceeds a threshold (0.5 is the classic PASCAL VOC choice).

mAP (mean Average Precision): the area under the precision-recall curve per class, averaged over classes. COCO additionally averages over IoU thresholds from 0.5 to 0.95, rewarding precise localization.
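
A reference implementation of box IoU, the primitive underneath mAP:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```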

Segmentation metrics:

Pixel accuracy: fraction of pixels labeled correctly. Easy to read but misleading under class imbalance, since labeling everything as the dominant background class can still score high.

mIoU (mean Intersection over Union): per-class IoU between the predicted and ground-truth label maps, averaged over classes. The standard benchmark metric (PASCAL VOC, Cityscapes).

Dice coefficient: twice the overlap divided by the combined size of the two masks. Closely related to IoU and the conventional choice in medical imaging.
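
Sketches of Dice and mIoU, assuming NumPy arrays of matching shape:

```python
import numpy as np

def dice(pred, target):
    """Dice for binary masks: 2|P ∩ G| / (|P| + |G|)."""
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + 1e-8)

def mean_iou(pred, target, n_classes):
    """Mean IoU over classes for integer label maps of the same shape."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                     # class absent in both; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```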

Data Augmentation for Computer Vision

Image models are sensitive to the distribution of training images. Augmentation artificially expands the training set and improves robustness to real-world variation.

Standard augmentations: horizontal flips, random crops and rescaling, small rotations, color jitter (brightness, contrast, saturation, hue), and occlusion methods such as Cutout or random erasing. Stronger recipes add Mixup or CutMix, which blend pairs of training images together with their labels.

For detection and segmentation, augmentations must be applied consistently to both the image and its labels (bounding boxes and masks must transform alongside the image). Frameworks like Albumentations handle this correctly, as in the sketch below.
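
A hedged Albumentations sketch for detection, where `image` is an HWC NumPy array and `boxes`/`labels` are its annotations (the transform choices and parameters are illustrative, not a recommended recipe):

```python
import albumentations as A  # pip install albumentations

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.3),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                           rotate_limit=10, p=0.5),
    ],
    # Tell Albumentations how boxes are encoded so it can move them too.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

out = transform(image=image, bboxes=boxes, labels=labels)
aug_image, aug_boxes = out["image"], out["bboxes"]  # boxes moved with the image
```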

Test-time augmentation (TTA): run inference on multiple augmented versions of the same image, then aggregate predictions. Improves accuracy at the cost of inference time.
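
A minimal TTA sketch for segmentation, assuming a model that maps [B, 3, H, W] images to [B, C, H, W] logits; it averages over the identity and a horizontal flip (for detection, predicted boxes would need to be mirrored back instead):

```python
import torch

def predict_with_tta(model, image):
    """Average segmentation logits over the original and a flipped view."""
    with torch.no_grad():
        logits = model(image)                              # [B, C, H, W]
        flipped = model(torch.flip(image, dims=[-1]))      # flip the width axis
        logits = logits + torch.flip(flipped, dims=[-1])   # un-flip, accumulate
    return logits / 2
```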

Backbone Architectures

Detection and segmentation models are composed of a backbone (feature extractor) and a head (task-specific output layers). The backbone is typically pretrained on ImageNet and the head is trained from scratch or fine-tuned.
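
For instance, stripping the classification layers off a pretrained torchvision ResNet-50 leaves a reusable backbone to attach a task head to:

```python
import torch
import torchvision

# ImageNet-pretrained ResNet-50 with the global pool and fc head removed.
backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

features = backbone(torch.randn(1, 3, 224, 224))
print(features.shape)   # torch.Size([1, 2048, 7, 7]): input to a task head
```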

ResNet family: residual connections mitigate the vanishing gradient problem in very deep networks. ResNet-50 and ResNet-101 are the standard backbones in most detection papers.

EfficientNet: compound scaling of depth, width, and resolution. Better accuracy-per-FLOP tradeoff than ResNet for most classification tasks.

Vision Transformers (ViT): apply transformer self-attention to image patches. ViT-Large and ViT-Huge exceed ResNet at scale when pretrained on large datasets. For smaller datasets, ResNet-based backbones often still win.

Swin Transformer: hierarchical ViT with shifted windows. Computationally efficient enough for dense prediction tasks while retaining the benefits of attention. State-of-the-art on COCO detection benchmarks.

The practical rule: with large compute and large data, use Swin Transformer or ViT. Under constrained resources, a ResNet-50 or EfficientNet-B4 backbone with strong augmentation closes most of the gap.

Practical Considerations

Detection vs. segmentation cost: segmentation models require significantly more compute. For production systems with latency constraints, start with detection and only move to segmentation if the use case genuinely requires pixel-level precision.

Transfer learning: both families benefit heavily from pretraining on ImageNet (or COCO for detection). Fine-tuning a pretrained backbone requires far less data than training from scratch — typical fine-tuning datasets can be as small as a few thousand annotated images.
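
A typical fine-tuning sketch with torchvision's Faster R-CNN, following the common recipe of swapping the box predictor and (optionally) freezing the backbone; `num_classes=4` is a placeholder meaning 3 foreground classes plus background:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from COCO weights, replace the box head for our own classes.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=4)

# Optionally freeze the backbone so only the heads adapt at first.
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-3, momentum=0.9, weight_decay=1e-4)
```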

Data annotation cost: bounding box annotations cost $0.05–0.20 per box at annotation services. Pixel-level masks for segmentation cost $1–5 per image. At dataset scale (50,000+ images), this difference dominates the project budget. Plan annotation budget before choosing between detection and segmentation.

Choosing the Right Approach

Requirement | Recommended Approach
Real-time inference (>30 fps) | YOLOv8 (optimized for speed)
Highest accuracy, no latency constraint | Swin-L backbone with DINO
Medical imaging (limited data) | U-Net with strong augmentation
Instance segmentation | Mask R-CNN or Mask2Former
Limited GPU budget | EfficientDet or YOLOv8-small
Panoptic segmentation | Panoptic-FPN or Mask2Former

The most common mistake is choosing the most complex architecture rather than the simplest one that meets requirements. A well-tuned YOLOv8 on clean data outperforms a poorly-tuned Swin Transformer on noisy data. Start simple, validate the data pipeline and labeling quality, then scale the model if needed.

computer-vision deep-learning object-detection segmentation ml