Two Distinct Tasks
Image understanding splits into two fundamentally different problems:
Object Detection: locate and classify distinct objects in an image. Output is a set of bounding boxes (rectangular regions), each labeled with a class and a confidence score. Faster and simpler than segmentation because a rectangle is only a coarse approximation of an object's boundary.
Semantic Segmentation: classify every pixel in the image. Output is a dense label map the same size as the input image. Slower and harder because the model must predict a class for every pixel, potentially millions of them, while maintaining spatial precision throughout.
The right task depends on what downstream use needs: detection suffices for counting objects or finding approximate locations; segmentation is required for precise boundary delineation (medical imaging, autonomous driving scene understanding).
Object Detection Architectures
R-CNN family (Region-based):
- R-CNN: propose regions using selective search, warp each region, classify with CNN. Accurate but slow — each region is processed independently.
- Fast R-CNN: compute CNN features on the full image once, then extract per-region features from the feature map. Major speedup.
- Faster R-CNN: replace selective search with a Region Proposal Network (RPN) that shares CNN features with the detector. End-to-end trainable, near real-time.
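For concreteness, here is a minimal inference sketch using torchvision's COCO-pretrained Faster R-CNN; the image path and the 0.5 confidence cutoff are placeholder choices, not part of the model itself:

```python
# Minimal Faster R-CNN inference with torchvision's pretrained model.
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # COCO-pretrained
model.eval()

image = read_image("street.jpg")                     # placeholder path; uint8 CHW tensor
image = convert_image_dtype(image, torch.float)      # scale to [0, 1]

with torch.no_grad():
    output = model([image])[0]   # the model takes a list of images

# The output dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
keep = output["scores"] > 0.5    # arbitrary confidence cutoff
print(output["boxes"][keep], output["labels"][keep])
```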
YOLO (You Only Look Once): frames detection as a single regression problem — the image is divided into a grid, and each cell predicts bounding boxes and class probabilities simultaneously. Extremely fast (real-time), trading some precision for speed. Subsequent versions (YOLOv5, YOLOv8, etc.) close much of the accuracy gap.
SSD (Single Shot Detector): similar single-pass philosophy to YOLO but uses multiple feature maps at different scales to detect objects of different sizes. Good balance of speed and accuracy.
RetinaNet: addresses the class imbalance problem inherent in dense detection (far more background anchors than object anchors). Uses Focal Loss to down-weight easy background examples and focus training on hard cases. State-of-the-art accuracy at the time of introduction.
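A minimal sketch of the focal loss in its binary, per-anchor form, assuming raw logits and float 0/1 targets (alpha = 0.25 and gamma = 2 are the settings reported in the paper):

```python
# Focal loss sketch: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Easy examples (p_t near 1) are down-weighted by (1 - p_t)^gamma.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```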
Semantic Segmentation Architectures
FCN (Fully Convolutional Network): replaces the fully connected layers of classification CNNs with convolutional layers, enabling dense per-pixel output at the input's spatial resolution. The first practical end-to-end trainable segmentation architecture.
U-Net: encoder-decoder architecture with skip connections between corresponding encoder and decoder layers. The skip connections preserve fine spatial detail lost during downsampling — critical for biomedical image segmentation where precise boundary location matters. Standard baseline for medical imaging tasks.
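A toy encoder-decoder sketch of the idea in PyTorch; the depth and channel widths are deliberately small and are not those of the original U-Net:

```python
# Tiny U-Net sketch: two downsampling stages, two upsampling stages,
# with skip connections concatenating encoder features into the decoder.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 (upsampled) + 32 (skip)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(self.pool(e1))       # 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                # per-pixel class logits

logits = TinyUNet()(torch.randn(1, 3, 64, 64))  # -> (1, 2, 64, 64)
```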
Mask R-CNN: extends Faster R-CNN by adding a segmentation head that predicts a binary mask for each detected object. Produces instance segmentation (separate masks for each object instance, not just class-level pixel labels). The RoIAlign layer (replacing RoIPool) improves mask accuracy by avoiding quantization artifacts.
SegNet: encoder-decoder architecture using max-pooling indices from the encoder to guide upsampling in the decoder. More memory-efficient than U-Net because it doesn’t store full feature maps for skip connections.
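The core mechanism, sketched in PyTorch: the encoder's max-pooling records argmax indices, and the decoder reuses them for sparse, guided upsampling:

```python
# Max-pooling with indices (encoder) and index-guided unpooling (decoder).
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 8, 32, 32)
pooled, indices = pool(x)           # indices record where each max came from
restored = unpool(pooled, indices)  # values placed back at their original positions
print(restored.shape)               # torch.Size([1, 8, 32, 32])
```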
Instance vs. Semantic Segmentation
Two variants of segmentation are often confused:
Semantic segmentation assigns a class label to every pixel but does not distinguish between separate instances. If two cars overlap, all their pixels get labeled “car” — you cannot tell where car 1 ends and car 2 begins.
Instance segmentation identifies separate object instances. Each car gets its own mask. Mask R-CNN is the dominant approach: it runs Faster R-CNN for detection, then adds a per-RoI binary mask prediction head applied to each detected bounding box independently.
Panoptic segmentation combines both: semantic labels for background elements (road, sky, vegetation) and instance-level masks for foreground objects (cars, pedestrians). The panoptic quality (PQ) metric combines detection and segmentation quality into a single score.
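For reference, panoptic quality is defined over predicted and ground-truth segments matched at IoU > 0.5 (matched pairs are the true positives), and it factors into a segmentation-quality term times a recognition-quality term:

```latex
PQ = \frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p,g)}
          {|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
   = \underbrace{\frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{SQ}}
     \times
     \underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{RQ}}
```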
Performance Metrics
Detection metrics:
- IoU (Intersection over Union): area of overlap divided by area of union between predicted and ground-truth bounding boxes. A prediction counts as correct if its IoU is at or above a threshold, conventionally 0.5 (or 0.75 for stricter evaluation); see the sketch after this list.
- mAP (mean Average Precision): per-class average precision, averaged across all classes (COCO additionally averages over IoU thresholds from 0.50 to 0.95). The standard detection benchmark metric on COCO and Pascal VOC.
- PR curve / AP: a precision-recall curve is computed per class; the area under that curve is the class's average precision (AP).
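A small IoU sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
# IoU of two axis-aligned boxes; returns 0 when they do not overlap.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection rectangle
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))    # 25 / 175 ≈ 0.143
```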
Segmentation metrics:
- Pixel accuracy: fraction of correctly labeled pixels. Misleading on imbalanced classes (background dominates).
- Mean IoU (mIoU): IoU averaged across all classes (see the sketch after this list). Standard for semantic segmentation benchmarks.
- PSNR / SSIM: for reconstruction tasks (super-resolution, denoising) rather than classification.
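A sketch of mIoU computed from a confusion matrix, assuming integer label maps of identical shape with values in [0, num_classes); classes that appear in neither map are excluded from the average here:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    # confusion[i, j] counts pixels with ground truth i predicted as j.
    confusion = np.bincount(
        target.reshape(-1) * num_classes + pred.reshape(-1),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    tp = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union          # NaN where a class never appears
    return np.nanmean(iou)        # average over classes that do appear
```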
Data Augmentation for Computer Vision
Image models are sensitive to the distribution of training images. Augmentation artificially expands the training set and improves robustness to real-world variation.
Standard augmentations:
- Geometric: random horizontal flip, random crop and resize, rotation (small angles), affine transforms
- Color jitter: random brightness, contrast, saturation, hue variation — prevents models from relying on exact color values
- CutOut / Random Erasing: mask out rectangular regions of the input — forces the model to use distributed features rather than a single discriminative patch
- MixUp / CutMix: blend two training images and their labels — acts as a strong regularizer
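A minimal MixUp sketch for a classification batch, following the common formulation of blending inputs and interpolating the two losses (alpha = 0.2 is a typical but arbitrary choice):

```python
import torch
import torch.nn.functional as F

def mixup_step(model, x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))        # a random partner for each example
    mixed = lam * x + (1 - lam) * x[perm]   # blended images
    logits = model(mixed)
    # Equivalent to cross-entropy against the blended one-hot labels.
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])
```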
For detection and segmentation, augmentations must be applied consistently to both the image and its labels (bounding boxes must transform alongside the image). Frameworks like Albumentations handle this correctly.
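A sketch of such a pipeline in Albumentations; the specific transforms, the Pascal VOC box format, and the dummy data are illustrative:

```python
import numpy as np
import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
        A.ShiftScaleRotate(rotate_limit=10, p=0.5),
    ],
    # Tells Albumentations to transform the boxes alongside the image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder image
boxes = [(100, 100, 200, 250)]                    # (x_min, y_min, x_max, y_max)
out = transform(image=image, bboxes=boxes, class_labels=["car"])
# out["image"], out["bboxes"], and out["class_labels"] stay consistent.
```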
Test-time augmentation (TTA): run inference on multiple augmented versions of the same image, then aggregate predictions. Improves accuracy at the cost of inference time.
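A minimal TTA sketch for segmentation logits, averaging the identity view with a horizontal flip; note the flip must be undone on the prediction before aggregating:

```python
import torch

@torch.no_grad()
def tta_segment(model, image):                    # image: (1, C, H, W)
    logits = model(image)
    flipped = model(torch.flip(image, dims=[3]))  # flip the width axis
    logits += torch.flip(flipped, dims=[3])       # flip the prediction back
    return logits / 2
```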
Backbone Architectures
Detection and segmentation models are composed of a backbone (feature extractor) and a head (task-specific output layers). The backbone is typically pretrained on ImageNet and the head is trained from scratch or fine-tuned.
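The split is visible directly in torchvision: a sketch that loads a COCO-pretrained Faster R-CNN and swaps in a new box-prediction head for a hypothetical 4-class label set (3 foreground classes plus background):

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # pretrained backbone + heads
# Replace only the head; the backbone's pretrained features are kept.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=4)
```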
ResNet family: residual connections solve the vanishing gradient problem for very deep networks. ResNet-50 and ResNet-101 are the standard backbones in most detection papers.
EfficientNet: compound scaling of depth, width, and resolution. Better accuracy-per-FLOP tradeoff than ResNet for most classification tasks.
Vision Transformers (ViT): apply transformer self-attention to image patches. ViT-Large and ViT-Huge exceed ResNet at scale when pretrained on large datasets. For smaller datasets, ResNet-based backbones often still win.
Swin Transformer: hierarchical ViT with shifted windows. Computationally efficient enough for dense prediction tasks while retaining the benefits of attention. State-of-the-art on COCO detection benchmarks.
The practical rule: with large compute and large data, use Swin Transformer or ViT. Under constrained resources, ResNet-50 or EfficientNet-B4 with strong augmentation closes most of the gap.
Practical Considerations
Detection vs. segmentation cost: segmentation models require significantly more compute. For production systems with latency constraints, start with detection and only move to segmentation if the use case genuinely requires pixel-level precision.
Transfer learning: both families benefit heavily from pretraining on ImageNet (or COCO for detection). Fine-tuning a pretrained backbone requires far less data than training from scratch — typical fine-tuning datasets can be as small as a few thousand annotated images.
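One common recipe, sketched below: freeze the pretrained backbone entirely and train only the task heads on the small labeled set (the optimizer settings are illustrative):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
for p in model.backbone.parameters():   # freeze the pretrained feature extractor
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9)
```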
Data annotation cost: bounding box annotations cost $0.05–0.20 per box at annotation services. Pixel-level masks for segmentation cost $1–5 per image. At dataset scale (50,000+ images), this difference dominates the project budget. Plan annotation budget before choosing between detection and segmentation.
Choosing the Right Approach
| Requirement | Recommended Approach |
|---|---|
| Real-time inference (>30fps) | YOLOv8 — optimized for speed |
| Highest accuracy, no latency constraint | Swin-L backbone with DINO |
| Medical imaging (limited data) | U-Net with strong augmentation |
| Instance segmentation | Mask R-CNN or Mask2Former |
| Limited GPU budget | EfficientDet or YOLOv8-small |
| Panoptic segmentation | Panoptic-FPN or Mask2Former |
The most common mistake is choosing the most complex architecture rather than the simplest one that meets requirements. A well-tuned YOLOv8 on clean data outperforms a poorly-tuned Swin Transformer on noisy data. Start simple, validate the data pipeline and labeling quality, then scale the model if needed.