Lesson 2.3: Object Detection

Conventional Object Detection

A fixed-size detection window slides across the image at multiple scales and aspect ratios.
At each position, a classifier (e.g., SVM, CNN) checks if an object is present.
Such as Haar, SIFT, HOG etc.
DPM (Deformable Parts Model) models, Employed a sliding window approach to predict bounding boxes with higher scores.
This kind of method was extremely time-consuming, and the accuracy was not very high.

Number of Scales
- Objects appear at different sizes, requiring the window to resize and scan repeatedly.
- Example: A single image may need 10+ scale checks → more computations.
Number of Aspect Ratios
- Objects vary in shape (wide cars, tall pedestrians).
- Checking multiple aspect ratios (e.g., 1:1, 2:3, 3:2) increases processing time.
Sliding Across Width & Height
- The window moves pixel-by-pixel (or with a stride), leading to thousands of overlapping patches.
- Example: A 1000×1000px image with stride=16 → 3,900+ positions per scale!

Speed Up the Deep Network Computation (Faster Processing per Window)
- Problem: Running a CNN on every sliding window is slow.
- Solution: Optimize the CNN for speed while maintaining accuracy.
- Use lighter architectures (e.g., MobileNet, SqueezeNet) for the sliding window classifier.
- Apply network pruning or quantization to reduce computation.
- Limitation: Even with a fast CNN, processing thousands of windows is inefficient.
Reduce the Number of Proposals (Pass Only Relevant Windows to CNN)
- Problem: Most sliding windows contain just background (no objects).
- Solution: Use a cheap pre-filtering method to eliminate useless windows before CNN processing.
- Traditional method: Selective Search (R-CNN) generates fewer but higher-quality proposals.
- Modern method: Region Proposal Network (RPN in Faster R-CNN) predicts likely object regions.
- Benefit: Instead of thousands of windows, only ~200-2000 proposals are passed to CNN → Major speedup!

In R-CNN (Region-based CNN), the object detection pipeline is broken into three key steps to improve efficiency over the sliding window approach:

Region Proposal Generation (Category-Independent)
- Goal: Generate ~2000 regions ("proposals") that might contain objects (without knowing what they are yet).
- Method: Uses Selective Search (a fast, heuristic-based algorithm) to:
  - Merge similar pixels → forms potential object regions.
  - Prioritizes regions with uniform texture/color (likely objects).
  - Why? Avoids brute-force sliding windows by focusing only on "promising" areas.
Feature Extraction (CNN)
- Each proposed region is warped to a fixed size and fed into a CNN (e.g., AlexNet) to extract features.
- Key Difference from Sliding Window:
  - Processes ~2000 regions instead of millions of sliding windows.
Classification (SVM) & Bounding Box Refinement
- Classification: An SVM assigns a class label (e.g., "car," "dog") to each region.
- Bounding Box Regression: Adjusts proposal coordinates for tighter object fits.