Lesson 2.2: Computer Vision Tasks


Computer vision is a field of artificial intelligence (AI) that enables machines to interpret and understand visual information from the world, such as images and videos. It involves various tasks that mimic human visual perception. Here are some key computer vision tasks:

1. Image Classification

  • Goal: Assign a label to an entire image (e.g., "cat," "dog," "car").
  • Example: Identifying whether an image contains a cat or a dog.
  • Models: CNN (Convolutional Neural Networks), ResNet, ViT (Vision Transformer).

2. Object Detection

  • Goal: Locate and classify multiple objects within an image using bounding boxes.
  • Example: Detecting pedestrians, cars, and traffic signs in a self-driving car system.
  • Models: YOLO (You Only Look Once), Faster R-CNN, SSD (Single Shot MultiBox Detector).

3. Semantic Segmentation

  • Goal: Assign a class label to each pixel in an image (dense prediction).
  • Example: Medical image analysis (identifying tumors in MRI scans).
  • Models: U-Net, FCN (Fully Convolutional Network), DeepLab.

4. Instance Segmentation

  • Goal: Detect and segment individual object instances (combines object detection + semantic segmentation).
  • Example: Differentiating between multiple cars in an image.
  • Models: Mask R-CNN, SOLO (Segmenting Objects by Locations).

5. Object Tracking

  • Goal: Follow a specific object's movement across video frames over time.
  • Example: Tracking a player in a sports video.
  • Methods: SORT (Simple Online and Realtime Tracking), DeepSORT, Kalman Filters.

6. Pose Estimation (Keypoint Detection)

  • Goal: Detect human or animal body poses by identifying key joints (e.g., elbows, knees).
  • Example: Fitness apps analyzing workout form.
  • Models: OpenPose, HRNet (High-Resolution Network).

7. Depth Estimation

  • Goal: Predict the distance (depth) of objects from the camera.
  • Example: Autonomous robots navigating environments.
  • Methods: Stereo vision, Monocular depth estimation (MiDaS, DORN).

8. Image Generation & Synthesis

  • Goal: Create new images using AI (e.g., GANs, Diffusion Models).
  • Example: Generating realistic faces with StyleGAN.
  • Models: GANs (Generative Adversarial Networks), Stable Diffusion.

9. Optical Character Recognition (OCR)

  • Goal: Extract text from images or scanned documents.
  • Example: Digitizing handwritten notes or license plate recognition.
  • Tools: Tesseract, EasyOCR, PaddleOCR.

10. Face Recognition

  • Goal: Identify or verify a person’s identity using facial features.
  • Example: Unlocking smartphones with Face ID.
  • Models: FaceNet, ArcFace, DeepFace.

11. Action Recognition (Video Analysis)

  • Goal: Classify human actions in videos (e.g., running, dancing).
  • Example: Surveillance systems detecting suspicious activities.
  • Models: 3D CNNs, Two-Stream Networks, SlowFast.

12. Image Super-Resolution

  • Goal: Enhance low-resolution images to higher resolution.
  • Example: Upscaling old photos.
  • Models: SRGAN (Super-Resolution GAN), ESRGAN.

13. Image Captioning

  • Goal: Generate a textual description of an image.
  • Example: Helping visually impaired users understand images.
  • Models: CNN + RNN (e.g., Show and Tell), Transformers (e.g., BLIP).

14. Anomaly Detection

  • Goal: Identify unusual patterns or defects in images.
  • Example: Detecting cracks in industrial products.
  • Methods: Autoencoders, GAN-based anomaly detection.

15. Visual Question Answering (VQA)

  • Goal: Answer questions about an image.
  • Example: AI answering "What color is the car?" from an image.
  • Models: LSTM + CNN, Vision-Language Transformers (e.g., BLIP-2)
All systems normal

© 2025 2023 Sanjeeb KC. All rights reserved.