Computer Vision
Machine seeing
The AI sub-field that lets computers extract meaning from visual data — classification, object detection, segmentation, OCR.
Computer Vision (CV) is the discipline of getting computers to make sense of images and video. It answers: "what's in this photo?", "where is this object?", "how many people are in this scene?", "what does this handwriting say?"
Major task types:
- Image classification: one label per image ("cat"/"dog")
- Object detection: bounding boxes + labels (YOLO, Faster R-CNN)
- Segmentation: pixel-wise labels (Mask R-CNN, SAM)
- OCR: reading text in images (Tesseract, EasyOCR, GPT-4V)
- Pose estimation: locating human joints
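Object detection quality is usually scored with intersection-over-union (IoU) between a predicted box and a ground-truth box. A minimal sketch (boxes as `(x1, y1, x2, y2)` corner tuples, a convention I'm assuming here):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Detectors typically count a prediction as correct when IoU with the ground truth exceeds a threshold such as 0.5.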
The field was dominated for years by CNNs (Convolutional Neural Networks). Since 2020, Vision Transformers (ViT) and multimodal models (GPT-4V, Claude Vision) have taken over.
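The core operation in a CNN is sliding a small kernel over the image and computing a weighted sum at each position. A dependency-free sketch of a valid-mode 2D "convolution" (strictly, cross-correlation, which is what deep-learning conv layers actually compute):

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation over nested lists of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Weighted sum of the patch under the kernel
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A [[-1, 1]] kernel responds only where pixel intensity steps up,
# i.e. it acts as a tiny vertical-edge detector.
edges = conv2d([[0, 0, 1, 1]], [[-1, 1]])  # → [[0, 1, 0]]
```

A real CNN learns the kernel weights from data and stacks many such layers, but the arithmetic per layer is exactly this.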
Like the human eye + brain: light hits photoreceptors (seeing), and the brain interprets: "that's a cat, that's a table." CV implements the same pipeline mathematically — pixels become numbers, pass through successive layers, and out comes "what's there, and where."
Tesla's Autopilot is real-time CV in production: 8 cameras at 36 fps, and for each frame:
1. Object detection → cars, pedestrians, cyclists, signs
2. Lane detection → lane lines and distances
3. Depth estimation → a 3D scene map
4. Path planning → the next motion decision
The pipeline must finish in under 50 ms — a late decision can be fatal. Here CV isn't an AI demo; it's the foundation of physical safety.
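A hard per-frame budget like this is usually enforced with an explicit deadline check around the pipeline. A hypothetical sketch (stage names and the 50 ms figure follow the text above; none of this reflects Tesla's actual code):

```python
import time

FRAME_BUDGET_S = 0.050  # 50 ms end-to-end budget per frame

def process_frame(frame, stages):
    """Run pipeline stages in order; report whether the deadline was met."""
    start = time.monotonic()
    result = frame
    for stage in stages:  # e.g. detect, lanes, depth, plan
        result = stage(result)
    elapsed = time.monotonic() - start
    return result, elapsed <= FRAME_BUDGET_S
```

In a real system, a missed deadline would trigger a fallback (reuse the previous plan, degrade to a cheaper model) rather than just being logged.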
Good fits:
- Extracting structured data from images (form OCR, invoices)
- Automated moderation (flagging uploads)
- Medical imaging: X-ray, MRI, microscopy analysis
- Manufacturing QA — camera-based defect detection
- Accessibility — describing images for visually impaired users
Poor fits:
- Text-only tasks (CV is overkill)
- Replacing precise medical diagnosis — assistive yes, doctor-replacement no
- Low-resolution/dark images — models fail badly
Training data ≠ production
A model trained on ImageNet often fails on real-world low light, unusual angles, and occlusion. Fine-tune on data drawn from production.
Bias and fairness
Face recognition has historically performed worse on certain skin tones and genders. Demographic balance in training data is critical.
Adversarial attacks
Imperceptible noise added to an image can make a model see a "gibbon" instead of a "panda". CV models are brittle; treat them as an attack surface in production.
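The classic attack is the Fast Gradient Sign Method (FGSM): nudge every input dimension by a tiny ε in the direction that most hurts the model. For a linear classifier the gradient of the score w·x + b with respect to x is just w, which makes the idea easy to show without a deep-learning framework (the tiny model and numbers below are illustrative, not the panda/gibbon network):

```python
def predict(w, b, x):
    """Binary linear classifier: 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def fgsm(w, x, eps):
    """FGSM on a linear model: step each feature by -eps * sign(gradient).

    The gradient of the score w.x + b w.r.t. x is w itself, so we step
    against sign(w) to push the score down. (Zero weights treated as -1
    for simplicity in this sketch.)
    """
    return [xi - eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]

w, b = [0.5, -0.3], 0.0
x = [1.0, 1.0]            # classified as 1 (score = 0.2)
x_adv = fgsm(w, x, 0.5)   # each pixel moved by at most 0.5
# x_adv is now classified as 0, though it differs from x by only eps per dim
```

The same one-step recipe, with gradients obtained by backpropagation instead of read off analytically, flips modern image classifiers with perturbations too small to see.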