AI Dictionary
Intermediate · ~1 min read · #computer-vision #vision #cnn

Computer Vision

Machine seeing

The AI sub-field that lets computers extract meaning from visual data — classification, object detection, segmentation, OCR.

[Figure: "Machine eyes: object recognition". Detected objects with confidence scores: sun 0.99, tree 0.94, car 0.96, person 0.92. The model labels each object with location + class + confidence.]
Definition

Computer Vision (CV) is the discipline of getting computers to make sense of images and video. It answers: "what's in this photo?", "where is this object?", "how many people are in this scene?", "what does this handwriting say?"

Major task types:
  • Image classification: one label ("cat/dog")
  • Object detection: draw boxes + labels (YOLO, Faster R-CNN)
  • Segmentation: pixel-wise labels (Mask R-CNN, SAM)
  • OCR: read text in images (Tesseract, EasyOCR, GPT-4V)
  • Pose estimation: detect human joints
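To make the object detection output concrete, here is a minimal sketch of how "location + class + confidence" might be represented and filtered. The `Detection` class, the `keep_confident` helper, and the box coordinates are hypothetical names and values chosen for illustration; only the labels and scores come from the figure. The `iou` function computes intersection over union, the standard overlap measure used to compare a predicted box with a reference box.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # predicted class name
    confidence: float  # model score in [0, 1]
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections, threshold=0.95):
    """Return only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d.confidence >= threshold]


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Labels and scores from the figure; box coordinates are made up.
detections = [
    Detection("sun", 0.99, (300, 10, 360, 70)),
    Detection("tree", 0.94, (20, 80, 120, 300)),
    Detection("car", 0.96, (150, 220, 330, 310)),
    Detection("person", 0.92, (350, 180, 400, 310)),
]

confident = keep_confident(detections)
print([d.label for d in confident])            # ['sun', 'car']
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))     # overlap of two shifted boxes
```

The threshold of 0.95 is arbitrary here; in practice it is tuned per application, trading missed objects against false alarms.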

For years the dominant architecture was the CNN (Convolutional Neural Network). Since 2020, Vision Transformers (ViT) and multimodal models (GPT-4V, Claude Vision) have taken over much of the field.

Analogy

Like the human eye and brain: light hits photoreceptors (vision), and the brain interprets it as "that's a cat, that's a table." CV approximates the same pipeline mathematically: pixels become numbers, pass through successive layers, and out comes "what's there, and where."
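As a rough illustration of "pixels become numbers, pass through successive layers", the sketch below runs one convolution plus a ReLU over a tiny grayscale image in plain Python. The 4x4 image and the hand-picked edge kernel are assumptions made for the example; a real CNN learns its kernels from data and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products.

    This is the cross-correlation that CNN layers apply.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# A 4x4 grayscale image: dark left half (0), bright right half (255).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

features = relu(convolve2d(image, kernel))
for row in features:
    print(row)  # each row is [0, 510, 0]: the response peaks at the edge
```

The output is high only in the column where dark meets bright, which is the sense in which an early layer "sees" an edge. Later layers combine such responses into shapes and, eventually, object classes.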

Real-world example

Tesla's Autopilot is real-time CV in production: 8 cameras at 36 fps, and for each frame:
  1. Object detection → cars, pedestrians, cyclists, signs
  2. Lane detection → lane lines and distances
  3. Depth estimation → 3D scene map
  4. Path planning → next motion decision

The pipeline must finish in under 50 ms, because late decisions kill. Here CV is not an AI novelty; it is the foundation of physical safety.
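The numbers above imply a tight budget, which a few lines of arithmetic make visible. The frame rate, camera count, and 50 ms deadline are taken from the text; the 100 km/h speed is an assumption added only to show what the latency means on the road.

```python
FPS = 36         # per-camera frame rate quoted above
CAMERAS = 8      # camera count quoted above
BUDGET_MS = 50   # end-to-end deadline quoted above

frame_interval_ms = 1000 / FPS                  # time between frames of one camera
frames_per_second_total = FPS * CAMERAS         # images entering the system per second
frames_in_budget = BUDGET_MS / frame_interval_ms


def distance_travelled_m(speed_kmh, latency_ms):
    """Distance covered while the pipeline is still computing."""
    return speed_kmh / 3.6 * latency_ms / 1000


print(f"frame interval: {frame_interval_ms:.1f} ms")             # 27.8 ms
print(f"frames per second, all cameras: {frames_per_second_total}")  # 288
print(f"frames arriving within the budget: {frames_in_budget:.1f}")  # 1.8
# Assumed speed of 100 km/h, not a figure from the text.
print(f"distance at 100 km/h: {distance_travelled_m(100, BUDGET_MS):.2f} m")  # 1.39 m
```

If both quoted figures hold, a new frame arrives roughly every 28 ms while a decision may take up to 50 ms, so the stages would have to overlap across frames rather than run strictly one frame at a time.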

When to use
  • Extracting structured data from images (form OCR, invoices)
  • Automated moderation (flagging uploads)
  • Medical imaging: X-ray, MRI, microscopy analysis
  • Manufacturing QA — camera-based defect detection
  • Accessibility — describing images for visually impaired users
When not to use
  • Text-only tasks (overkill)
  • Replacing precise medical diagnosis — assistive yes, doctor-replacement no
  • Low-resolution or very dark images: when the detail is not in the pixels, models fail badly
Common pitfalls

Training data ≠ production

An ImageNet-trained model often fails on real-world conditions such as low light, unusual angles, and occlusion. Fine-tune on production data.
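One cheap way to notice this gap before it hurts is to compare simple image statistics between the training set and production traffic. The sketch below is a minimal illustration, assuming 8-bit grayscale images stored as nested lists; the function names and pixel values are hypothetical, and mean brightness is only one of many statistics worth tracking. The `darken` helper doubles as a crude low-light augmentation for training images.

```python
def darken(image, factor):
    """Scale 8-bit pixel values by `factor` (0..1) to mimic low light."""
    return [[int(p * factor) for p in row] for row in image]


def mean_brightness(image):
    """Average pixel value of one image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def brightness_shift(train_images, prod_images):
    """Average brightness of the training set minus that of production."""
    train_avg = sum(mean_brightness(img) for img in train_images) / len(train_images)
    prod_avg = sum(mean_brightness(img) for img in prod_images) / len(prod_images)
    return train_avg - prod_avg


# Made-up 2x2 images: a daylight training sample and a night-time version of it.
daylight = [[200, 180], [220, 200]]
night = darken(daylight, 0.25)

print(night)                                   # [[50, 45], [55, 50]]
print(brightness_shift([daylight], [night]))   # 150.0
```

A large shift is a signal to collect and label production images for fine-tuning, or at least to add augmentations such as `darken` to the training pipeline.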

Bias and fairness

Face recognition has historically performed worse on certain skin tones and genders. Demographic balance in training data is critical.
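A first step toward catching this is to report accuracy per group rather than a single overall number. The sketch below assumes each evaluation record carries a group tag alongside the prediction and the true label; the group names and records are synthetic placeholders, not real measurements.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, predicted, actual). Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {g: correct[g] / total[g] for g in total}


def largest_gap(accuracies):
    """Difference between the best and worst performing group."""
    return max(accuracies.values()) - min(accuracies.values())


# Synthetic records for illustration only: (group, predicted, actual).
records = [
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "match"),
    ("group_b", "no_match", "no_match"),
    ("group_b", "match", "match"),
]

per_group = accuracy_by_group(records)
print(per_group)               # {'group_a': 1.0, 'group_b': 0.75}
print(largest_gap(per_group))  # 0.25
```

An overall accuracy of 87.5% would hide the fact that every error here falls on one group, which is why the per-group breakdown matters.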

Adversarial attacks

Noise that is imperceptible to humans, added to an image, can make a model see a 'gibbon' instead of a 'panda'. CV models can be brittle; plan for security in production.
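The mechanism can be sketched with a toy linear classifier and the fast gradient sign method (FGSM), the technique behind the well-known panda/gibbon example. Everything below is an illustrative assumption: the four "pixels", the weights, and the two-class rule stand in for a real network with millions of inputs, where a far smaller per-pixel change is enough.

```python
def sign(v):
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)


def score(weights, x):
    """Linear model: positive score means 'panda', otherwise 'gibbon'."""
    return sum(w * xi for w, xi in zip(weights, x))


def predict(weights, x):
    return "panda" if score(weights, x) > 0 else "gibbon"


def fgsm_step(weights, x, epsilon):
    """Shift every input by `epsilon` in the direction that lowers the score.

    For a linear model the gradient of the score with respect to x is `weights`,
    so the attack only needs the sign of each weight.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Toy values, chosen so the arithmetic is exact in floating point.
weights = [0.5, -0.25, 0.75, -0.5]
x = [0.5, 0.5, 0.25, 0.25]

x_adv = fgsm_step(weights, x, epsilon=0.125)

print(predict(weights, x), score(weights, x))          # panda 0.1875
print(predict(weights, x_adv), score(weights, x_adv))  # gibbon -0.0625
print(max(abs(a - b) for a, b in zip(x, x_adv)))       # 0.125
```

No input moves by more than `epsilon`, yet the label flips, because many small nudges all push the score the same way. Common mitigations include adversarial training and input validation, though none is a complete defense.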