Computer Vision
Machine seeing
The AI sub-field that lets computers extract meaning from visual data — classification, object detection, segmentation, OCR.
Computer Vision (CV) is the discipline of getting computers to make sense of images and video. It answers: "what's in this photo?", "where is this object?", "how many people are in this scene?", "what does this handwriting say?"
Major task types:
- Image classification: one label per image ("cat"/"dog")
- Object detection: bounding boxes + labels (YOLO, Faster R-CNN)
- Segmentation: pixel-wise labels (Mask R-CNN, SAM)
- OCR: reading text in images (Tesseract, EasyOCR, GPT-4V)
- Pose estimation: locating human joints
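Object detection quality is usually scored with intersection-over-union (IoU) between a predicted box and a ground-truth box. A minimal sketch (boxes as `(x1, y1, x2, y2)` corner tuples, a convention I'm assuming here):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Detectors typically count a prediction as correct when IoU with the ground truth exceeds a threshold such as 0.5.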
The field was dominated for years by CNNs (Convolutional Neural Networks). Since 2020, Vision Transformers (ViT) and multimodal models (GPT-4V, Claude Vision) have taken over.
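The core operation in a CNN is sliding a small kernel over the image and computing a weighted sum at each position. A dependency-free sketch of a valid-mode 2D "convolution" (strictly, cross-correlation, which is what deep-learning conv layers actually compute):

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation over nested lists of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Weighted sum of the patch under the kernel
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A [[-1, 1]] kernel responds only where pixel intensity steps up,
# i.e. it acts as a tiny vertical-edge detector.
edges = conv2d([[0, 0, 1, 1]], [[-1, 1]])  # → [[0, 1, 0]]
```

A real CNN learns the kernel weights from data and stacks many such layers, but the arithmetic per layer is exactly this.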
Like the human eye + brain: light hits photoreceptors (seeing), and the brain interprets: "that's a cat, that's a table." CV implements the same pipeline mathematically — pixels become numbers, pass through successive layers, and out comes "what's there, and where."
Tesla's Autopilot is real-time CV in production: 8 cameras at 36 fps, and for each frame:
1. Object detection → cars, pedestrians, cyclists, signs
2. Lane detection → lane lines and distances
3. Depth estimation → a 3D scene map
4. Path planning → the next motion decision
The pipeline must finish in under 50 ms — a late decision can be fatal. Here CV isn't an AI demo; it's the foundation of physical safety.
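A hard per-frame budget like this is usually enforced with an explicit deadline check around the pipeline. A hypothetical sketch (stage names and the 50 ms figure follow the text above; none of this reflects Tesla's actual code):

```python
import time

FRAME_BUDGET_S = 0.050  # 50 ms end-to-end budget per frame

def process_frame(frame, stages):
    """Run pipeline stages in order; report whether the deadline was met."""
    start = time.monotonic()
    result = frame
    for stage in stages:  # e.g. detect, lanes, depth, plan
        result = stage(result)
    elapsed = time.monotonic() - start
    return result, elapsed <= FRAME_BUDGET_S
```

In a real system, a missed deadline would trigger a fallback (reuse the previous plan, degrade to a cheaper model) rather than just being logged.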
Good fits:
- Extracting structured data from images (form OCR, invoices)
- Automated moderation (flagging uploads)
- Medical imaging: X-ray, MRI, microscopy analysis
- Manufacturing QA — camera-based defect detection
- Accessibility — describing images for visually impaired users
Poor fits:
- Text-only tasks (CV is overkill)
- Replacing precise medical diagnosis — assistive yes, doctor-replacement no
- Low-resolution/dark images — models fail badly
Training data ≠ production
A model trained on ImageNet often fails on real-world low light, unusual angles, and occlusion. Fine-tune on data drawn from production.
Bias and fairness
Face recognition has historically performed worse on certain skin tones and genders. Demographic balance in training data is critical.
Adversarial attacks
Imperceptible noise added to an image can make a model see a "gibbon" instead of a "panda". CV models are brittle; treat them as an attack surface in production.
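The classic attack is the Fast Gradient Sign Method (FGSM): nudge every input dimension by a tiny ε in the direction that most hurts the model. For a linear classifier the gradient of the score w·x + b with respect to x is just w, which makes the idea easy to show without a deep-learning framework (the tiny model and numbers below are illustrative, not the panda/gibbon network):

```python
def predict(w, b, x):
    """Binary linear classifier: 1 if w.x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def fgsm(w, x, eps):
    """FGSM on a linear model: step each feature by -eps * sign(gradient).

    The gradient of the score w.x + b w.r.t. x is w itself, so we step
    against sign(w) to push the score down. (Zero weights treated as -1
    for simplicity in this sketch.)
    """
    return [xi - eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]

w, b = [0.5, -0.3], 0.0
x = [1.0, 1.0]            # classified as 1 (score = 0.2)
x_adv = fgsm(w, x, 0.5)   # each pixel moved by at most 0.5
# x_adv is now classified as 0, though it differs from x by only eps per dim
```

The same one-step recipe, with gradients obtained by backpropagation instead of read off analytically, flips modern image classifiers with perturbations too small to see.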