AI Dictionary
Intermediate · ~2 min read · #multimodal #vision #audio

Multimodal

Many modes, one model

An LLM that can process multiple data types — text, image, audio, sometimes video — in a single model.

[Diagram: one model, many modalities — text, image, and audio input streams all map into a shared embedding space, yielding a unified representation ("a cat on a sunny porch").]
Definition

Classic LLMs only take text in and produce text out. Multimodal models handle multiple modalities in the same model: upload a photo and ask "what's in this?", give it an audio file for a transcript, or pass mixed image + text input.

Under the hood: each modality (image, audio) is first projected into a shared embedding space, then the LLM treats those embeddings like extra tokens. The model has "eyes" and "ears" but the brain is still the same.
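The projection step can be sketched in a few lines. This is a minimal illustration with made-up dimensions and random weights, not any real model's architecture: a vision encoder's patch vectors are linearly projected into the text embedding dimension, then simply concatenated with the text token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a vision encoder emits 16 patch vectors of size 512,
# while the language model uses 768-dimensional token embeddings.
n_patches, vision_dim, text_dim = 16, 512, 768

image_features = rng.normal(size=(n_patches, vision_dim))  # vision encoder output
W_proj = rng.normal(size=(vision_dim, text_dim)) * 0.02    # learned projection (random here)

image_tokens = image_features @ W_proj                     # now in the text embedding space
text_tokens = rng.normal(size=(5, text_dim))               # embeddings for 5 text tokens

# The LLM just sees one longer sequence of "tokens".
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # (21, 768)
```

From the transformer's point of view, the 16 image "tokens" are indistinguishable from text tokens — which is why the same brain can attend across modalities.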

Examples: GPT-4o, Claude Sonnet, Gemini, Llama 4. Each supports different modalities at different quality levels — one might have great OCR while another has stronger audio understanding.

Analogy

Old phone: only makes calls. Smartphone: calls + photos + voice notes + GPS + internet. Same device, multiple sense organs. Multimodal LLM = one mind with multiple senses.

Real-world example

A user uploads a photo of their fridge and asks "what can I cook with this?" The multimodal model:
  1. Analyzes the image: sees milk, eggs, cheese, tomatoes, cucumber.
  2. Adds that to the text context.
  3. Replies "Try 3 eggs + cheese + tomatoes for menemen, recipe…"

What a text-only LLM can't do: the user would have had to type out every ingredient first.

When to use
  • Visual analysis: OCR, product recognition, screen interpretation
  • Speech → text: transcripts, call analysis
  • Mixed input: "how do I fix the bug shown in this screenshot?"
  • Accessibility: describing photos for visually impaired users
When not to use
  • Pure-text tasks — multimodal models are usually slower and pricier
  • Specialist tasks where dedicated vision/audio models are far better (medical imaging, music analysis)
  • Precise visual measurement — models generally can't read exact pixel values or return reliable coordinates
Common pitfalls

Token cost balloons

A single high-res image is ~1,500–2,000 tokens. A 5-image prompt can approach 10,000 tokens before any text is added, enough to crowd out a small context window.
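The budgeting math above is worth automating before you ship. A back-of-envelope sketch, assuming ~1,750 tokens per image (the midpoint of the range above; real per-image costs vary by provider and resolution tier):

```python
# Assumed average cost per high-res image; actual tokenizers and
# image-tiling schemes differ across providers.
TOKENS_PER_IMAGE = 1750

def prompt_tokens(n_images: int, text_tokens: int) -> int:
    """Rough estimate of total prompt size in tokens."""
    return n_images * TOKENS_PER_IMAGE + text_tokens

budget = prompt_tokens(5, 500)   # 5 images + 500 tokens of text
print(budget)                    # 9250

# Share of a hypothetical 8k context window:
print(f"{budget / 8192:.0%}")    # 113% -- the prompt no longer fits
```

Running this kind of check before each request lets you downscale or drop images instead of getting a hard context-length error.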

Uneven quality across modalities

A model can be excellent at text but weak at audio transcription. Test each modality separately before shipping.

Subtle visual errors

Counting, distance, and ratio questions are still weak. "There are 7 people" might really be 5. Add a verification layer for critical decisions.