Multimodal
Many modes, one model
An LLM that can process multiple data types — text, image, audio, sometimes video — in a single model.
Classic LLMs only take text in and produce text out. Multimodal models handle multiple modalities in the same model: upload a photo and ask "what's in this?", give it an audio file for a transcript, or pass mixed image + text input.
Under the hood: each modality (image, audio) is first projected into a shared embedding space, then the LLM treats those embeddings like extra tokens. The model has "eyes" and "ears" but the brain is still the same.
Examples: GPT-4o, Claude Sonnet, Gemini, Llama 4. Each supports different modalities at different quality levels — one might have great OCR while another has stronger audio understanding.
Old phone: only makes calls. Smartphone: calls + photos + voice notes + GPS + internet. Same device, multiple sense organs. Multimodal LLM = one mind with multiple senses.
A user uploads a photo of their fridge and asks "what can I cook with this?" The multimodal model: 1. Analyzes the image: sees milk, eggs, cheese, tomatoes, cucumber. 2. Adds that to the text context. 3. Replies "Try 3 eggs + cheese + tomatoes for menemen, recipe…"
What a text-only LLM can't do: the user would have had to type out every ingredient first.
- Visual analysis: OCR, product recognition, screen interpretation
- Speech → text: transcripts, call analysis
- Mixed input: 'how do I fix the bug shown in this screenshot, in code?'
- Accessibility: describing photos for visually impaired users
- Pure-text tasks — multimodal models are usually slower and pricier
- Specialist tasks where dedicated vision/audio models are far better (medical imaging, music analysis)
- Precise visual measurement — models can't read pixels exactly or give coordinates
Token cost balloons
A single high-res image is ~1500–2000 tokens. A 5-image prompt fills the context window.
Uneven quality across modalities
A model can be excellent at text but weak at audio transcription. Test each modality separately before shipping.
Subtle visual errors
Counting, distance, ratio questions are still weak. 'There are 7 people' might really be 5. Add a verification layer for critical decisions.