World Model
An internal simulator inside the model
A model that builds an internal simulation of the real world — predicting physics, action consequences, and how objects move.
A classical model works "input → output." A world model goes one step further: inside the model lives a simulation of the world — an internal model of physics, causality, and temporal dynamics.
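That difference can be made concrete with a toy sketch. `ToyWorldModel` and its additive "physics" are invented for illustration; real world models learn their dynamics from data, but the shape of the idea is the same: a state, a transition function, and rollouts imagined without touching the real world.

```python
import numpy as np

# Classical model: a single input -> output mapping, no notion of time.
def classifier(x):
    return "cat" if x.mean() > 0.5 else "dog"

# World model: maintains a state and predicts how it evolves under actions,
# so futures can be simulated before acting.
# (Toy example: state is a 2D position, an action is a displacement.)
class ToyWorldModel:
    def __init__(self, state):
        self.state = np.asarray(state, dtype=float)

    def predict_next(self, state, action):
        # A learned dynamics network would go here; we use additive "physics".
        return state + np.asarray(action, dtype=float)

    def imagine(self, actions):
        # Roll the model forward internally; the real world is never consulted.
        s = self.state.copy()
        trajectory = [s.copy()]
        for a in actions:
            s = self.predict_next(s, a)
            trajectory.append(s.copy())
        return trajectory

model = ToyWorldModel(state=[0.0, 0.0])
traj = model.imagine([[1, 0], [1, 0], [0, 1]])
print(traj[-1])  # final imagined position
```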
Practically: an AI that can predict where a thrown ball lands, what's behind a door if you open it, what will happen if an agent presses a button. Beyond language — reasoning in the physical world.
Leading examples (2024-2025):
- OpenAI Sora: text-to-video, but not just a video generator; a world model underneath (object consistency, physics)
- Google Genie 2: generates an interactive 3D game world from a single image
- DeepMind DreamerV3: builds world models for RL in game environments
- NVIDIA Cosmos: physical world-model platform for robotics
- Wayve / Tesla FSD: internal world simulation for self-driving
World models are widely seen as a critical milestone toward AGI; the argument is that "truly understanding" the world requires being able to simulate it.
A chess master's mind runs "if I make this move, opponent likely plays this, then I play that…" — an internal simulation. World models apply that to the real world: "if I open this door, what will I see? if it rains, how does the ground behave? if I throw the ball, what trajectory?"
OpenAI Sora (2024): given the prompt "rainy night in Tokyo, woman walking on a reflective street," Sora produces a 60-second video. What matters isn't the visual quality but the consistency: her outfit stays the same every second, the street reflection shifts with the lighting, raindrops splash on impact. That isn't a physics engine; it's an internal world simulation the model learned.
Google Genie 2 (late 2024): from a single user-supplied image, Genie infers what a playable 3D world behind that image would look like and generates an interactive environment. A first: an interactive simulation built from one image that stays consistent as you play.
Tesla FSD: before each action, an internal world model predicts "what happens 5 seconds from now?" Without that prediction, defensive driving is impossible.
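The "predict before acting" loop in the chess and driving examples can be sketched as model-predictive control: imagine several candidate action sequences in the model, score their outcomes, and execute only the first action of the best plan. Everything below is an illustrative toy (hand-written dynamics, a made-up cost), not any vendor's actual stack.

```python
import itertools

# Toy 1D world: state = (position, velocity); action = acceleration.
def predict(state, action, dt=1.0):
    pos, vel = state
    vel = vel + action * dt
    return (pos + vel * dt, vel)

def rollout_cost(state, actions, target=10.0):
    # Imagine the future under this action sequence; cost = distance to target.
    for a in actions:
        state = predict(state, a)
    return abs(state[0] - target)

def plan(state, horizon=3, choices=(-1.0, 0.0, 1.0)):
    # Enumerate candidate futures, keep the plan whose imagined outcome is best.
    best = min(itertools.product(choices, repeat=horizon),
               key=lambda seq: rollout_cost(state, seq))
    return best[0]  # execute only the first action, then replan next step

print(plan((0.0, 0.0)))
```

Real planners replace the exhaustive enumeration with sampling or gradient-based search, and the hand-written `predict` with a learned dynamics model, but the simulate-score-act loop is the same.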
Where it fits:
- Video generation (consistent objects, physics): Sora, Runway, Veo
- Robot training: simulate before risking physical hardware
- Autonomous vehicles: future prediction is mandatory
- Game generation (Genie-style procedural worlds)
- RL (reinforcement learning) environments

Where it doesn't (yet):
- Today's production AI products: the APIs aren't mature yet
- Single-frame image generation (diffusion is enough)
- Pure text tasks
- Tight budgets: world models are massively expensive to train
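For the robotics and RL use cases above, the underlying recipe is: collect real transitions, fit a dynamics model to them, then generate cheap imagined experience from the model instead of the real robot. A minimal sketch using a linear model fit by least squares (real systems use large neural networks, but the pipeline is the same; all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) dynamics: s' = A s + B a, observed with a little noise.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

# 1. Collect transitions (s, a, s') from the real environment.
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A_true.T + U @ B_true.T + rng.normal(scale=1e-3, size=(500, 2))

# 2. Fit the dynamics model: least-squares solve for the stacked [A | B].
X = np.hstack([S, U])                       # inputs: state + action
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

# 3. Use the learned model as a cheap simulator for imagined rollouts.
def step(state, action):
    return np.concatenate([state, action]) @ W

s = np.array([1.0, 0.0])
for _ in range(5):                          # five imagined steps, no real robot
    s = step(s, np.array([0.5]))
print(np.round(s, 2))
```

The hallucinated-physics caveat below is visible even here: the learned `W` only approximates the true dynamics, and imagined rollouts drift from reality as the horizon grows.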
Hallucinated physics
World models "learn" physics but don't model it perfectly: in Sora videos, objects sometimes vanish and hands deform. It's a statistical approximation, not a physics engine.
Overhyped AGI claims
World model ≠ AGI: an important step, but not sufficient on its own. The industry stretches the term for marketing; read such claims critically.
Compute is brutally expensive
Training a model on the scale of Sora is estimated to cost $100M+, and inference is pricey too: every second of video has to be generated, not merely served. Consumer-grade world models are still far off.