World Model
An internal simulator inside the model
A model that builds an internal simulation of the real world — predicting physics, action consequences, and how objects move.
A classical model works "input → output." A world model goes one step further: inside the model lives a simulation of the world — an internal model of physics, causality, and temporal dynamics.
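That difference can be made concrete with a toy sketch. `ToyWorldModel` and its additive "physics" are invented for illustration; real world models learn their dynamics from data, but the shape of the idea is the same: a state, a transition function, and rollouts imagined without touching the real world.

```python
import numpy as np

# Classical model: a single input -> output mapping, no notion of time.
def classifier(x):
    return "cat" if x.mean() > 0.5 else "dog"

# World model: maintains a state and predicts how it evolves under actions,
# so futures can be simulated before acting.
# (Toy example: state is a 2D position, an action is a displacement.)
class ToyWorldModel:
    def __init__(self, state):
        self.state = np.asarray(state, dtype=float)

    def predict_next(self, state, action):
        # A learned dynamics network would go here; we use additive "physics".
        return state + np.asarray(action, dtype=float)

    def imagine(self, actions):
        # Roll the model forward internally; the real world is never consulted.
        s = self.state.copy()
        trajectory = [s.copy()]
        for a in actions:
            s = self.predict_next(s, a)
            trajectory.append(s.copy())
        return trajectory

model = ToyWorldModel(state=[0.0, 0.0])
traj = model.imagine([[1, 0], [1, 0], [0, 1]])
print(traj[-1])  # final imagined position
```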
Practically: an AI that can predict where a thrown ball lands, what's behind a door if you open it, what will happen if an agent presses a button. Beyond language — reasoning in the physical world.
Leading examples (2024-2025):
- OpenAI Sora: text-to-video, but not just a video generator; a world model underneath (object consistency, physics)
- Google Genie 2: generates an interactive 3D game world from a single image
- DeepMind DreamerV3: builds world models for RL in game environments
- NVIDIA Cosmos: physical world-model platform for robotics
- Wayve / Tesla FSD: internal world simulation for self-driving
World models are widely seen as a critical milestone toward AGI; the argument is that "truly understanding" the world requires being able to simulate it.
A chess master's mind runs "if I make this move, opponent likely plays this, then I play that…" — an internal simulation. World models apply that to the real world: "if I open this door, what will I see? if it rains, how does the ground behave? if I throw the ball, what trajectory?"
OpenAI Sora (2024): given the prompt "rainy night in Tokyo, woman walking on a reflective street," Sora produces a 60-second video. What matters isn't the visual quality but the consistency: her outfit stays the same every second, the street reflection shifts with the lighting, raindrops splash on impact. That isn't a physics engine; it's an internal world simulation the model learned.
Google Genie 2 (late 2024): from a single user-supplied image, Genie infers what a playable 3D world behind that image would look like and generates an interactive environment. A first: an interactive simulation built from one image that stays consistent as you play.
Tesla FSD: before each action, an internal world model predicts "what happens 5 seconds from now?" Without that prediction, defensive driving is impossible.
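The "predict before acting" loop in the chess and driving examples can be sketched as model-predictive control: imagine several candidate action sequences in the model, score their outcomes, and execute only the first action of the best plan. Everything below is an illustrative toy (hand-written dynamics, a made-up cost), not any vendor's actual stack.

```python
import itertools

# Toy 1D world: state = (position, velocity); action = acceleration.
def predict(state, action, dt=1.0):
    pos, vel = state
    vel = vel + action * dt
    return (pos + vel * dt, vel)

def rollout_cost(state, actions, target=10.0):
    # Imagine the future under this action sequence; cost = distance to target.
    for a in actions:
        state = predict(state, a)
    return abs(state[0] - target)

def plan(state, horizon=3, choices=(-1.0, 0.0, 1.0)):
    # Enumerate candidate futures, keep the plan whose imagined outcome is best.
    best = min(itertools.product(choices, repeat=horizon),
               key=lambda seq: rollout_cost(state, seq))
    return best[0]  # execute only the first action, then replan next step

print(plan((0.0, 0.0)))
```

Real planners replace the exhaustive enumeration with sampling or gradient-based search, and the hand-written `predict` with a learned dynamics model, but the simulate-score-act loop is the same.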
Where it fits:
- Video generation (consistent objects, physics): Sora, Runway, Veo
- Robot training: simulate before risking physical hardware
- Autonomous vehicles: future prediction is mandatory
- Game generation (Genie-style procedural worlds)
- RL (reinforcement learning) environments

Where it doesn't (yet):
- Today's production AI products: the APIs aren't mature yet
- Single-frame image generation (diffusion is enough)
- Pure text tasks
- Tight budgets: world models are massively expensive to train
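For the robotics and RL use cases above, the underlying recipe is: collect real transitions, fit a dynamics model to them, then generate cheap imagined experience from the model instead of the real robot. A minimal sketch using a linear model fit by least squares (real systems use large neural networks, but the pipeline is the same; all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) dynamics: s' = A s + B a, observed with a little noise.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

# 1. Collect transitions (s, a, s') from the real environment.
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A_true.T + U @ B_true.T + rng.normal(scale=1e-3, size=(500, 2))

# 2. Fit the dynamics model: least-squares solve for the stacked [A | B].
X = np.hstack([S, U])                       # inputs: state + action
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

# 3. Use the learned model as a cheap simulator for imagined rollouts.
def step(state, action):
    return np.concatenate([state, action]) @ W

s = np.array([1.0, 0.0])
for _ in range(5):                          # five imagined steps, no real robot
    s = step(s, np.array([0.5]))
print(np.round(s, 2))
```

The hallucinated-physics caveat below is visible even here: the learned `W` only approximates the true dynamics, and imagined rollouts drift from reality as the horizon grows.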
Hallucinated physics
World models "learn" physics but don't model it perfectly: in Sora videos, objects sometimes vanish and hands deform. It's a statistical approximation, not a physics engine.
Overhyped AGI claims
World model ≠ AGI: an important step, but not sufficient on its own. The industry stretches the term for marketing; read such claims critically.
Compute is brutally expensive
Training a model on the scale of Sora is estimated to cost $100M+, and inference is pricey too: every second of video has to be generated, not merely served. Consumer-grade world models are still far off.