Method
The physical world evolves continuously under fundamental laws.
For intelligent agents, merely processing pixels is insufficient; true understanding requires the ability to predict how the world state unfolds over time.
To achieve this, a world model should be built upon three pillars:
1. Physical Constancy: Laws of physics—such as gravity, collisions, and inertia—must be applied consistently so that every interaction follows authentic dynamics.
2. Spatial Persistence: Objects and environments must exist stably in 3D space; their physical properties and positions should remain consistent even when they temporarily leave the field of view.
3. Temporal Causality: The world state must evolve according to causal logic rather than mere visual continuity.
Existing generative models simulate pixels rather than persistent worlds, which leads to three fundamental limitations: physical inconsistency (e.g., interpenetration or unsupported floating objects), spatial fragility (objects outside the field of view become unstable or are lost entirely), and temporal drift (world states degrade over long sequences).
To overcome these limitations, we introduce State-Anchored World Modeling, which represents the world as a viewpoint-independent Local World State anchored to a reference video. Rather than generating each frame independently, the model maintains and evolves the full world state over time.
This is achieved through three novel components:
World State Anchoring — constructs a persistent world state to ensure spatial persistence and physical constancy.
Spatiotemporal Autoregression — performs precise spatiotemporal sampling conditioned on the reference video, enabling free navigation across viewpoints and temporal progression.
Joint Distribution Matching Distillation — learns a joint distribution that balances real-world fidelity with synthetic controllability, enabling stable generalization under user interaction.
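The core idea—maintaining one persistent, viewpoint-independent state and conditioning every generated frame on it rather than on previous pixels—can be sketched in miniature. This is an illustrative toy only: all names (`WorldState`, `anchor`, `evolve`, `render`) are assumptions of this sketch, not the model's actual API, and the real system operates on learned latents rather than explicit object dictionaries.

```python
# Toy sketch of a state-anchored autoregressive rollout (illustrative names,
# not the paper's implementation). The persistent state, not the last frame,
# carries the world forward, so off-screen objects survive and re-enter view.

from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Viewpoint-independent local world state: object positions keyed by id."""
    objects: dict = field(default_factory=dict)  # id -> (x, y, z)
    time: int = 0

def anchor(reference_video: list) -> WorldState:
    """Construct the persistent state from a reference video.
    Stub: read object positions from the first frame's annotations."""
    state = WorldState()
    for obj_id, pos in reference_video[0].items():
        state.objects[obj_id] = pos
    return state

def evolve(state: WorldState, action: dict) -> WorldState:
    """Advance the state causally: apply per-object displacements to every
    object, including those currently outside the field of view."""
    new_objects = {
        obj_id: tuple(p + action.get(obj_id, (0.0, 0.0, 0.0))[i]
                      for i, p in enumerate(pos))
        for obj_id, pos in state.objects.items()
    }
    return WorldState(objects=new_objects, time=state.time + 1)

def render(state: WorldState, viewpoint: tuple) -> list:
    """Sample one observation from the shared state for one viewpoint.
    Objects behind the camera (non-positive relative depth) are culled from
    the frame but persist in the state."""
    cx, cy, cz = viewpoint
    return sorted(obj_id for obj_id, (x, y, z) in state.objects.items()
                  if z - cz > 0)

# Autoregressive rollout: each frame is conditioned on the evolving state.
reference = [{"cube": (0.0, 0.0, 5.0), "ball": (1.0, 0.0, -2.0)}]
state = anchor(reference)
frames = []
for step in range(3):
    frames.append(render(state, viewpoint=(0.0, 0.0, 0.0)))
    state = evolve(state, action={"ball": (0.0, 0.0, 4.0)})
```

In this toy rollout the ball starts behind the camera, so it is absent from the first frame yet persists in the state and reappears once its motion carries it forward—the spatial-persistence behavior that frame-to-frame pixel conditioning cannot guarantee.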