Method
The physical world evolves continuously under fundamental laws.
For intelligent agents, merely processing pixels is insufficient; true understanding requires the ability to predict how the world state unfolds over time.
To achieve this, a world model should be built upon three pillars:
1. Physical Constancy: Laws of physics—such as gravity, collisions, and inertia—must be applied consistently so that every interaction follows authentic dynamics.
2. Spatial Persistence: Objects and environments must exist stably in 3D space; their physical properties and positions should remain consistent even when they temporarily leave the field of view.
3. Temporal Causality: The world state must evolve according to causal logic rather than mere visual continuity.
Existing generative models simulate pixels rather than persistent worlds, which leads to three fundamental limitations: physical inconsistency (e.g., interpenetration or unsupported floating objects), spatial fragility (objects outside the field of view become unstable or are lost entirely), and temporal drift (world states degrade over long sequences).
To overcome these limitations, we introduce State-Anchored World Modeling, which represents the world as a viewpoint-independent Local World State anchored to a reference video. Rather than generating each frame independently, the model maintains and evolves the full world state over time.
This is achieved through three novel components:
World State Anchoring — constructs a persistent world state to ensure spatial persistence and physical constancy.
Spatiotemporal Autoregression — performs precise spatiotemporal sampling conditioned on the reference video, enabling free navigation across viewpoints and temporal progression.
Joint Distribution Matching Distillation — learns a joint distribution that balances real-world fidelity with synthetic controllability, enabling stable generalization under user interaction.
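The core idea—maintaining one persistent, viewpoint-independent state and conditioning every generated frame on it rather than on previous pixels—can be sketched in miniature. This is an illustrative toy only: all names (`WorldState`, `anchor`, `evolve`, `render`) are assumptions of this sketch, not the model's actual API, and the real system operates on learned latents rather than explicit object dictionaries.

```python
# Toy sketch of a state-anchored autoregressive rollout (illustrative names,
# not the paper's implementation). The persistent state, not the last frame,
# carries the world forward, so off-screen objects survive and re-enter view.

from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Viewpoint-independent local world state: object positions keyed by id."""
    objects: dict = field(default_factory=dict)  # id -> (x, y, z)
    time: int = 0

def anchor(reference_video: list) -> WorldState:
    """Construct the persistent state from a reference video.
    Stub: read object positions from the first frame's annotations."""
    state = WorldState()
    for obj_id, pos in reference_video[0].items():
        state.objects[obj_id] = pos
    return state

def evolve(state: WorldState, action: dict) -> WorldState:
    """Advance the state causally: apply per-object displacements to every
    object, including those currently outside the field of view."""
    new_objects = {
        obj_id: tuple(p + action.get(obj_id, (0.0, 0.0, 0.0))[i]
                      for i, p in enumerate(pos))
        for obj_id, pos in state.objects.items()
    }
    return WorldState(objects=new_objects, time=state.time + 1)

def render(state: WorldState, viewpoint: tuple) -> list:
    """Sample one observation from the shared state for one viewpoint.
    Objects behind the camera (non-positive relative depth) are culled from
    the frame but persist in the state."""
    cx, cy, cz = viewpoint
    return sorted(obj_id for obj_id, (x, y, z) in state.objects.items()
                  if z - cz > 0)

# Autoregressive rollout: each frame is conditioned on the evolving state.
reference = [{"cube": (0.0, 0.0, 5.0), "ball": (1.0, 0.0, -2.0)}]
state = anchor(reference)
frames = []
for step in range(3):
    frames.append(render(state, viewpoint=(0.0, 0.0, 0.0)))
    state = evolve(state, action={"ball": (0.0, 0.0, 4.0)})
```

In this toy rollout the ball starts behind the camera, so it is absent from the first frame yet persists in the state and reappears once its motion carries it forward—the spatial-persistence behavior that frame-to-frame pixel conditioning cannot guarantee.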