arxiv:2606.00793

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Published on Jun 8 · Submitted by Shengjun Zhang on Jun 15

Abstract

A new benchmark called MBench is introduced to evaluate the memory capabilities of video world models, focusing on entity, environment, and causal consistency over extended temporal horizons.

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos and is evaluated with rule-based quantitative metrics and a vision-language model (VLM) to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

Community

Paper submitter

MBench is a benchmark for the memory capability of long-video world models. Most existing benchmarks reward single-frame quality or short-horizon prompt following. MBench targets a harder question: when a subject leaves the frame and returns, when the camera departs from a viewpoint and comes back, or when an off-screen physical process keeps evolving, can the model maintain a consistent world state? We decompose this into three orthogonal capability axes (Entity, Environment, and Causal Consistency) and evaluate them under two complementary settings: MBench-A (action-conditioned), for action-conditioned world models, and MBench-T (text-segment-conditioned), for long-video text continuation models.
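
To make the "leave and return" question concrete, here is a minimal sketch of one way a revisit-consistency score could be computed. It is an illustration only and is not MBench's actual metric: the function names, the use of per-frame feature vectors, and the choice of cosine similarity are all assumptions made for this example. The idea is to compare the features of a frame captured before the subject or viewpoint departs with the features of the frame where it returns.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def revisit_consistency(
    features: Sequence[Sequence[float]],
    revisit_pairs: List[Tuple[int, int]],
) -> float:
    """Hypothetical score: mean similarity over (before_departure, after_return) frame pairs.

    `features[i]` is an assumed per-frame feature vector (for example, an
    embedding of the tracked subject or of the scene at frame i).
    """
    if not revisit_pairs:
        raise ValueError("at least one revisit pair is required")
    sims = [cosine_similarity(features[i], features[j]) for i, j in revisit_pairs]
    return sum(sims) / len(sims)


# Toy example: frame 2 returns to the state seen at frame 0.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(revisit_consistency(toy_features, [(0, 2)]))
```

In this toy setup a score near 1 means the returned-to state matches what was seen before departure, while a score near 0 means the model has drifted. A real evaluation would need a feature extractor and annotated departure and return frames, neither of which is specified here.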
