Abstract
Guiding visually impaired (VI) individuals along planned paths is essential for enabling independent long-distance mobility. Current reactive approaches only correct deviations after they occur and ignore VI users' walking dynamics (e.g., reaction latency and heading drift), resulting in frequent interventions that increase cognitive load, reduce walking efficiency, and may lead to missed turns. To address these limitations, we propose a predictive path-following approach built on a walking world model that enables proactive guidance through vibrotactile commands. Specifically, the walking world model predicts a user's future state after a given command is issued. To mitigate the cost of collecting action-annotated walking data, we exploit unannotated free-walking data to improve model generalization: the model first undergoes self-supervised pre-training on a large unannotated dataset to learn general gait patterns, and is then fine-tuned on action-labeled data to model how users respond to guidance commands. Integrated with model predictive control (MPC) that explicitly accounts for the user's cognitive load, our method proactively optimizes instructions to minimize deviation, ensure safety, and reduce cognitive load. Experiments show significant improvements in walking speed and cognitive load over reactive baselines.
Method
Training paradigm of the Walking-WM.
Pre-training from action-free walking data. Given an action-free walking sequence $s_1, s_2, \ldots, s_T$, a single-layer bidirectional GRU $g$ serves as an action extractor that infers latent actions between adjacent states: $\{a^*_1, \ldots, a^*_{T-1}\} = g(s_1, \ldots, s_T)$. These latent actions drive a recurrent state-space world model composed of a recurrent model, a latent transition model, a state decoder, and a representation model: $h_{t+1} = r(h_t, z_t, a^*_t)$, $\hat{z}_{t+1} = h(h_{t+1})$, $\hat{s}_{t+1} = d(\hat{z}_{t+1})$, $z_{t+1} = q(h_{t+1}, s_{t+1})$. All components are trained jointly with a reconstruction and consistency loss $\mathcal{L}_{\text{pre-train}} = \|\hat{z}_t - z_t\|^2 + \alpha\|s_t - d(z_t)\|^2$. Here, the first term encourages consistent latent dynamics, and the second term encourages accurate reconstruction of observable states.
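To make this concrete, below is a minimal PyTorch sketch of one pre-training step. The module sizes, the names (`ActionFreeWorldModel`, `S`, `Z`, `A`, `HID`), and the choice of single linear layers for the transition, decoder, and representation models are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

S, Z, A, HID = 6, 32, 16, 64  # assumed state / latent / action / hidden sizes

class ActionFreeWorldModel(nn.Module):
    """Pre-training sketch: latent actions from g drive the recurrent world model."""

    def __init__(self):
        super().__init__()
        # g: single-layer bidirectional GRU inferring latent actions between states
        self.extractor = nn.GRU(S, A // 2, bidirectional=True, batch_first=True)
        self.recurrent = nn.GRUCell(Z + A, HID)   # r(h_t, z_t, a*_t) -> h_{t+1}
        self.transition = nn.Linear(HID, Z)       # latent transition: h_{t+1} -> ẑ_{t+1}
        self.decoder = nn.Linear(Z, S)            # d: latent -> observable state
        self.repr = nn.Linear(HID + S, Z)         # q: posterior latent from (h, s)

    def pretrain_loss(self, states, alpha=1.0):   # states: (B, T, S)
        latent_actions, _ = self.extractor(states)  # (B, T, A); first T-1 steps used
        B, T, _ = states.shape
        h = states.new_zeros(B, HID)
        z = self.repr(torch.cat([h, states[:, 0]], dim=-1))
        loss = 0.0
        for t in range(T - 1):
            h = self.recurrent(torch.cat([z, latent_actions[:, t]], dim=-1), h)
            z_prior = self.transition(h)                             # ẑ_{t+1}
            z = self.repr(torch.cat([h, states[:, t + 1]], dim=-1))  # z_{t+1}
            s_hat = self.decoder(z)                                  # d(z_{t+1})
            # L = ‖ẑ − z‖² + α‖s − d(z)‖², accumulated over the sequence
            loss = loss + ((z_prior - z) ** 2).sum(-1).mean() \
                        + alpha * ((states[:, t + 1] - s_hat) ** 2).sum(-1).mean()
        return loss / (T - 1)
```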
Fine-tuning with action-annotated walking data. After pre-training, the world model is fine-tuned on trajectories with explicit guidance commands $a_t$. To reuse the latent dynamics learned from free walking, an action adapter $f$ (a 2-layer MLP) maps each state–action pair into the same latent action space: $\tilde{a}_t = f(s_t, a_t)$. The world model is then updated as $h_{t+1} = r(h_t, z_t, \tilde{a}_t)$, $\hat{z}_{t+1} = h(h_{t+1})$, $\hat{s}_{t+1} = d(\hat{z}_{t+1})$, $z_{t+1} = q(h_{t+1}, s_{t+1})$. The fine-tuning objective is $\mathcal{L}_{\text{fine-tune}} = \|\hat{z}_t - z_t\|^2 + \alpha\|s_t - d(z_t)\|^2 + \lambda\|w\|^2$, where $\|w\|$ is the L2 norm of the model parameters. In other words, during pre-training the latent actions come from the extractor $g$, while during fine-tuning the implicit actions are obtained by transforming explicit commands through the adapter $f$.
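A matching sketch of the fine-tuning stage, reusing the modules defined above; the `ActionAdapter` hidden width and the hyperparameter defaults (`alpha`, `lam`) are assumptions for illustration.

```python
class ActionAdapter(nn.Module):
    """f: (s_t, a_t) -> latent action ã_t, implemented as a 2-layer MLP."""

    def __init__(self, cmd_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(S + cmd_dim, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, A))

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1))

def finetune_loss(wm, adapter, states, commands, alpha=1.0, lam=1e-4):
    """states: (B, T, S); commands: (B, T-1, 1) with values in {-1, 0, 1}."""
    B, T, _ = states.shape
    h = states.new_zeros(B, HID)
    z = wm.repr(torch.cat([h, states[:, 0]], dim=-1))
    loss = 0.0
    for t in range(T - 1):
        a_tilde = adapter(states[:, t], commands[:, t])   # ã_t = f(s_t, a_t)
        h = wm.recurrent(torch.cat([z, a_tilde], dim=-1), h)
        z_prior = wm.transition(h)
        z = wm.repr(torch.cat([h, states[:, t + 1]], dim=-1))
        s_hat = wm.decoder(z)
        loss = loss + ((z_prior - z) ** 2).sum(-1).mean() \
                    + alpha * ((states[:, t + 1] - s_hat) ** 2).sum(-1).mean()
    # λ‖w‖²: L2 regularization over all trainable parameters
    l2 = sum((p ** 2).sum() for p in list(wm.parameters()) + list(adapter.parameters()))
    return loss / (T - 1) + lam * l2
```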
Predictive path-following with a learned walking world model. Left: an MPC loop samples command sequences, scores them with the world model under a multi-objective reward with safety constraints, closes the loop with the user's current state, and executes the first action of the best sequence. Right: the world model maintains a recurrent latent state to encode history, advances the latent via a transition model, and decodes it to predict the next user state.
World model inside the MPC loop. After fine-tuning, the learned world model $\hat{p}$ approximates the unknown user dynamics $p$. At time $t$, for a candidate sequence of discrete guidance commands $\{a_t, \ldots, a_{t+H-1}\}$, the controller rolls out future states $\hat{s}_{t+1}, \ldots, \hat{s}_{t+H}$ recursively: $\hat{s}_{t+1} = \hat{p}(s_t, a_t)$ and $\hat{s}_{t+k} = \hat{p}(\hat{s}_{t+k-1}, a_{t+k-1})$ for $k = 2, \ldots, H$. The corresponding 3D positions $p_{t+k}$ are obtained by projecting the predicted states with a fixed matrix $T_p$.
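A minimal sketch of this recursive rollout, under the assumption that the fine-tuned adapter, latent update, and decoder are wrapped into a single one-step predictor `p_hat`; all names here are illustrative.

```python
import numpy as np

def rollout(p_hat, s_t, commands):
    """Recursively roll out ŝ_{t+1}, ..., ŝ_{t+H} for one candidate command sequence.

    p_hat: one-step predictor p̂(s, a) -> ŝ wrapping the adapter, recurrent
    latent update, and decoder; commands: length-H sequence from {-1, 0, 1}.
    """
    preds, s = [], s_t
    for a in commands:                 # ŝ_{t+k} = p̂(ŝ_{t+k-1}, a_{t+k-1})
        s = p_hat(s, a)
        preds.append(s)
    return preds

def project_positions(preds, T_p):
    """p_{t+k} = T_p ŝ_{t+k}: project predicted states to 3D positions."""
    return [T_p @ s for s in preds]
```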
MPC objective and safety constraints.
The planner selects the command sequence $\{a_t, \ldots, a_{t+H-1}\}$ that maximizes a cumulative reward $R$ balancing path tracking, command frequency, goal reaching, and safety:

$$
R = \sum_{k=1}^{H-1} \Big( -\big\|p_{t+k} - c(p_{t+k}, P^*)\big\|^2 - \omega_1 \|a_{t+k}\|^2 \Big) - \omega_2 \big\|p_{t+H} - p^*_n\big\|^2 - \omega_3 \sum_{k=1}^{H} \frac{1}{e(p_{t+k}, \Omega)}.
$$

Here, $c(\cdot, P^*)$ returns the closest point on the reference path, $e(\cdot, \Omega)$ is the distance to the nearest obstacle in the obstacle set $\Omega$, and $\omega_1, \omega_2, \omega_3$ are weighting coefficients; the inverse-distance term penalizes trajectories that pass close to obstacles. The constraints enforce that the predicted motion is generated by the world model, that trajectories remain collision free, and that commands lie in the discrete set $\{-1, 0, 1\}$, corresponding to "turn left", "go straight", and "turn right". At each time step, only the first action $a^*_t$ of the optimized sequence is sent as the guidance command, and the optimization is repeated in a receding-horizon manner.
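The sketch below shows one receding-horizon step under stated assumptions: candidates are drawn by random shooting over the discrete command set (the actual optimizer may differ), `rollout` and `project_positions` come from the sketch above, `closest_point` and `obstacle_dist` are placeholder geometry helpers standing in for $c(\cdot, P^*)$ and $e(\cdot, \Omega)$, and the weights are arbitrary defaults.

```python
def reward(commands, pos, closest_point, obstacle_dist, P_star, goal,
           w1=0.1, w2=1.0, w3=0.05, eps=1e-6):
    """Cumulative reward R; pos[k-1] = p_{t+k}, commands[k] = a_{t+k}."""
    H = len(pos)
    R = 0.0
    for k in range(1, H):                                  # k = 1 .. H-1
        dev = pos[k - 1] - closest_point(pos[k - 1], P_star)
        R -= float(dev @ dev)                              # path tracking
        R -= w1 * float(commands[k] ** 2)                  # command frequency
    gerr = pos[-1] - goal
    R -= w2 * float(gerr @ gerr)                           # goal reaching
    R -= w3 * sum(1.0 / max(obstacle_dist(p), eps) for p in pos)  # safety
    return R

def plan_step(s_t, p_hat, T_p, closest_point, obstacle_dist, P_star, goal,
              H=5, n_samples=256, rng=None):
    """One receding-horizon step: sample, score, keep feasible, execute a*_t."""
    rng = rng or np.random.default_rng()
    best_cmds, best_R = None, -np.inf
    for _ in range(n_samples):
        cmds = rng.choice([-1, 0, 1], size=H)              # discrete command set
        pos = project_positions(rollout(p_hat, s_t, cmds), T_p)
        if any(obstacle_dist(p) <= 0.0 for p in pos):      # collision constraint
            continue
        R = reward(cmds, pos, closest_point, obstacle_dist, P_star, goal)
        if R > best_R:
            best_cmds, best_R = cmds, R
    return best_cmds[0] if best_cmds is not None else 0    # default: go straight
```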
Indoor Test
Indoor path-following experiments with trajectories, speed profiles, and quantitative metrics.
(a) Visualization of V3's walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(b) Visualization of V3's walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(c) Visualization of V1's walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(d) Visualization of V1's walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(e) Visualization of M1's walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(f) Visualization of M1's walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking-WM.
Indoor Scenario Results
| Method | Walking Time (s) ↓ | | Velocity (m/s) ↑ | | Travel Length (m) | |
|---|---|---|---|---|---|---|
| | EM | VI | EM | VI | EM | VI |
| Stanley | 59.7 | 58.0 | 0.53 | 0.51 | 29.2 | 28.5 |
| Pure Pursuit | 63.4 | 56.6 | 0.51 | 0.50 | 30.1 | 27.8 |
| Walking-WM (ours) | **49.3** | **45.1** | **0.65** | **0.66** | 31.2 | 29.5 |
Table 1. Comparison of path-following approaches on the static indoor map. Bold indicates the best performance for each metric. "EM" denotes the eye-masked scenario, and "VI" denotes the visually impaired participants.
| Method | COV of walking speed (mean ± std) ↓ |
|---|---|
| Pure Pursuit | 0.28 ± 0.17 |
| Stanley | 0.31 ± 0.19 |
| Walking-WM (ours) | **0.19 ± 0.10** |
Table 2. Coefficient of variation (COV) of walking speed across indoor trials. Lower COV indicates more stable walking speed.
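For reference, the COV of a trial is simply the standard deviation of its walking-speed profile divided by the mean; a minimal computation (the function name and input format are assumptions):

```python
import numpy as np

def cov_of_speed(speeds):
    """Coefficient of variation of one trial's speed profile: std / mean."""
    speeds = np.asarray(speeds, dtype=float)
    return float(speeds.std() / speeds.mean())

# e.g., cov_of_speed([0.61, 0.58, 0.66, 0.63]) -> ~0.05
```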
Outdoor Test
Dynamic real-world path-following experiments on two daily-life routes, with objective performance and subjective evaluations.
(a) Layouts of Route 1 (240 m) and Route 2 (300 m), with key street views marked along each path.
Outdoor Scenario Results
| Method | Velocity (m/s) ↑ | | Collisions (/trial) ↓ | |
|---|---|---|---|---|
| | EM | VI | EM | VI |
| Cane+App | 0.62 / 0.62 | 0.74 / 0.71 | 1.00 / 1.50 | 0.75 / 0.75 |
| System-PP | 0.53 / 0.55 | 0.58 / 0.54 | 0.50 / 0.75 | 0.25 / 0.75 |
| System-WM (ours) | 0.67 / 0.70 | 0.74 / 0.73 | 0.25 / 0.25 | 0.25 / 0.50 |
Table 3. Walking performance on two outdoor routes. Each cell reports Route 1 / Route 2. Higher velocity and fewer collisions are better. "EM" denotes eye-masked participants, and "VI" denotes visually impaired participants.
Subjective Evaluation (Likert Scale)
(b) Subjective ratings of safety, cognitive ease, and helpfulness for Pure Pursuit and Walking-WM, using a 7-point Likert scale (7 = very safe / very easy / very helpful). Bars represent mean ratings with standard error.
SWORD Workload Comparison
| Participant | V5 | V6 | V7 | V8 | M5 | M6 | M7 | M8 |
|---|---|---|---|---|---|---|---|---|
| Dominance score (Walking-WM vs. Pure Pursuit) | −1 | −1 | 0 | −2 | −1 | −2 | −2 | −1 |
Table 4. SWORD dominance scores of perceived workload when comparing Walking-WM to Pure Pursuit. Negative scores indicate that Walking-WM is perceived as less demanding; positive scores indicate it is perceived as more demanding.
Open-Loop Ablation Study
Ablation on the Walking world model's training, highlighting the effect of the learning-based approach and the use of a large amount of unannotated data.
Comparison between different model configurations in open-loop experiments
| Model | Pre-training | Prediction ADE (m) ↓ | Prediction FDE (m) ↓ |
|---|---|---|---|
| Hybrid model | No | 0.91 ± 0.10 | 1.69 ± 0.21 |
| World model w/o pre-train | No | 0.44 ± 0.04 | 0.82 ± 0.09 |
| World model w/ pre-train (ours) | Yes | 0.39 ± 0.03 | 0.73 ± 0.07 |
Table 5. Ablation results on different model variants. The "Hybrid model" is a hand-crafted dynamics model adapted from the bicycle model with discrete "turn left / go straight / turn right" commands, assuming constant speed, a fixed steering angle, and a fixed turning duration; "World model w/o pre-train" trains the world model only on action-annotated data; "World model w/ pre-train" first pre-trains on large-scale free-walking data and is then fine-tuned on user-specific action-annotated data.
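To make the hybrid baseline concrete, below is a minimal sketch of one bicycle-model step under the stated assumptions; the speed, steering-angle, wheelbase, and turning-duration constants are placeholders, not the values used in our experiments.

```python
import numpy as np

def hybrid_step(state, command, v=0.6, delta=np.deg2rad(20.0),
                L=0.8, T=1.0, dt=0.1):
    """One-command step of the bicycle-model baseline.

    state = (x, y, theta); command in {-1, 0, 1} maps to a fixed steering
    angle of -delta / 0 / +delta, applied for a fixed duration T at
    constant speed v. All constants here are illustrative placeholders.
    """
    x, y, theta = state
    steer = command * delta
    for _ in range(int(T / dt)):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += (v / L) * np.tan(steer) * dt   # bicycle-model yaw rate
    return np.array([x, y, theta])
```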
What does the world model add to planning compared to the basic model (hybrid model)?
We evaluated the above hybrid model on our collected action-annotated walking data; the results are shown in Table 5. They demonstrate that the hybrid model produces high prediction errors and performs worse than the world model. To investigate what the world model learns beyond this baseline, we provide visual comparisons of representative prediction outcomes in the figures below, with detailed analysis:
(a) Periodic variations in lateral position during "go straight" motion
(b) Steering compensation due to extended reaction time
(c) Inconsistent steering angle within a short period
Compared with mechanical systems such as cars, human walking exhibits several unique characteristics, as illustrated above. When walking straight, a person's center of mass oscillates periodically; during turning, reaction delays mean a person may keep turning even after a "go straight" command has been issued; and within a very short time window, identical commands can still produce different responses. Because the world model is learned from data, it captures these characteristics of human walking, whereas the hand-crafted basic model cannot, resulting in larger prediction errors.
Does extending the planning horizon yield better results for the basic model?
(d) Map for the additional static-map experiment, with a user's walking trajectory under the basic hybrid model with a planning horizon of 10 and under Walking-WM with a planning horizon of 5.
| Method | Walking time (s) ↓ | Command frequency (/m) ↓ |
|---|---|---|
| Pure Pursuit | 84.5 | 0.62 |
| Hybrid model (H=5) | 72.1 | 0.52 |
| Hybrid model (H=10) | 99.1 | 0.71 |
| Walking-WM (H=5, ours) | 66.3 | 0.46 |
Table 6. Results of the additional static-map experiment.
To answer this question, we conducted a new set of static-map experiments. We deployed the map shown in Figure (d), which is more complex than the original static map and contains more obstacles. In this experiment, the hybrid model with a planning horizon of 10 resulted in the longest average completion time and the highest command frequency, whereas the world model achieved the shortest time and the lowest frequency. The poorer performance of the hybrid model with the longer horizon is mainly due to large prediction errors, which caused substantial deviations between the user's actual trajectory and the system's predictions, often bringing the user dangerously close to obstacles in this complex map. To guide the user back to safety, the system then had to issue more turning commands for obstacle avoidance.
Real-world Walking Test Video
Demonstration videos of our Walking world model in real-world walking tests.
BibTeX
@article{ju2025walking,
title = {Walking World Model for Visually Impaired Path Following},
author = {Ju, Haokun and Zhang, Lixuan and Cao, Xiangyu and Kan, Meina and Shan, Shiguang and Chen, Xilin},
journal = {IEEE Robotics and Automation Letters},
year = {2025},
volume = {X},
number = {Y},
pages = {1--8},
note = {To appear},
url = {https://haokunju.github.io/walking-world-model}
}