Walking World Model for Visually Impaired Path Following

State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
IEEE Robotics and Automation Letters (RA-L), 2025

Abstract

Guiding visually impaired (VI) individuals along planned paths is essential for enabling independent long-distance mobility. Current reactive approaches only correct deviations after they occur and ignore VI users’ walking dynamics (e.g., reaction latency and heading drift), resulting in frequent interventions that increase cognitive load, reduce walking efficiency, and may lead to missed turns. To address these limitations, we propose a predictive path-following approach built on a walking world model, which enables proactive guidance through vibrotactile commands. Specifically, the walking world model predicts a user’s future state after receiving a given command. To mitigate the inefficiency of collecting action-annotated walking data, we exploit unannotated free-walking data to enhance model generalization: the model first undergoes self-supervised pre-training on a large unannotated dataset to learn general gait patterns, and is then fine-tuned on action-annotated data to model a user’s walking dynamics under guidance commands. Integrated with model predictive control (MPC) that explicitly accounts for the user’s cognitive load, our method proactively optimizes instructions to minimize deviation, ensure safety, and reduce cognitive load. Experiments show significant improvements in walking speed and cognitive load over reactive baselines.
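To make the planning loop concrete, the following is a rough Python sketch (our own illustration, not the authors' released code) of how a learned walking world model could be wrapped in a sampling-based MPC that trades off path deviation, obstacle proximity, and command switching as a simple proxy for cognitive load. The predict_rollout interface, the command set, and all weights are assumptions.

import itertools
import numpy as np

# Illustrative sketch only: MPC around a learned walking world model.
COMMANDS = ["turn_left", "go_straight", "turn_right"]

def dist_to_path(p, path):
    """Distance from a predicted position to the closest waypoint of the planned path."""
    return float(np.min(np.linalg.norm(np.asarray(path) - p, axis=1)))

def dist_to_obstacles(p, obstacles):
    """Distance from a predicted position to the nearest obstacle point."""
    return float(np.min(np.linalg.norm(np.asarray(obstacles) - p, axis=1)))

def plan_next_command(world_model, state_history, path, obstacles,
                      horizon=5, w_dev=1.0, w_obs=2.0, w_switch=0.5):
    """Enumerate command sequences, score their predicted rollouts, return the best first command."""
    best_seq, best_cost = None, np.inf
    for seq in itertools.product(COMMANDS, repeat=horizon):
        # Hypothetical world-model call: predicts the user's future 2D positions
        # given the recent state history and a candidate command sequence.
        pred = world_model.predict_rollout(state_history, seq)  # shape (horizon, 2)
        deviation = sum(dist_to_path(p, path) for p in pred)
        danger = sum(max(0.0, 1.0 - dist_to_obstacles(p, obstacles)) for p in pred)
        switches = sum(a != b for a, b in zip(seq[:-1], seq[1:]))  # fewer changes -> lower load
        cost = w_dev * deviation + w_obs * danger + w_switch * switches
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]  # execute the first command, then replan at the next step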

Method

Indoor Test

Indoor path-following experiments with trajectories, speed profiles, and quantitative metrics.

(a) Visualization of M1’s walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking world model.

(b) Visualization of V3’s walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking world model.

(c) Visualization of V1’s walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking world model.

(d) Visualization of V1’s walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking world model.

(e) Visualization of M1’s walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking world model.

(f) Visualization of M1’s walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking world model.

Indoor Scenario Results

Method Walking time (s) ↓ Velocity (m/s) ↑ Travel Length (m)
EM VI EM VI EM VI
Stanley 59.7 58.0 0.53 0.51 29.2 28.5
Pure Pursuit 63.4 56.6 0.51 0.50 30.1 27.8
Walking-WM (ours) 49.3 45.1 0.65 0.66 31.2 29.5

Table 1. Comparison of path-following approaches under the static map. Bold indicates the best performance for each metric. “EM” stands for the eye-masked participants, while “VI” stands for the visually impaired participants.

Method COV of walking speed (mean ± std)
Pure Pursuit 0.28 ± 0.17
Stanley 0.31 ± 0.19
Walking-WM (ours) 0.19 ± 0.10

Table 2. Coefficient of variation (COV) of walking speed across indoor trials. Lower COV indicates more stable walking speed.
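For reference, the coefficient of variation here is the standard deviation of walking speed within a trial divided by its mean, so lower values mean steadier walking. A minimal illustration with made-up numbers:

import numpy as np

def speed_cov(speeds):
    """Coefficient of variation of a walking-speed series: std / mean."""
    speeds = np.asarray(speeds, dtype=float)
    return speeds.std() / speeds.mean()

# A steadier trial (first) yields a lower COV than a stop-and-go trial (second).
print(speed_cov([0.60, 0.62, 0.58, 0.61]))  # ~0.02
print(speed_cov([0.30, 0.70, 0.45, 0.65]))  # ~0.31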

Outdoor Test

Dynamic real-world path-following experiments on two daily-life routes, with objective performance and subjective evaluations.

(a) Layouts of Route 1 (240 m) and Route 2 (300 m), with key street views marked along each path.

Outdoor Scenario Results

Method Velocity (m/s) ↑ Collisions (/trial) ↓
EM VI EM VI
Cane+App 0.62 | 0.62 0.74 | 0.71 1.00 | 1.50 0.75 | 0.75
System-PP 0.53 | 0.55 0.58 | 0.54 0.50 | 0.75 0.25 | 0.75
System-WM (ours) 0.67 | 0.70 0.74 | 0.73 0.25 | 0.25 0.25 | 0.50

Table 3. Walking performance on two outdoor routes. Each cell reports Route 1 | Route 2. Higher velocity and fewer collisions are better. “EM” denotes eye-masked participants, and “VI” denotes visually impaired participants.

Subjective Evaluation (Likert Scale)

(b) Subjective ratings of safety, cognitive ease, and helpfulness for Pure Pursuit and Walking-WM, using a 7-point Likert scale (7 = very safe / very easy / very helpful). Bars represent mean ratings with standard error.

SWORD Workload Comparison

Participant V5 V6 V7 V8 M5 M6 M7 M8
Dominance score (Walking-WM vs Pure Pursuit) -1 -1 0 -2 -1 -2 -2 -1

Table 4. SWORD dominance scores of perceived workload when comparing Walking-WM to Pure Pursuit. Negative scores indicate that Walking-WM is perceived as less demanding than Pure Pursuit, while positive scores indicate that it is perceived as more demanding.

Open-Loop Ablation Study

Ablation on the Walking world model’s training, highlighting the effect of the learning-based approach and of using a large amount of unannotated data.

Comparison between different model configurations in open-loop experiments

Model Pre-training Prediction ADE (m) ↓ Prediction FDE (m) ↓
Hybrid model No 0.91 ± 0.10 1.69 ± 0.21
World model w/o pre-train No 0.44 ± 0.04 0.82 ± 0.09
World model w/ pre-train (ours) Yes 0.39 ± 0.03 0.73 ± 0.07

Table 5. Ablation results on different model variants. “Hybrid model” is a hand-crafted dynamics model adapted from the bicycle model, which maps the discrete “turn left / go straight / turn right” commands to a constant speed, a fixed steering angle, and a fixed turning duration; “World model w/o pre-train” trains the world model only on action-annotated data; “World model w/ pre-train” first pre-trains on large-scale free-walking data and is then fine-tuned on user-specific action-annotated data.
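For intuition, a hybrid baseline of this kind can be sketched as a bicycle-style rollout that maps each discrete command to a fixed steering rate applied at constant speed, with ADE/FDE computed as the mean and final distance between predicted and actual positions. This is our own minimal sketch; the constants, time step, and function names are placeholders rather than the paper's calibrated values.

import numpy as np

# Minimal sketch of a hand-crafted "hybrid" baseline: each discrete command
# maps to a fixed steering rate and is applied at a constant walking speed.
STEER_RATE = {"turn_left": 0.4, "go_straight": 0.0, "turn_right": -0.4}  # rad/s (placeholder values)

def hybrid_rollout(x, y, heading, commands, speed=0.6, dt=1.0):
    """Predict future (x, y) positions for a sequence of discrete commands."""
    traj = []
    for cmd in commands:
        heading += STEER_RATE[cmd] * dt      # fixed turning per command
        x += speed * np.cos(heading) * dt    # constant speed, no sway or latency
        y += speed * np.sin(heading) * dt
        traj.append((x, y))
    return np.array(traj)

def ade_fde(pred, gt):
    """Average and final displacement errors between predicted and true positions."""
    d = np.linalg.norm(pred - gt, axis=1)
    return d.mean(), d[-1]

# Example: score a 5-step prediction against a (here simulated) recorded trajectory.
pred = hybrid_rollout(0.0, 0.0, 0.0, ["go_straight"] * 3 + ["turn_left"] * 2)
gt = pred + np.random.normal(scale=0.3, size=pred.shape)  # stand-in for real data
print(ade_fde(pred, gt))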

What does the world model add to planning compared with the basic (hybrid) model?

We evaluated the above hybrid model on our collected action-annotated walking data; the results are shown in Table 5. They demonstrate that the hybrid model produces high prediction errors and performs worse than the world model. To investigate what our world model has learned beyond this basic baseline, we provide visual comparisons of representative prediction outcomes in the figures below, with detailed analysis:

(a) Periodic variations in lateral position during “go straight” motion

(b) Steering compensation due to extended reaction time

(c) Inconsistent steering angle within a short period

Compared with mechanical systems (such as cars), human walking exhibits several unique characteristics, as illustrated above. When walking straight, a person’s center of mass oscillates periodically; during turning, due to reaction delays, a person may continue turning even after a “go straight” command has been issued; and within a very short time window, identical commands can still lead to different responses. The world model, being learned from data, is able to capture and model these human walking characteristics, whereas the basic model fails to do so, resulting in larger prediction errors.
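As a toy numerical illustration of the first point (our own example, not the paper's data): a predictor that ignores the periodic lateral sway of the center of mass incurs a small but irreducible error even when the “go straight” command is followed perfectly. The sway amplitude and frequency below are assumed values.

import numpy as np

# Toy example: during "go straight", the center of mass sways laterally
# (assumed 5 cm amplitude at 1 Hz); a straight-line predictor ignores this.
t = np.arange(0.0, 5.0, 0.25)                                   # 5 s horizon
gt = np.stack([0.6 * t, 0.05 * np.sin(2 * np.pi * t)], axis=1)  # swaying path
pred = np.stack([0.6 * t, np.zeros_like(t)], axis=1)            # straight-line guess
err = np.linalg.norm(pred - gt, axis=1)
print(err.mean())  # ~0.025 m of error from sway alone, before any control error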

Does extending the planning horizon yield better results for the basic model?

(d) Map for the extra static-map experiment, together with a user’s walking trajectories using the basic hybrid model with a planning horizon of 10 and Walking-WM with a planning horizon of 5.

Method Walking time (s) ↓ Command frequency (/m) ↓
Pure Pursuit 84.5 0.62
Hybrid model (H=5) 72.1 0.52
Hybrid model (H=10) 99.1 0.71
Walking-WM (H=5, ours) 66.3 0.46

Table 6. Results of the extra static-map experiment.

To answer this question, we conducted an additional set of static-map experiments using the map shown in Figure (d), which is more complex than the original static map and contains more obstacles. In this experiment, the hybrid model with a planning horizon of 10 resulted in the longest average completion time and the highest command frequency, whereas the world model achieved the shortest time and the lowest frequency. The poorer performance of the hybrid model with the longer horizon was mainly due to its large prediction errors, which caused substantial deviations between the user’s actual trajectory and the system’s predictions and often brought the user dangerously close to obstacles in this more complex map. To guide the user back to safety, the system then issued more turning commands for obstacle avoidance.

Real-world Walking Test Video

Demonstration videos of our Walking world model in real-world walking tests.

BibTeX

@article{ju2025walking,
  title   = {Walking World Model for Visually Impaired Path Following},
  author  = {Ju, Haokun and Zhang, Lixuan and Cao, Xiangyu and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  journal = {IEEE Robotics and Automation Letters},
  year    = {2025},
  volume  = {X},
  number  = {Y},
  pages   = {1--8},
  note    = {To appear},
  url     = {https://haokunju.github.io/walking-world-model}
}