Abstract
Guiding visually impaired (VI) individuals along planned paths is essential for enabling independent long-distance mobility. Current reactive approaches only correct deviations after they occur and ignore VI users' walking dynamics (e.g., reaction latency and heading drift), resulting in frequent interventions that increase cognitive load, reduce walking efficiency, and may lead to missed turns. To address these limitations, we propose a predictive path-following approach built on a walking world model that enables proactive guidance through vibrotactile commands. Specifically, the walking world model predicts a user's future state after a given command is issued. To mitigate the cost of collecting action-annotated walking data, we exploit unannotated free-walking data to improve model generalization: the model first undergoes self-supervised pre-training on a large unannotated dataset to learn general gait patterns, and is then fine-tuned on action-labeled data to model how users respond to guidance commands. Integrated with model predictive control (MPC) that explicitly accounts for the user's cognitive load, our method proactively optimizes instructions to minimize deviation, ensure safety, and reduce cognitive load. Experiments show significant improvements in walking speed and cognitive load over reactive baselines.
Method
Training paradigm of the Walking-WM.
Pre-training from action-free walking data. Given an action-free walking sequence $s_1, s_2, \ldots, s_T$, a single-layer bidirectional GRU $g$ serves as an action extractor that infers latent actions between adjacent states: $\{a^*_1, \ldots, a^*_{T-1}\} = g(s_1, \ldots, s_T)$. These latent actions drive a recurrent state-space world model composed of a recurrent model, a latent transition model, a state decoder, and a representation model: $h_{t+1} = r(h_t, z_t, a^*_t)$, $\hat{z}_{t+1} = h(h_{t+1})$, $\hat{s}_{t+1} = d(\hat{z}_{t+1})$, $z_{t+1} = q(h_{t+1}, s_{t+1})$. All components are trained jointly with a reconstruction and consistency loss $\mathcal{L}_{\text{pre-train}} = \|\hat{z}_t - z_t\|^2 + \alpha\|s_t - d(z_t)\|^2$. Here, the first term encourages consistent latent dynamics, and the second term encourages accurate reconstruction of observable states.
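To make this concrete, below is a minimal PyTorch sketch of one pre-training step. The module sizes, the names (`ActionFreeWorldModel`, `S`, `Z`, `A`, `HID`), and the choice of single linear layers for the transition, decoder, and representation models are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

S, Z, A, HID = 6, 32, 16, 64  # assumed state / latent / action / hidden sizes

class ActionFreeWorldModel(nn.Module):
    """Pre-training sketch: latent actions from g drive the recurrent world model."""

    def __init__(self):
        super().__init__()
        # g: single-layer bidirectional GRU inferring latent actions between states
        self.extractor = nn.GRU(S, A // 2, bidirectional=True, batch_first=True)
        self.recurrent = nn.GRUCell(Z + A, HID)   # r(h_t, z_t, a*_t) -> h_{t+1}
        self.transition = nn.Linear(HID, Z)       # latent transition: h_{t+1} -> ẑ_{t+1}
        self.decoder = nn.Linear(Z, S)            # d: latent -> observable state
        self.repr = nn.Linear(HID + S, Z)         # q: posterior latent from (h, s)

    def pretrain_loss(self, states, alpha=1.0):   # states: (B, T, S)
        latent_actions, _ = self.extractor(states)  # (B, T, A); first T-1 steps used
        B, T, _ = states.shape
        h = states.new_zeros(B, HID)
        z = self.repr(torch.cat([h, states[:, 0]], dim=-1))
        loss = 0.0
        for t in range(T - 1):
            h = self.recurrent(torch.cat([z, latent_actions[:, t]], dim=-1), h)
            z_prior = self.transition(h)                             # ẑ_{t+1}
            z = self.repr(torch.cat([h, states[:, t + 1]], dim=-1))  # z_{t+1}
            s_hat = self.decoder(z)                                  # d(z_{t+1})
            # L = ‖ẑ − z‖² + α‖s − d(z)‖², accumulated over the sequence
            loss = loss + ((z_prior - z) ** 2).sum(-1).mean() \
                        + alpha * ((states[:, t + 1] - s_hat) ** 2).sum(-1).mean()
        return loss / (T - 1)
```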
Fine-tuning with action-annotated walking data. After pre-training, the world model is fine-tuned on trajectories with explicit guidance commands $a_t$. To reuse the latent dynamics learned from free walking, an action adapter $f$ (a 2-layer MLP) maps each state–action pair into the same latent action space: $\tilde{a}_t = f(s_t, a_t)$. The world model is then updated as $h_{t+1} = r(h_t, z_t, \tilde{a}_t)$, $\hat{z}_{t+1} = h(h_{t+1})$, $\hat{s}_{t+1} = d(\hat{z}_{t+1})$, $z_{t+1} = q(h_{t+1}, s_{t+1})$. The fine-tuning objective is $\mathcal{L}_{\text{fine-tune}} = \|\hat{z}_t - z_t\|^2 + \alpha\|s_t - d(z_t)\|^2 + \lambda\|w\|^2$, where $\|w\|$ is the L2 norm of the model parameters. In other words, during pre-training the latent actions come from the extractor $g$, while during fine-tuning the implicit actions are obtained by transforming explicit commands through the adapter $f$.
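A matching sketch of the fine-tuning stage, reusing the modules defined above; the `ActionAdapter` hidden width and the hyperparameter defaults (`alpha`, `lam`) are assumptions for illustration.

```python
class ActionAdapter(nn.Module):
    """f: (s_t, a_t) -> latent action ã_t, implemented as a 2-layer MLP."""

    def __init__(self, cmd_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(S + cmd_dim, 64),
                                 nn.ReLU(),
                                 nn.Linear(64, A))

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1))

def finetune_loss(wm, adapter, states, commands, alpha=1.0, lam=1e-4):
    """states: (B, T, S); commands: (B, T-1, 1) with values in {-1, 0, 1}."""
    B, T, _ = states.shape
    h = states.new_zeros(B, HID)
    z = wm.repr(torch.cat([h, states[:, 0]], dim=-1))
    loss = 0.0
    for t in range(T - 1):
        a_tilde = adapter(states[:, t], commands[:, t])   # ã_t = f(s_t, a_t)
        h = wm.recurrent(torch.cat([z, a_tilde], dim=-1), h)
        z_prior = wm.transition(h)
        z = wm.repr(torch.cat([h, states[:, t + 1]], dim=-1))
        s_hat = wm.decoder(z)
        loss = loss + ((z_prior - z) ** 2).sum(-1).mean() \
                    + alpha * ((states[:, t + 1] - s_hat) ** 2).sum(-1).mean()
    # λ‖w‖²: L2 regularization over all trainable parameters
    l2 = sum((p ** 2).sum() for p in list(wm.parameters()) + list(adapter.parameters()))
    return loss / (T - 1) + lam * l2
```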
Predictive path-following with a learned walking world model. Left: an MPC loop samples command sequences, scores them with the world model under a multi-objective reward with safety constraints, closes the loop with the user's current state, and executes the first action of the best sequence. Right: the world model maintains a recurrent latent state to encode history, advances the latent via a transition model, and decodes it to predict the next user state.
World model inside the MPC loop. After fine-tuning, the learned world model $\hat{p}$ approximates the unknown user dynamics $p$. At time $t$, for a candidate sequence of discrete guidance commands $\{a_t, \ldots, a_{t+H-1}\}$, the controller rolls out future states $\hat{s}_{t+1}, \ldots, \hat{s}_{t+H}$ recursively: $\hat{s}_{t+1} = \hat{p}(s_t, a_t)$ and $\hat{s}_{t+k} = \hat{p}(\hat{s}_{t+k-1}, a_{t+k-1})$ for $k = 2, \ldots, H$. The corresponding 3D positions $p_{t+k}$ are obtained by projecting the predicted states with a fixed matrix $T_p$.
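A minimal sketch of this recursive rollout, under the assumption that the fine-tuned adapter, latent update, and decoder are wrapped into a single one-step predictor `p_hat`; all names here are illustrative.

```python
import numpy as np

def rollout(p_hat, s_t, commands):
    """Recursively roll out ŝ_{t+1}, ..., ŝ_{t+H} for one candidate command sequence.

    p_hat: one-step predictor p̂(s, a) -> ŝ wrapping the adapter, recurrent
    latent update, and decoder; commands: length-H sequence from {-1, 0, 1}.
    """
    preds, s = [], s_t
    for a in commands:                 # ŝ_{t+k} = p̂(ŝ_{t+k-1}, a_{t+k-1})
        s = p_hat(s, a)
        preds.append(s)
    return preds

def project_positions(preds, T_p):
    """p_{t+k} = T_p ŝ_{t+k}: project predicted states to 3D positions."""
    return [T_p @ s for s in preds]
```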
MPC objective and safety constraints.
The planner selects the command sequence $\{a_t, \ldots, a_{t+H-1}\}$ that maximizes a cumulative reward $R$ balancing path tracking, command frequency, goal reaching, and safety:

$$
R = \sum_{k=1}^{H-1} \Big( -\big\|p_{t+k} - c(p_{t+k}, P^*)\big\|^2 - \omega_1 \|a_{t+k}\|^2 \Big) - \omega_2 \big\|p_{t+H} - p^*_n\big\|^2 - \omega_3 \sum_{k=1}^{H} \frac{1}{e(p_{t+k}, \Omega)}.
$$

Here, $c(\cdot, P^*)$ returns the closest point on the reference path, $e(\cdot, \Omega)$ is the distance to the nearest obstacle in the obstacle set $\Omega$, and $\omega_1, \omega_2, \omega_3$ are weighting coefficients; the inverse-distance term penalizes trajectories that pass close to obstacles. The constraints enforce that the predicted motion is generated by the world model, that trajectories remain collision free, and that commands lie in the discrete set $\{-1, 0, 1\}$, corresponding to "turn left", "go straight", and "turn right". At each time step, only the first action $a^*_t$ of the optimized sequence is sent as the guidance command, and the optimization is repeated in a receding-horizon manner.
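The sketch below shows one receding-horizon step under stated assumptions: candidates are drawn by random shooting over the discrete command set (the actual optimizer may differ), `rollout` and `project_positions` come from the sketch above, `closest_point` and `obstacle_dist` are placeholder geometry helpers standing in for $c(\cdot, P^*)$ and $e(\cdot, \Omega)$, and the weights are arbitrary defaults.

```python
def reward(commands, pos, closest_point, obstacle_dist, P_star, goal,
           w1=0.1, w2=1.0, w3=0.05, eps=1e-6):
    """Cumulative reward R; pos[k-1] = p_{t+k}, commands[k] = a_{t+k}."""
    H = len(pos)
    R = 0.0
    for k in range(1, H):                                  # k = 1 .. H-1
        dev = pos[k - 1] - closest_point(pos[k - 1], P_star)
        R -= float(dev @ dev)                              # path tracking
        R -= w1 * float(commands[k] ** 2)                  # command frequency
    gerr = pos[-1] - goal
    R -= w2 * float(gerr @ gerr)                           # goal reaching
    R -= w3 * sum(1.0 / max(obstacle_dist(p), eps) for p in pos)  # safety
    return R

def plan_step(s_t, p_hat, T_p, closest_point, obstacle_dist, P_star, goal,
              H=5, n_samples=256, rng=None):
    """One receding-horizon step: sample, score, keep feasible, execute a*_t."""
    rng = rng or np.random.default_rng()
    best_cmds, best_R = None, -np.inf
    for _ in range(n_samples):
        cmds = rng.choice([-1, 0, 1], size=H)              # discrete command set
        pos = project_positions(rollout(p_hat, s_t, cmds), T_p)
        if any(obstacle_dist(p) <= 0.0 for p in pos):      # collision constraint
            continue
        R = reward(cmds, pos, closest_point, obstacle_dist, P_star, goal)
        if R > best_R:
            best_cmds, best_R = cmds, R
    return best_cmds[0] if best_cmds is not None else 0    # default: go straight
```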
Indoor Test
Indoor path-following experiments with trajectories, speed profiles, and quantitative metrics.
(a) Visualization of V3's walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(b) Visualization of V3's walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(c) Visualization of V1's walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(d) Visualization of V1's walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(e) Visualization of M1's walking trajectories under three methods: Stanley, Pure Pursuit, and our Walking-WM.
(f) Visualization of M1's walking speed profiles under three methods: Stanley, Pure Pursuit, and our Walking-WM.
Indoor Scenario Results
| Method | Walking Time (s) ↓ | | Velocity (m/s) ↑ | | Travel Length (m) | |
|---|---|---|---|---|---|---|
| | EM | VI | EM | VI | EM | VI |
| Stanley | 59.7 | 58.0 | 0.53 | 0.51 | 29.2 | 28.5 |
| Pure Pursuit | 63.4 | 56.6 | 0.51 | 0.50 | 30.1 | 27.8 |
| Walking-WM (ours) | **49.3** | **45.1** | **0.65** | **0.66** | 31.2 | 29.5 |
Table 1. Comparison of path-following approaches on the static indoor map. Bold indicates the best performance for each metric. "EM" denotes the eye-masked scenario, and "VI" denotes the visually impaired participants.
| Method | COV of walking speed (mean ± std) ↓ |
|---|---|
| Pure Pursuit | 0.28 ± 0.17 |
| Stanley | 0.31 ± 0.19 |
| Walking-WM (ours) | **0.19 ± 0.10** |
Table 2. Coefficient of variation (COV) of walking speed across indoor trials. Lower COV indicates more stable walking speed.
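For reference, the COV of a trial is simply the standard deviation of its walking-speed profile divided by the mean; a minimal computation (the function name and input format are assumptions):

```python
import numpy as np

def cov_of_speed(speeds):
    """Coefficient of variation of one trial's speed profile: std / mean."""
    speeds = np.asarray(speeds, dtype=float)
    return float(speeds.std() / speeds.mean())

# e.g., cov_of_speed([0.61, 0.58, 0.66, 0.63]) -> ~0.05
```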
Outdoor Test
Dynamic real-world path-following experiments on two daily-life routes, with objective performance and subjective evaluations.
(a) Layouts of Route 1 (240 m) and Route 2 (300 m), with key street views marked along each path.
Outdoor Scenario Results
| Method | Velocity (m/s) ↑ | | Collisions (/trial) ↓ | |
|---|---|---|---|---|
| | EM | VI | EM | VI |
| Cane+App | 0.62 / 0.62 | 0.74 / 0.71 | 1.00 / 1.50 | 0.75 / 0.75 |
| System-PP | 0.53 / 0.55 | 0.58 / 0.54 | 0.50 / 0.75 | 0.25 / 0.75 |
| System-WM (ours) | 0.67 / 0.70 | 0.74 / 0.73 | 0.25 / 0.25 | 0.25 / 0.50 |
Table 3. Walking performance on two outdoor routes. Each cell reports Route 1 / Route 2. Higher velocity and fewer collisions are better. "EM" denotes eye-masked participants, and "VI" denotes visually impaired participants.
Subjective Evaluation (Likert Scale)
(b) Subjective ratings of safety, cognitive ease, and helpfulness for Pure Pursuit and Walking-WM, using a 7-point Likert scale (7 = very safe / very easy / very helpful). Bars represent mean ratings with standard error.
SWORD Workload Comparison
| Participant | V5 | V6 | V7 | V8 | M5 | M6 | M7 | M8 |
|---|---|---|---|---|---|---|---|---|
| Dominance score (Walking-WM vs. Pure Pursuit) | −1 | −1 | 0 | −2 | −1 | −2 | −2 | −1 |
Table 4. SWORD dominance scores of perceived workload when comparing Walking-WM to Pure Pursuit. Negative scores indicate that Walking-WM is perceived as less demanding; positive scores indicate it is perceived as more demanding.
Open-Loop Ablation Study
Ablation on the Walking world model's training, highlighting the effect of the learning-based approach and the use of a large amount of unannotated data.
Comparison between different model configurations in open-loop experiments
| Model | Pre-training | Prediction ADE (m) ↓ | Prediction FDE (m) ↓ |
|---|---|---|---|
| Hybrid model | No | 0.91 ± 0.10 | 1.69 ± 0.21 |
| World model w/o pre-train | No | 0.44 ± 0.04 | 0.82 ± 0.09 |
| World model w/ pre-train (ours) | Yes | 0.39 ± 0.03 | 0.73 ± 0.07 |
Table 5. Ablation results on different model variants. The "Hybrid model" is a hand-crafted dynamics model adapted from the bicycle model with discrete "turn left / go straight / turn right" commands, assuming constant speed, a fixed steering angle, and a fixed turning duration; "World model w/o pre-train" trains the world model only on action-annotated data; "World model w/ pre-train" first pre-trains on large-scale free-walking data and is then fine-tuned on user-specific action-annotated data.
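To make the hybrid baseline concrete, below is a minimal sketch of one bicycle-model step under the stated assumptions; the speed, steering-angle, wheelbase, and turning-duration constants are placeholders, not the values used in our experiments.

```python
import numpy as np

def hybrid_step(state, command, v=0.6, delta=np.deg2rad(20.0),
                L=0.8, T=1.0, dt=0.1):
    """One-command step of the bicycle-model baseline.

    state = (x, y, theta); command in {-1, 0, 1} maps to a fixed steering
    angle of -delta / 0 / +delta, applied for a fixed duration T at
    constant speed v. All constants here are illustrative placeholders.
    """
    x, y, theta = state
    steer = command * delta
    for _ in range(int(T / dt)):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += (v / L) * np.tan(steer) * dt   # bicycle-model yaw rate
    return np.array([x, y, theta])
```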
What does the world model add to planning compared to the basic model (hybrid model)?
We evaluated the above hybrid model on our collected action-annotated walking data; the results are shown in Table 5. They demonstrate that the hybrid model produces high prediction errors and performs worse than the world model. To investigate what the world model learns beyond this baseline, we provide visual comparisons of representative prediction outcomes in the figures below, with detailed analysis:
(a) Periodic variations in lateral position during "go straight" motion
(b) Steering compensation due to extended reaction time
(c) Inconsistent steering angle within a short period
Compared with mechanical systems such as cars, human walking exhibits several unique characteristics, as illustrated above. When walking straight, a person's center of mass oscillates periodically; during turning, reaction delays mean a person may keep turning even after a "go straight" command has been issued; and within a very short time window, identical commands can still produce different responses. Because the world model is learned from data, it captures these characteristics of human walking, whereas the hand-crafted basic model cannot, resulting in larger prediction errors.
Does extending the planning horizon yield better results for the basic model?
(d) Map for the additional static-map experiment, with a user's walking trajectory under the basic hybrid model with a planning horizon of 10 and under Walking-WM with a planning horizon of 5.
| Method | Walking time (s) ↓ | Command frequency (/m) ↓ |
|---|---|---|
| Pure Pursuit | 84.5 | 0.62 |
| Hybrid model (H=5) | 72.1 | 0.52 |
| Hybrid model (H=10) | 99.1 | 0.71 |
| Walking-WM (H=5, ours) | 66.3 | 0.46 |
Table 6. Results of the additional static-map experiment.
To answer this question, we conducted a new set of static-map experiments. We deployed the map shown in Figure (d), which is more complex than the original static map and contains more obstacles. In this experiment, the hybrid model with a planning horizon of 10 resulted in the longest average completion time and the highest command frequency, whereas the world model achieved the shortest time and the lowest frequency. The poorer performance of the hybrid model with the longer horizon is mainly due to large prediction errors, which caused substantial deviations between the user's actual trajectory and the system's predictions, often bringing the user dangerously close to obstacles in this complex map. To guide the user back to safety, the system then had to issue more turning commands for obstacle avoidance.
Real-world Walking Test Video
Demonstration videos of our Walking world model in real-world walking tests.
BibTeX
@article{ju2025walking,
title = {Walking World Model for Visually Impaired Path Following},
author = {Ju, Haokun and Zhang, Lixuan and Cao, Xiangyu and Kan, Meina and Shan, Shiguang and Chen, Xilin},
journal = {IEEE Robotics and Automation Letters},
year = {2025},
volume = {X},
number = {Y},
pages = {1--8},
note = {To appear},
url = {https://haokunju.github.io/walking-world-model}
}