Vision-Language-Action pipeline — from data generation to model training.
This page documents the full cube_pick pipeline — from simulation capture → export to a LeRobot dataset → format alignment / conversion → SmolVLA training → closed-loop evaluation — with every explanation tied to its exact source location.
vector-os-nano/ (nested repo · SDK) VLA/ (outer research repo)
Four stages form one pipeline. scripts/run_pipeline.sh (outer repo) chains them
end-to-end: collect 50 episodes → export → train → evaluate.
Inside MuJoCo, PickSkill performs the grasp automatically while every joint state,
gripper state, camera frame, and the high-level instruction are recorded frame by frame.
Successful episodes are kept; failed ones are deleted outright.
vector-os-nano/scripts/collection/tasks/sim/collect_cube_pick.py vector-os-nano/vector_os_nano/data/recorder.py
_run_episode_skill()
- _reset_scene() places the cube at the scheduled XY, arm returns home (not recorded)
- writer.start_episode(instruction="pick up the red cube")
- PickSkill().execute({"object_label":"red cube","mode":"hold"})
- result.success and tap_gripper.is_holding()
- tap_gripper.open() release (still inside the recording)
Both conditions must hold: collect_cube_pick.py:245.
main() loop
- ep_dir = writer.stop_episode() writes parquet + meta.json
- writer.finalize_episode(ep_dir) encodes MP4 and keeps it
- shutil.rmtree(ep_dir) + drop its pending-encode entry
- writer.finalize_all() encodes any remaining MP4
Keep/discard logic: collect_cube_pick.py:464–483.
The default run is --episodes=50. Five fixed XY positions are cycled in blocks of
RUNS_PER_POSITION=2 (the position index advances every 2 episodes), so 50 episodes yield
10 per XY. The file also contains a deprecated hand-written trajectory _run_episode()
(with XY/timing noise) kept only for reference — not used for new captures.
CUBE_POSITIONS_XY = [
(0.18, 0.00), # close center
(0.22, 0.12), # center-left
(0.22, -0.12), # center-right
(0.28, 0.06), # far-left
(0.28, -0.06), # far-right
]
RUNS_PER_POSITION = 2 # collect_cube_pick.py:70–77
# pos_idx = ep_idx // RUNS_PER_POSITION → block of 2 episodes per XY
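The block scheduling above can be checked with a short sketch. The modulo wrap-around in `scheduled_xy` (a hypothetical helper name) is an assumption inferred from the word "cycled"; the actual indexing in collect_cube_pick.py may be written differently.

```python
from collections import Counter

CUBE_POSITIONS_XY = [
    (0.18, 0.00),   # close center
    (0.22, 0.12),   # center-left
    (0.22, -0.12),  # center-right
    (0.28, 0.06),   # far-left
    (0.28, -0.06),  # far-right
]
RUNS_PER_POSITION = 2


def scheduled_xy(ep_idx: int) -> tuple[float, float]:
    """Return the cube XY scheduled for a given episode index."""
    # Assumption: the block index wraps around so the five positions repeat.
    pos_idx = (ep_idx // RUNS_PER_POSITION) % len(CUBE_POSITIONS_XY)
    return CUBE_POSITIONS_XY[pos_idx]


# Count how often each XY is visited in a default 50-episode run.
counts = Counter(scheduled_xy(ep) for ep in range(50))
for xy, n in counts.items():
    print(xy, n)
```

Under that assumption, every position is visited in 5 blocks of 2 episodes, which matches the "10 per XY" figure.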
The hard part of capture is keeping images and joint state aligned. Simulation uses
mode="sim_sync": no background threads — a hook fires on each MuJoCo step and,
from the same mjData, renders the cameras and reads the joints together, so the
image-to-state time delta is exactly 0.
ArmTap / GripperTap wrap the real arm/gripper, pass every call straight
through, and only append a log entry for move_joints / move_cartesian /
open / close. They change no timing or behavior.
recorder.py:55 (ArmTap) · :146 (GripperTap)
_build_sim_hook()
Headless sim runs faster than real time, so wall-clock timestamps cluster. The hook instead emits a
synthetic uniform timestamp t_ns = episode_start + sample_id × period, shared by
both state and image frames (delta = 0).
recorder.py:490–548
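A minimal sketch of the synthetic timestamp formula, not the recorder's actual code. The integer-nanosecond period obtained with `round()` and the helper name `synthetic_t_ns` are assumptions made for illustration.

```python
STATE_HZ = 30
# Assumption: the period is held as an integer number of nanoseconds.
PERIOD_NS = round(1e9 / STATE_HZ)


def synthetic_t_ns(episode_start_ns: int, sample_id: int) -> int:
    """t_ns = episode_start + sample_id * period, independent of wall-clock time."""
    return episode_start_ns + sample_id * PERIOD_NS


episode_start_ns = 1_000_000_000
state_ts = [synthetic_t_ns(episode_start_ns, i) for i in range(5)]
# In sim_sync the same stamp is written for the image frame of that sample.
image_ts = list(state_ts)
deltas = [s - i for s, i in zip(state_ts, image_ts)]
print(state_ts, deltas)
```

Because both streams reuse one stamp per sample, the spacing is uniform and the state-to-image delta is zero regardless of how fast the headless sim actually ran.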
Episodes are driven by PickSkill + tap_arm/tap_gripper, not hand-written
trajectories. (The pre-episode reset/home calls arm.move_joints() on the raw arm directly, outside the recording; that is intentional.)
raw_episodes/
└── episode_0000/
    ├── meta.json              # instruction, spec_version, t_start_ns, sampling_mode, cube_initial_pos …
    ├── commands.parquet       # every move_joints / gripper command + timestamp
    ├── observations.parquet   # 30Hz: 5 joint angles + 1 gripper + t_ns + t_sim_s
    └── videos/
        ├── camera1.mp4        # overhead (camera1 = OBS_IMAGE_1)
        ├── camera1_meta.parquet
        ├── camera2.mp4        # wrist_cam (camera2 = OBS_IMAGE_2)
        ├── camera2_meta.parquet
        ├── camera3.mp4        # front (camera3 = OBS_IMAGE_3)
        └── camera3_meta.parquet
Camera-name mapping: CAMERA_MAP (collect_cube_pick.py:93).
MP4 encoded via imageio-ffmpeg (libx264, crf 23).
Convert raw_episodes/ into a standard LeRobot Dataset v3. This step is task-agnostic; its core job is to resample each episode onto a strict 30Hz timeline and to generate the action labels used for training.
vector-os-nano/scripts/collection/export_to_lerobot.py vector-os-nano/vector_os_nano/data/exporter.py
A uniform set of timestamps is generated across the observation span; at each timestamp the
nearest observation and image are taken. The exporter handles arbitrary raw episode timing in
general; note that current sim_sync cube_pick data is already uniformly stamped at 30Hz, so the
resampling is close to a pass-through for this data.
t_start, t_end = obs_ts[0], obs_ts[-1]
period_ns = 1e9 / fps  # fps = 30
n_frames = max(1, (t_end - t_start) / period_ns)
frame_times = [t_start + i*period_ns for i in range(n_frames)]
# per frame_time: nearest observation → observation.state (6,)
#                 nearest image       → observation.images.*
# exporter.py:160–217
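The nearest-sample lookup can be sketched with the standard library alone. This is an illustration, not the exporter's code: `nearest_index` and `resample` are hypothetical names, the `int()` truncation of the frame count is an assumption, and the stale flag mirrors the --image-tolerance-ms threshold (default 50 ms).

```python
from bisect import bisect_left


def nearest_index(sorted_ts: list[int], t: float) -> int:
    """Index of the entry in sorted_ts closest to t (ties go to the earlier entry)."""
    pos = bisect_left(sorted_ts, t)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return pos - 1 if (t - before) <= (after - t) else pos


def resample(obs_ts, img_ts, fps=30, image_tolerance_ms=50.0):
    """Return (obs_index, img_index, stale) for each frame of a uniform timeline."""
    period_ns = 1e9 / fps
    t_start, t_end = obs_ts[0], obs_ts[-1]
    # Assumption: the frame count is truncated to an integer.
    n_frames = max(1, int((t_end - t_start) / period_ns))
    frames = []
    for i in range(n_frames):
        t = t_start + i * period_ns
        obs_i = nearest_index(obs_ts, t)
        img_i = nearest_index(img_ts, t)
        stale = abs(img_ts[img_i] - t) > image_tolerance_ms * 1e6
        frames.append((obs_i, img_i, stale))
    return frames


# Uniform 30Hz states, but a camera that dropped several frames.
obs_ts = [i * 33_333_333 for i in range(10)]
img_ts = [0, 33_333_333, 200_000_000]
frames = resample(obs_ts, img_ts)
stale_frames = [i for i, (_, _, stale) in enumerate(frames) if stale]
print(stale_frames)
```

With uniformly stamped states the observation index simply tracks the frame index, which is why the resampling is close to a pass-through for sim_sync data. Only the frames that sit far from any image are flagged.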
| Timestamp | From | Meaning | Used for |
|---|---|---|---|
| t_ns | observations.parquet | synthetic uniform timestamp (sim_sync: shared by state & images) | building the 30Hz timeline; nearest-state lookup |
| t_host_ns | cameraX_meta.parquet | per-frame image timestamp | nearest-image alignment + staleness check |
Images are aligned to each frame time by nearest t_host_ns;
if the gap > --image-tolerance-ms (default 50ms) the frame is marked stale
and counted in the report. exporter.py:96, 423–446
Note on t_host_ns: in sim_sync the recorder writes the
same synthetic t_ns into the image meta, so image/state alignment is exact by construction.
Only in async/real-hardware capture is t_host_ns a true host clock value, which is when
the 50ms staleness check actually does work. recorder.py:541–543
Default --action-mode next_state: the next observed state is used as the current
action target.
action[t] = observed_state[t+1]
(6,) = 5 joints + 1 gripper
delta_timestamps
idx = min(i+1, n-1): the last frame's action points at itself
action_is_pad = 1.0 marks that frame as padding (0.0 otherwise)
report.padded_frames
exporter.py:185–193
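The next_state labelling rule is small enough to sketch directly. `next_state_actions` is a hypothetical helper written for illustration and is not taken from exporter.py.

```python
def next_state_actions(states):
    """Build action[t] = state[t+1], padding the final frame with itself."""
    n = len(states)
    actions, action_is_pad = [], []
    for i in range(n):
        idx = min(i + 1, n - 1)  # the last frame points at itself
        actions.append(states[idx])
        # Only the self-referencing final frame is marked as padding.
        action_is_pad.append(1.0 if i == n - 1 else 0.0)
    return actions, action_is_pad


# Three frames of a 6-dim state: 5 joints + 1 gripper.
states = [
    [0.0, 0.1, 0.2, 0.3, 0.4, 1.0],
    [0.1, 0.2, 0.3, 0.4, 0.5, 1.0],
    [0.2, 0.3, 0.4, 0.5, 0.6, 0.0],
]
actions, action_is_pad = next_state_actions(states)
print(action_is_pad)
```

The padding flag lets a training loss skip the final frame, whose target carries no new information.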
command_latch: action = last skill command target, forced to
horizon=1. New data always uses next_state.
$DATASET_ROOT/                          # = export_to_lerobot.py --output-root
├── meta/
│   ├── info.json                       # feature spec (observation.state / action / images)
│   ├── stats.json                      # per-dimension stats (used for normalization at train time)
│   ├── tasks.parquet                   # instruction text table
│   └── episodes/chunk-000/file-000.parquet
├── data/chunk-000/file-000.parquet     # per-frame state / action / action_is_pad
└── videos/observation.images.<cam>/chunk-000/file-000.mp4
--output-root is used as the final dataset
directory exactly as given — export_to_lerobot.py passes root=output_root straight into
LeRobotDataset.create and LeRobot does not append the repo-id. If --output-root is
omitted, the exporter defaults to DEFAULT_LEROBOT_ROOT/<repo-id>
(/home/zekun/vla_data/datasets/<repo-id>). exporter.py:73–77
Built by LeRobotDataset.create(robot_type="so101_follower");
frames added via dataset.add_frame(), finalized with dataset.save_episode().
Feature spec: _build_features() (exporter.py:246).
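When a validation set is carved out of this dataset, the split should be made over episodes rather than over frames. A minimal sketch follows, assuming each frame carries an episode index (the per-frame list below stands in for that column). `split_episodes` and its parameters are hypothetical names, not part of the exporter or of LeRobot.

```python
import random


def split_episodes(episode_indices, val_fraction=0.1, seed=0):
    """Split unique episode ids into (train_ids, val_ids); frames are never split."""
    ids = sorted(set(episode_indices))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return sorted(ids[n_val:]), sorted(ids[:n_val])


# Stand-in for the per-frame episode index: 50 episodes, 3 frames each.
frame_episode_index = [ep for ep in range(50) for _ in range(3)]
train_eps, val_eps = split_episodes(frame_episode_index)

# Frames follow their episode, so no episode contributes to both sets.
train_set = set(train_eps)
train_frames = [i for i, ep in enumerate(frame_episode_index) if ep in train_set]
val_frames = [i for i, ep in enumerate(frame_episode_index) if ep not in train_set]
print(len(train_eps), len(val_eps))
```

Splitting by frame would put near-identical neighbouring frames of one trajectory on both sides of the split, which is the leakage this avoids.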
To publish the dataset: huggingface-cli upload <user>/cube_pick <dataset_dir> --repo-type dataset.
Train/val splits must be done by episode, never by frame, to avoid leakage.
Fine-tune SmolVLA on the LeRobot dataset with lerobot-train. Base
lerobot/smolvla_base, LoRA r=32, inference chunk of 50 steps. The three run environments share the
same hyperparameters.
scripts/run_pipeline.sh scripts/train_cube_pick.sh hpc/train_cube_pick.job
lerobot-train \
  --policy.type=smolvla \
  --policy.pretrained_path=lerobot/smolvla_base \
  --policy.push_to_hub=false \
  --dataset.repo_id=local/cube_pick \
  --dataset.root=$DATASET_ROOT \
  --output_dir=$CKPT_DIR \
  --steps=100000 --batch_size=64 \
  --save_freq=10000 --eval_freq=0 \
  --policy.use_amp=true \
  --policy.chunk_size=50 \
  --peft.method_type=LORA --peft.r=32
The script first tries bs=64. If the process fails and the log contains the literal lowercase string "out of memory", it retries at bs=32, then bs=16. To keep the total number of training samples seen (steps × batch size) constant, steps scale up inversely with the batch size.
steps = BASE_STEPS * BASE_BS / bs   # BASE_STEPS=100000, BASE_BS=64
bs=64 → 100000 steps
bs=32 → 200000 steps (×2)
bs=16 → 400000 steps (×4)
# detect + retry: run_pipeline.sh:110–123 / train_cube_pick.sh:67–82
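The scaling rule can be checked in a few lines. This is a sketch of the arithmetic only; `steps_for` is a hypothetical helper name, and the shell scripts compute the value in their own way.

```python
BASE_STEPS = 100_000
BASE_BS = 64


def steps_for(bs: int) -> int:
    """Scale steps inversely with batch size so steps * bs stays constant."""
    return BASE_STEPS * BASE_BS // bs


schedule = [(bs, steps_for(bs)) for bs in (64, 32, 16)]
for bs, steps in schedule:
    print(f"bs={bs:>2} steps={steps:>6} samples={bs * steps}")
```

Every row processes the same 6,400,000 samples, so a downgraded run sees as much data as the baseline while taking more optimizer steps.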
OOM detection is a grep -q "out of memory" on the training log, not a trainer-internal auto batch reduction.
It matches any log line containing that exact lowercase phrase (so "CUDA out of memory" matches), but an error
phrased only as "OutOfMemoryError" would not.
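The matching behaviour described above can be mirrored in Python as a plain case-sensitive substring test. `log_has_oom` is a hypothetical helper for illustration; the scripts themselves use grep, and the sample log lines are invented.

```python
def log_has_oom(log_text: str) -> bool:
    """Case-sensitive literal match, analogous to `grep -q "out of memory"`."""
    return "out of memory" in log_text


samples = {
    "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB": None,
    "OutOfMemoryError raised by worker": None,
    "Out Of Memory": None,
}
for line in samples:
    samples[line] = log_has_oom(line)
    print(samples[line], "|", line)
```

Only the first line triggers a retry under this rule, which is worth keeping in mind if a framework changes how it words the error.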
On OOM, run_pipeline.sh and train_cube_pick.sh delete the half-baked checkpoint dir
before retrying; hpc/train_cube_pick.job retries without deleting OUTPUT_DIR.
| Environment | Entry point | Data source | Notes |
|---|---|---|---|
| Local GPU, one-click | scripts/run_pipeline.sh | local export output | collect→export→train→eval chained, with timing + logs |
| Local train-only | scripts/train_cube_pick.sh | local dataset dir | training only, with OOM downgrade |
| UCL Myriad HPC | qsub hpc/train_cube_pick.job | download from HuggingFace | SGE/qsub job (#$ directives); token in home, big caches on scratch |
nohup background process.
See EXP-001 / EXP-002 for the actual runs.
After training, eval_and_serve.py runs a closed-loop evaluation in simulation:
it loads the LoRA checkpoint, rolls out over several seeds, records videos, computes a success rate, and
(by default) starts a web viewer.
scripts/eval_and_serve.py
python scripts/eval_and_serve.py \
--model $CKPT_DIR/checkpoints/<step>/pretrained_model \
--instruction "pick up the red cube" \
--seeds 0 1 2 3 4 --max-steps 150 \
--out-dir $EVAL_DIR
eval_and_serve.py accepts
--model --instruction --seeds --max-steps --scene --lift-threshold --obj-rest-z --out-dir --port --no-serve --ffmpeg.
There is no --task flag; passing it (as the docstring examples and
run_pipeline.sh:149 currently do) makes argparse fail. See the note in the project summary.
| Setting | n_action_steps |
|---|---|
| Simulation | 1: re-infer every step on the freshest observation; smoothest trajectory |
| Real arm | 50: execute a full chunk, then re-infer (per SmolVLA paper) |
Set at load/deploy time via
load_finetuned_policy(), not a CLI flag of this script.
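The effect of n_action_steps on how often the policy is queried can be sketched with a simple action queue. This is a toy model under stated assumptions: the integer "actions" stand in for a predicted chunk, and `rollout_inference_count` is a hypothetical helper, not code from eval_and_serve.py or load_finetuned_policy().

```python
from collections import deque


def rollout_inference_count(max_steps: int, n_action_steps: int, chunk_size: int = 50) -> int:
    """Count policy inferences when n_action_steps actions are executed per chunk."""
    queue: deque[int] = deque()
    calls = 0
    for _ in range(max_steps):
        if not queue:
            calls += 1  # queue empty: run one inference
            chunk = list(range(chunk_size))  # stand-in for a predicted action chunk
            queue.extend(chunk[:n_action_steps])  # keep only the first n_action_steps
        queue.popleft()  # execute one action this control step
    return calls


sim_calls = rollout_inference_count(max_steps=150, n_action_steps=1)
real_calls = rollout_inference_count(max_steps=150, n_action_steps=50)
print(sim_calls, real_calls)
```

With --max-steps 150, the simulation setting re-infers on every step, while the real-arm setting re-infers only once per executed chunk.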
results.json: per-seed success/failure + success rate
ep0X_sN_SUCCESS/FAIL.mp4: per-episode video
combined.mp4: merged video
Web viewer: port 2310 by default; --no-serve records only
(the one-click pipeline uses --no-serve)
_reencode_h264() transcodes mp4v→H.264;
if ffmpeg is missing it falls back to keeping mp4v and prints a warning (no crash).
eval_and_serve.py:106
These schemas come straight from the pyarrow definitions in recorder.py, not from memory.
commands.parquet · recorder.py:393
| Field | Type |
|---|---|
| cmd_id | int32 |
| t_call_ns | int64 |
| t_sent_ns | int64 (reserved) |
| cmd_type | string |
| joint_target | list<float32> |
| gripper_target | float32 |
| duration_s | float32 |
cmd_type ∈ initial_state / move_joints / move_cartesian / gripper_open / gripper_close
observations.parquet · recorder.py:403
| Field | Type |
|---|---|
| sample_id | int32 |
| t_ns | int64 |
| t_sim_s | float64 (null for real-hw) |
| joint_pos_0 … 4 | float32 ×5 |
| gripper_pos | float32 |
6-dim state = 5 joints + 1 gripper, matching STATE_NAMES (exporter.py:33): shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper
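As an illustration of how one observations row maps onto the 6-dim state, the sketch below assembles the vector in STATE_NAMES order. The column names follow the schema table above; `row_to_state` is a hypothetical helper and the sample values are made up.

```python
STATE_NAMES = [
    "shoulder_pan", "shoulder_lift", "elbow_flex",
    "wrist_flex", "wrist_roll", "gripper",
]


def row_to_state(row: dict) -> list[float]:
    """Order the 5 joint columns followed by the gripper column."""
    joints = [row[f"joint_pos_{i}"] for i in range(5)]
    return joints + [row["gripper_pos"]]


# Made-up row shaped like observations.parquet.
row = {
    "sample_id": 0, "t_ns": 0, "t_sim_s": 0.0,
    "joint_pos_0": 0.0, "joint_pos_1": -0.5, "joint_pos_2": 0.8,
    "joint_pos_3": 0.3, "joint_pos_4": 0.0, "gripper_pos": 1.0,
}
state = row_to_state(row)
named = dict(zip(STATE_NAMES, state))
print(named)
```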
| Parameter | Value | Stage | Meaning |
|---|---|---|---|
| STATE_HZ | 30 | capture | state + image sampling rate (shared in sim_sync) |
| episodes | 50 | capture | 5 positions cycled in blocks of RUNS_PER_POSITION (=2) → 10 per XY |
| --action-mode | next_state | export | action[t] = observed_state[t+1] |
| --image-tolerance-ms | 50.0 | export | image-vs-state gap above this is marked stale (effective in async/real) |
| fps | 30 | export | output timeline frequency |
| chunk_size | 50 | train/infer | predict 50 future action steps per inference |
| peft.r | 32 | train | LoRA rank (EXP-002 used 16) |
| batch_size | 64→32→16 | train | OOM downgrade; steps scale inversely |
| steps | 100000 | train | baseline at bs=64; scaled up when bs drops |
| n_action_steps | 1 / 50 | eval/deploy | sim=1, real=50; set in load_finetuned_policy() |
| --port | 2310 | eval | web viewer default port |
A text version of the full flow with commands lives in the
VLA project repo (PIPELINE.md); experiment records in EXPERIMENT_LOG.md.