VLA Robot Research · Pipeline Docs

Agentic DT

Vision-Language-Action pipeline — from data generation to model training.

This page documents the full cube_pick pipeline (simulation capture → export to a LeRobot dataset → format alignment / conversion → SmolVLA training → closed-loop evaluation), with every explanation tied to its exact source location.

vector-os-nano/ (nested repo · SDK) VLA/ (outer research repo)

Pipeline Overview

Four stages form one pipeline. scripts/run_pipeline.sh (outer repo) chains them end-to-end: collect 50 episodes → export → train → evaluate.

flowchart TB
  subgraph S1["① Data Generation (sim)"]
    direction TB
    A1["collect_cube_pick.py<br/>MuJoCo + PickSkill"]
    A2["RawEpisodeWriter<br/>ArmTap / GripperTap"]
    A3["sim_sync hook @30Hz<br/>render 3 cams + read joints"]
    A1 --> A2 --> A3
  end
  subgraph RAW["raw_episodes/"]
    R1["commands.parquet<br/>observations.parquet<br/>videos/*.mp4 + *_meta.parquet<br/>meta.json"]
  end
  subgraph S2["② Export / Convert"]
    B1["export_to_lerobot.py"]
    B2["LeRobotExporter<br/>30Hz timeline alignment"]
    B3["action[t]=state[t+1]<br/>single-step (6,) + action_is_pad"]
    B1 --> B2 --> B3
  end
  subgraph DS["LeRobot v3 dataset"]
    D1["meta/ · data/chunk-000<br/>videos/observation.images.*"]
  end
  subgraph S3["③ Training"]
    C1["lerobot-train<br/>SmolVLA base + LoRA r=32"]
    C2["chunk_size=50<br/>bs=64 (OOM→32→16)"]
    C1 --> C2
  end
  subgraph S4["④ Evaluation"]
    E1["eval_and_serve.py<br/>closed-loop rollout"]
  end
  S1 --> RAW --> S2 --> DS --> S3 --> CKPT["LoRA checkpoint<br/>pretrained_model/"] --> S4
  style S1 fill:#ddf4ff,stroke:#0969da,color:#1f2328
  style S2 fill:#f3eefc,stroke:#8250df,color:#1f2328
  style S3 fill:#dafbe1,stroke:#1a7f37,color:#1f2328
  style S4 fill:#fff8c5,stroke:#9a6700,color:#1f2328
  style RAW fill:#f6f8fa,stroke:#d0d7de,color:#1f2328
  style DS fill:#f6f8fa,stroke:#d0d7de,color:#1f2328
⚠ The flowchart is rendered by Mermaid and needs a CDN script (network access). Offline, follow the per-stage text below in order: collect → raw_episodes → export → LeRobot dataset → train → checkpoint → evaluate.

Data Generation — Simulation Capture

Inside MuJoCo, PickSkill performs the grasp automatically while every joint state, gripper state, camera frame, and the high-level instruction are recorded frame by frame. Successful episodes are kept; failed ones are deleted outright.

vector-os-nano/scripts/collection/tasks/sim/collect_cube_pick.py vector-os-nano/vector_os_nano/data/recorder.py

1.1 One episode, end to end

Orchestration — _run_episode_skill()

  • _reset_scene() places the cube at the scheduled XY, arm returns home (not recorded)
  • writer.start_episode(instruction="pick up the red cube")
  • PickSkill().execute({"object_label":"red cube","mode":"hold"})
  • Success test: result.success and tap_gripper.is_holding()
  • tap_gripper.open() release (still inside the recording)

Both conditions must hold — collect_cube_pick.py:245.

Keep / discard — main() loop

  • ep_dir = writer.stop_episode() writes parquet + meta.json
  • success → writer.finalize_episode(ep_dir) encodes MP4 and keeps it
  • failure → shutil.rmtree(ep_dir) + drop its pending-encode entry
  • end → writer.finalize_all() encodes any remaining MP4

Keep/discard logic — collect_cube_pick.py:464–483.

1.2 Sampling positions & diversity

The default run is --episodes=50. Five fixed XY positions are cycled in blocks of RUNS_PER_POSITION=2 (the position index advances every 2 episodes), so 50 episodes yield 10 per XY. The file also contains a deprecated hand-written trajectory _run_episode() (with XY/timing noise) kept only for reference — not used for new captures.

CUBE_POSITIONS_XY = [
    (0.18,  0.00),   # close center
    (0.22,  0.12),   # center-left
    (0.22, -0.12),   # center-right
    (0.28,  0.06),   # far-left
    (0.28, -0.06),   # far-right
]
RUNS_PER_POSITION = 2          # collect_cube_pick.py:70–77
# pos_idx = ep_idx // RUNS_PER_POSITION  → block of 2 episodes per XY
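
The block schedule can be checked with a short sketch. The modulo wrap-around and the helper name position_for_episode are assumptions for illustration; the comment above only shows the integer division, but 50 episodes over 5 positions can only yield 10 per XY if the index wraps.

```python
from collections import Counter

CUBE_POSITIONS_XY = [
    (0.18, 0.00),
    (0.22, 0.12),
    (0.22, -0.12),
    (0.28, 0.06),
    (0.28, -0.06),
]
RUNS_PER_POSITION = 2


def position_for_episode(ep_idx: int) -> tuple[float, float]:
    """Hypothetical helper: map an episode index to its cube XY."""
    # Assumption: the block index wraps around the position list.
    pos_idx = (ep_idx // RUNS_PER_POSITION) % len(CUBE_POSITIONS_XY)
    return CUBE_POSITIONS_XY[pos_idx]


# Count how often each XY is visited in a default 50-episode run.
counts = Counter(position_for_episode(i) for i in range(50))
for xy, n in counts.items():
    print(xy, n)
```

Under this assumption episodes 0 and 1 share the first XY, episodes 2 and 3 the second, and the cycle restarts at episode 10.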

1.3 sim_sync — atomic state/image alignment

The hard part of capture is keeping images and joint state aligned. Simulation uses mode="sim_sync": no background threads — a hook fires on each MuJoCo step and, from the same mjData, renders the cameras and reads the joints together, so the image-to-state time delta is exactly 0.

Command taps (transparent proxies)

ArmTap / GripperTap wrap the real arm/gripper, pass every call straight through, and only append a log entry for move_joints / move_cartesian / open / close. They change no timing or behavior.

recorder.py:55 (ArmTap) · :146 (GripperTap)
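
A minimal sketch of the transparent-proxy idea, not the code in recorder.py. LoggingTap, FakeArm, and the log-entry fields are hypothetical; the real ArmTap may log different fields or log after the call rather than before it.

```python
import time

# Only these calls get a log entry; everything else passes through silently.
LOGGED_METHODS = {"move_joints", "move_cartesian"}


class LoggingTap:
    """Hypothetical stand-in for ArmTap: forwards every call, logs a few."""

    def __init__(self, target, log):
        self._target = target
        self._log = log

    def __getattr__(self, name):
        # Called only for attributes not found on the tap itself.
        attr = getattr(self._target, name)
        if name not in LOGGED_METHODS or not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            self._log.append(
                {"cmd_type": name, "t_call_ns": time.time_ns(), "args": args}
            )
            return attr(*args, **kwargs)  # unchanged call, unchanged return

        return wrapper


class FakeArm:
    """Toy arm used only to exercise the proxy."""

    def __init__(self):
        self.joints = [0.0] * 5

    def move_joints(self, targets):
        self.joints = list(targets)
        return True

    def get_joints(self):
        return list(self.joints)


log = []
arm = LoggingTap(FakeArm(), log)
arm.move_joints([0.1, 0.2, 0.3, 0.4, 0.5])  # logged
arm.get_joints()                            # passed through, not logged
print(len(log), log[0]["cmd_type"])
```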

Synthetic timestamps — _build_sim_hook()

Headless sim runs faster than real time, so wall-clock timestamps cluster. The hook instead emits a synthetic uniform timestamp t_ns = episode_start + sample_id × period, shared by both state and image frames (delta = 0).

recorder.py:490–548
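
The timestamp formula above, as a small sketch. Rounding the period to whole nanoseconds and the name synthetic_t_ns are assumptions; the recorder may derive the period differently.

```python
STATE_HZ = 30
# Assumption: the period is rounded to an integer number of nanoseconds.
PERIOD_NS = round(1e9 / STATE_HZ)


def synthetic_t_ns(episode_start_ns: int, sample_id: int,
                   period_ns: int = PERIOD_NS) -> int:
    """t_ns = episode_start + sample_id * period, independent of wall clock."""
    return episode_start_ns + sample_id * period_ns


episode_start_ns = 1_000_000_000
stamps = [synthetic_t_ns(episode_start_ns, i) for i in range(4)]
deltas = [b - a for a, b in zip(stamps, stamps[1:])]
print(stamps)
print(deltas)  # every gap is one period, however fast the sim ran
```

Because state and image rows for the same sample_id receive the same value, their delta is 0 by construction.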

Rules: do not open the dashboard during capture (EGL render contention corrupts data); once recording starts, drive motion through PickSkill + tap_arm/tap_gripper, not hand-written trajectories. (The pre-episode reset/home calls arm.move_joints() on the raw arm directly, outside the recording — that is intentional.)

1.4 Output: raw_episodes/ layout

raw_episodes/
└── episode_0000/
    ├── meta.json              # instruction, spec_version, t_start_ns, sampling_mode, cube_initial_pos …
    ├── commands.parquet       # every move_joints / gripper command + timestamp
    ├── observations.parquet   # 30Hz: 5 joint angles + 1 gripper + t_ns + t_sim_s
    └── videos/
        ├── camera1.mp4        # overhead   (camera1 = OBS_IMAGE_1)
        ├── camera1_meta.parquet
        ├── camera2.mp4        # wrist_cam  (camera2 = OBS_IMAGE_2)
        ├── camera2_meta.parquet
        ├── camera3.mp4        # front      (camera3 = OBS_IMAGE_3)
        └── camera3_meta.parquet

Camera-name mapping: CAMERA_MAP (collect_cube_pick.py:93). MP4 encoded via imageio-ffmpeg (libx264, crf 23).

Export & Convert — raw → LeRobot v3

Convert raw_episodes/ into a standard LeRobot Dataset v3. This step is task-agnostic; its core job is to resample each episode onto a strict 30Hz timeline and to generate the action labels used for training.

vector-os-nano/scripts/collection/export_to_lerobot.py vector-os-nano/vector_os_nano/data/exporter.py

2.1 30Hz timeline resampling

A uniform set of timestamps is generated across the observation span; at each timestamp the nearest observation and image are taken. The exporter handles arbitrary raw episode timing in general; note that current sim_sync cube_pick data is already uniformly stamped at 30Hz, so the resampling is close to a pass-through for this data.

t_start, t_end = obs_ts[0], obs_ts[-1]
period_ns = 1e9 / fps                       # fps = 30
n_frames  = max(1, int((t_end - t_start) / period_ns))   # int() so range() accepts it
frame_times = [t_start + i*period_ns for i in range(n_frames)]
# per frame_time: nearest observation → observation.state (6,)
#                 nearest image       → observation.images.*
#                                       exporter.py:160–217
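
The nearest-observation lookup can be sketched with the standard library. This is an illustration under stated assumptions, not the exporter's code: nearest_index and resample are hypothetical names, and ties are broken toward the earlier sample.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts closest to t (ties pick the earlier)."""
    pos = bisect_left(sorted_ts, t)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return pos - 1 if t - before <= after - t else pos


def resample(obs_ts, fps=30):
    """Return, per output frame, the index of the nearest raw observation."""
    period_ns = 1e9 / fps
    n_frames = max(1, int((obs_ts[-1] - obs_ts[0]) / period_ns))
    frame_times = [obs_ts[0] + i * period_ns for i in range(n_frames)]
    return [nearest_index(obs_ts, t) for t in frame_times]


# Irregularly spaced raw timestamps (ns), as async capture might produce.
obs_ts = [0, 10_000_000, 40_000_000, 70_000_000, 110_000_000]
picked = resample(obs_ts)
print(picked)
```

For uniformly stamped sim_sync data the picked indices advance one per frame, which is why the resampling is close to a pass-through there.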

2.2 Two timestamp domains

| Timestamp | From | Meaning | Used for |
|---|---|---|---|
| t_ns | observations.parquet | synthetic uniform timestamp (sim_sync: shared by state & images) | building the 30Hz timeline; nearest-state lookup |
| t_host_ns | cameraX_meta.parquet | per-frame image timestamp | nearest-image alignment + staleness check |
Image alignment: for each 30Hz timestamp, find the nearest frame by t_host_ns; if the gap > --image-tolerance-ms (default 50ms) the frame is marked stale and counted in the report. exporter.py:96, 423–446
Caveat on t_host_ns: in sim_sync the recorder writes the same synthetic t_ns into the image meta, so image/state alignment is exact by construction. Only in async/real-hardware capture is t_host_ns a true host clock value, which is when the 50ms staleness check actually does work. recorder.py:541–543
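
A sketch of the staleness rule described above. stale_flags is a hypothetical helper and the linear scan is chosen for brevity; only the 50ms default and the "gap greater than tolerance" condition come from the text.

```python
def stale_flags(frame_times_ns, image_ts_ns, tolerance_ms=50.0):
    """Flag each output frame whose nearest image is further than the tolerance."""
    tol_ns = tolerance_ms * 1e6
    flags = []
    for t in frame_times_ns:
        gap = min(abs(t - ts) for ts in image_ts_ns)  # nearest image by time
        flags.append(gap > tol_ns)
    return flags


frame_times = [0, 33_333_333, 66_666_666, 100_000_000]
# Async-style image timestamps with a dropout after the second frame.
image_ts = [0, 33_000_000, 200_000_000]
flags = stale_flags(frame_times, image_ts)
print(flags, "stale frames:", sum(flags))
```

With sim_sync data the image timestamps equal the state timestamps, so every gap is 0 and no frame is flagged.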

2.3 Action label — next_state mode

Default --action-mode next_state: the next observed state is used as the current action target.

Semantics

  • action[t] = observed_state[t+1]
  • absolute joint positions (not deltas, not control commands)
  • stored single-step, shape (6,) = 5 joints + 1 gripper
  • do not pre-chunk: at train time the dataloader stacks chunks via delta_timestamps

Terminal padding

  • idx = min(i+1, n-1) — the last frame's action points at itself
  • action_is_pad = 1.0 marks that frame as padding (0.0 otherwise)
  • padded frames are counted in report.padded_frames

exporter.py:185–193
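
The labeling rule and terminal padding above, as a minimal sketch. next_state_actions is a hypothetical name and plain lists stand in for the exporter's arrays.

```python
def next_state_actions(states):
    """action[t] = state[t+1]; the last frame repeats itself and is marked as pad."""
    n = len(states)
    actions, is_pad = [], []
    for i in range(n):
        idx = min(i + 1, n - 1)          # clamp at the final frame
        actions.append(states[idx])
        is_pad.append(1.0 if i == n - 1 else 0.0)
    return actions, is_pad


# Three frames of a 6-dim state (5 joints + 1 gripper), toy values.
states = [[0.0] * 6, [0.1] * 6, [0.2] * 6]
actions, is_pad = next_state_actions(states)
print(actions[-1] == states[-1], is_pad)
```

Each action stays single-step with shape (6,); chunking is left to the dataloader, as noted above.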

Legacy mode command_latch: action = last skill command target, forced to horizon=1. New data always uses next_state.

2.4 Output: LeRobot v3 dataset layout

$DATASET_ROOT/                            # = export_to_lerobot.py --output-root
├── meta/
│   ├── info.json            # feature spec (observation.state / action / images)
│   ├── stats.json           # per-dimension stats (used for normalization at train time)
│   ├── tasks.parquet        # instruction text table
│   └── episodes/chunk-000/file-000.parquet
├── data/chunk-000/file-000.parquet      # per-frame state / action / action_is_pad
└── videos/observation.images.<cam>/chunk-000/file-000.mp4
Where does the dataset go? --output-root is used as the final dataset directory exactly as given — export_to_lerobot.py passes root=output_root straight into LeRobotDataset.create and LeRobot does not append the repo-id. If --output-root is omitted, the exporter defaults to DEFAULT_LEROBOT_ROOT/<repo-id> (/home/zekun/vla_data/datasets/<repo-id>). exporter.py:73–77

Built by LeRobotDataset.create(robot_type="so101_follower"); frames added via dataset.add_frame(), finalized with dataset.save_episode(). Feature spec: _build_features() (exporter.py:246).

Upload to HuggingFace after export (needed for HPC training): huggingface-cli upload <user>/cube_pick <dataset_dir> --repo-type dataset. Train/val splits must be done by episode, never by frame, to avoid leakage.
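
An episode-level split can be sketched as follows. split_episodes, the 10% validation fraction, and the fixed seed are assumptions for illustration; the point is only that all frames of an episode land on the same side.

```python
import random


def split_episodes(episode_ids, val_fraction=0.1, seed=0):
    """Split unique episode ids into (train, val) sets; frames follow their episode."""
    ids = sorted(set(episode_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return set(ids[n_val:]), set(ids[:n_val])


# Toy per-frame episode index: 50 episodes, 3 frames each.
frame_episode = [ep for ep in range(50) for _ in range(3)]
train_eps, val_eps = split_episodes(frame_episode)
train_frames = [i for i, ep in enumerate(frame_episode) if ep in train_eps]
val_frames = [i for i, ep in enumerate(frame_episode) if ep in val_eps]
print(len(train_eps), len(val_eps), len(train_frames), len(val_frames))
```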

Training — SmolVLA + LoRA

Fine-tune SmolVLA on the LeRobot dataset with lerobot-train. Base lerobot/smolvla_base, LoRA r=32, inference chunk of 50 steps. The three run environments share the same hyperparameters.

scripts/run_pipeline.sh scripts/train_cube_pick.sh hpc/train_cube_pick.job

3.1 Core training command

lerobot-train \
  --policy.type=smolvla \
  --policy.pretrained_path=lerobot/smolvla_base \
  --policy.push_to_hub=false \
  --dataset.repo_id=local/cube_pick \
  --dataset.root=$DATASET_ROOT \
  --output_dir=$CKPT_DIR \
  --steps=100000  --batch_size=64 \
  --save_freq=10000  --eval_freq=0 \
  --policy.use_amp=true \
  --policy.chunk_size=50 \
  --peft.method_type=LORA  --peft.r=32

3.2 OOM auto-downgrade (shell retry, not a trainer feature)

The script first tries bs=64. If the process fails and the log matches the literal lowercase string "out of memory", it retries at bs=32, then bs=16. To keep the total number of training samples seen (steps × batch size) constant, steps scale up in inverse proportion to the batch size; the number of gradient updates therefore doubles with each halving.

steps = BASE_STEPS * BASE_BS / bs        # BASE_STEPS=100000, BASE_BS=64
  bs=64 → 100000 steps
  bs=32 → 200000 steps   (×2)
  bs=16 → 400000 steps   (×4)
# detect + retry: run_pipeline.sh:110–123 / train_cube_pick.sh:67–82
This is bash-level try/retry, triggered by a case-sensitive grep -q "out of memory" on the training log — not a trainer-internal auto batch reduction. It matches any log line containing that exact lowercase phrase (so "CUDA out of memory" matches), but an error phrased only as "OutOfMemoryError" would not. On OOM, run_pipeline.sh and train_cube_pick.sh delete the half-baked checkpoint dir before retrying; hpc/train_cube_pick.job retries without deleting OUTPUT_DIR.
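
The scaling rule and the match condition can be mirrored in Python for a quick sanity check. This is a sketch of the logic only, not the shell scripts: steps_for and log_indicates_oom are hypothetical names, and a substring test stands in for grep -q.

```python
BASE_STEPS = 100_000
BASE_BS = 64


def steps_for(bs: int) -> int:
    """Keep steps * batch_size constant as the batch size drops."""
    return BASE_STEPS * BASE_BS // bs


def log_indicates_oom(log_text: str) -> bool:
    """Case-sensitive substring match, like grep -q "out of memory"."""
    return "out of memory" in log_text


for bs in (64, 32, 16):
    print(bs, steps_for(bs), steps_for(bs) * bs)

print(log_indicates_oom("RuntimeError: CUDA out of memory."))  # matches
print(log_indicates_oom("OutOfMemoryError"))                   # does not match
```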

3.3 Three run environments

| Environment | Entry point | Data source | Notes |
|---|---|---|---|
| Local GPU, one-click | scripts/run_pipeline.sh | local export output | collect→export→train→eval chained, with timing + logs |
| Local train-only | scripts/train_cube_pick.sh | local dataset dir | training only, with OOM downgrade |
| UCL Myriad HPC | qsub hpc/train_cube_pick.job | download from HuggingFace | SGE/qsub job (#$ directives); token in home, big caches on scratch |
From the experiment log (EXPERIMENT_LOG.md): UCL Myriad V100 was ~47s/step (too slow), so training moved to a DGX GB10 (Grace Blackwell, 128GB unified memory), run as a nohup background process. See EXP-001 / EXP-002 for the actual runs.

Evaluation — closed-loop rollout

After training, eval_and_serve.py runs a closed-loop evaluation in simulation: it loads the LoRA checkpoint, rolls out over several seeds, records videos, computes a success rate, and (by default) starts a web viewer.

scripts/eval_and_serve.py

python scripts/eval_and_serve.py \
    --model $CKPT_DIR/checkpoints/<step>/pretrained_model \
    --instruction "pick up the red cube" \
    --seeds 0 1 2 3 4 --max-steps 150 \
    --out-dir $EVAL_DIR
Argument note: eval_and_serve.py accepts --model --instruction --seeds --max-steps --scene --lift-threshold --obj-rest-z --out-dir --port --no-serve --ffmpeg. There is no --task flag — passing it (as the docstring examples and run_pipeline.sh:149 currently do) makes argparse fail. See the note in the project summary.

sim vs real inference

| Setting | n_action_steps |
|---|---|
| Simulation | 1: re-infer every step on the freshest observation; smoothest trajectory |
| Real arm | 50: execute a full chunk, then re-infer (per SmolVLA paper) |

Set at load/deploy time via load_finetuned_policy() — not a CLI flag of this script.
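
To see what n_action_steps changes, here is a toy model of chunked execution. It is not LeRobot's implementation: count_inferences is hypothetical, the "policy" just emits placeholder actions, and only the chunk size of 50 and the 150-step horizon come from this page.

```python
CHUNK_SIZE = 50


def count_inferences(max_steps: int, n_action_steps: int) -> int:
    """Count policy calls when n_action_steps actions are executed per inference."""
    queue = []
    inferences = 0
    for step in range(max_steps):
        if not queue:
            inferences += 1
            # Hypothetical stand-in for a predicted chunk of future actions.
            chunk = [f"a{step + k}" for k in range(CHUNK_SIZE)]
            queue = chunk[:n_action_steps]  # the rest of the chunk is discarded
        queue.pop(0)  # "execute" one action
    return inferences


print(count_inferences(150, 1))   # sim setting: one inference per step
print(count_inferences(150, 50))  # real-arm setting: one inference per chunk
```

The trade-off is reactivity against inference cost: the simulation setting consults the newest observation at every step, while the real-arm setting commits to a whole chunk before looking again.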

Artifacts

  • results.json — per-seed success/failure + success rate
  • ep0X_sN_SUCCESS/FAIL.mp4 — per-episode video
  • combined.mp4 — merged video
  • web viewer on port 2310 by default; --no-serve records only (the one-click pipeline uses --no-serve)
Browser playback needs H.264. _reencode_h264() transcodes mp4v→H.264; if ffmpeg is missing it falls back to keeping mp4v and prints a warning (no crash). eval_and_serve.py:106

Data Structures & Schemas

These schemas come straight from the pyarrow definitions in recorder.py — not from memory.

commands.parquet · recorder.py:393

| Field | Type |
|---|---|
| cmd_id | int32 |
| t_call_ns | int64 |
| t_sent_ns | int64 (reserved) |
| cmd_type | string |
| joint_target | list<float32> |
| gripper_target | float32 |
| duration_s | float32 |

cmd_type ∈ initial_state / move_joints / move_cartesian / gripper_open / gripper_close

observations.parquet · recorder.py:403

| Field | Type |
|---|---|
| sample_id | int32 |
| t_ns | int64 |
| t_sim_s | float64 (null for real-hw) |
| joint_pos_0 … 4 | float32 ×5 |
| gripper_pos | float32 |

6-dim state = 5 joints + 1 gripper, matching STATE_NAMES (exporter.py:33): shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper
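
How one observations.parquet row maps onto the 6-dim state, sketched with a plain dict in place of a parquet row. row_to_state and the sample values are hypothetical; the field names and the STATE_NAMES order are the ones listed above.

```python
STATE_NAMES = [
    "shoulder_pan", "shoulder_lift", "elbow_flex",
    "wrist_flex", "wrist_roll", "gripper",
]


def row_to_state(row: dict) -> list:
    """Assemble [joint_pos_0..4, gripper_pos] in STATE_NAMES order."""
    return [row[f"joint_pos_{i}"] for i in range(5)] + [row["gripper_pos"]]


# Toy row with made-up values.
row = {
    "sample_id": 0, "t_ns": 1_000_000_000, "t_sim_s": 0.0,
    "joint_pos_0": 0.1, "joint_pos_1": 0.2, "joint_pos_2": 0.3,
    "joint_pos_3": 0.4, "joint_pos_4": 0.5, "gripper_pos": 1.0,
}
state = row_to_state(row)
named = dict(zip(STATE_NAMES, state))
print(named)
```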

Key Parameters

| Parameter | Value | Stage | Meaning |
|---|---|---|---|
| STATE_HZ | 30 | capture | state + image sampling rate (shared in sim_sync) |
| episodes | 50 | capture | 5 positions cycled in blocks of RUNS_PER_POSITION (=2) → 10 per XY |
| --action-mode | next_state | export | action[t] = observed_state[t+1] |
| --image-tolerance-ms | 50.0 | export | image-vs-state gap above this is marked stale (effective in async/real) |
| fps | 30 | export | output timeline frequency |
| chunk_size | 50 | train/infer | predict 50 future action steps per inference |
| peft.r | 32 | train | LoRA rank (EXP-002 used 16) |
| batch_size | 64→32→16 | train | OOM downgrade; steps scale inversely |
| steps | 100000 | train | baseline at bs=64; scaled up when bs drops |
| n_action_steps | 1 / 50 | eval/deploy | sim=1, real=50; set in load_finetuned_policy() |
| --port | 2310 | eval | web viewer default port |

A text version of the full flow with commands lives in the VLA project repo (PIPELINE.md); experiment records in EXPERIMENT_LOG.md.