Agentic DT

⤳Pipeline Overview

Four stages form one pipeline. scripts/run_pipeline.sh (outer repo) chains them end-to-end: collect 50 episodes → export → train → evaluate.

flowchart TB
  subgraph S1["① Data Generation (sim)"]
    direction TB
    A1["collect_cube_pick.py<br/>MuJoCo + PickSkill"]
    A2["RawEpisodeWriter<br/>ArmTap / GripperTap"]
    A3["sim_sync hook @30Hz<br/>render 3 cams + read joints"]
    A1 --> A2 --> A3
  end
  subgraph RAW["raw_episodes/"]
    R1["commands.parquet<br/>observations.parquet<br/>videos/*.mp4 + *_meta.parquet<br/>meta.json"]
  end
  subgraph S2["② Export / Convert"]
    B1["export_to_lerobot.py"]
    B2["LeRobotExporter<br/>30Hz timeline alignment"]
    B3["action[t]=state[t+1]<br/>single-step (6,) + action_is_pad"]
    B1 --> B2 --> B3
  end
  subgraph DS["LeRobot v3 dataset"]
    D1["meta/ · data/chunk-000<br/>videos/observation.images.*"]
  end
  subgraph S3["③ Training"]
    C1["lerobot-train<br/>SmolVLA base + LoRA r=32"]
    C2["chunk_size=50<br/>bs=64 (OOM→32→16)"]
    C1 --> C2
  end
  subgraph S4["④ Evaluation"]
    E1["eval_and_serve.py<br/>closed-loop rollout"]
  end
  S1 --> RAW --> S2 --> DS --> S3 --> CKPT["LoRA checkpoint<br/>pretrained_model/"] --> S4
  style S1 fill:#ddf4ff,stroke:#0969da,color:#1f2328
  style S2 fill:#f3eefc,stroke:#8250df,color:#1f2328
  style S3 fill:#dafbe1,stroke:#1a7f37,color:#1f2328
  style S4 fill:#fff8c5,stroke:#9a6700,color:#1f2328
  style RAW fill:#f6f8fa,stroke:#d0d7de,color:#1f2328
  style DS fill:#f6f8fa,stroke:#d0d7de,color:#1f2328

⚠ The flowchart is rendered by Mermaid and needs a CDN script (network access). Offline, follow the per-stage text below in order: collect → raw_episodes → export → LeRobot dataset → train → checkpoint → evaluate.


①Data Generation — Simulation Capture

Inside MuJoCo, PickSkill performs the grasp automatically while every joint state, gripper state, camera frame, and the high-level instruction are recorded frame by frame. Successful episodes are kept; failed ones are deleted outright.

vector-os-nano/scripts/collection/tasks/sim/collect_cube_pick.py · vector-os-nano/vector_os_nano/data/recorder.py

1.1 One episode, end to end

Orchestration — `_run_episode_skill()`

① _reset_scene() places the cube at the scheduled XY; the arm returns home (not recorded)
② writer.start_episode(instruction="pick up the red cube")
③ PickSkill().execute({"object_label":"red cube","mode":"hold"})
④ Success test: result.success and tap_gripper.is_holding()
⑤ tap_gripper.open() release (still inside the recording)

Both conditions must hold — collect_cube_pick.py:245.

Keep / discard — `main()` loop

ep_dir = writer.stop_episode() writes parquet + meta.json
success → writer.finalize_episode(ep_dir) encodes MP4 and keeps it
failure → shutil.rmtree(ep_dir) + drop its pending-encode entry
end → writer.finalize_all() encodes any remaining MP4

Keep/discard logic — collect_cube_pick.py:464–483.
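The keep/discard flow above can be sketched as follows. This is a minimal illustration, not the code in collect_cube_pick.py: `FakeWriter` is a hypothetical stand-in for RawEpisodeWriter that only mimics the method names quoted above, and MP4 encoding is reduced to writing a placeholder file.

```python
import shutil
import tempfile
from pathlib import Path


class FakeWriter:
    """Hypothetical stand-in for RawEpisodeWriter (method names taken from the text)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.pending: list[Path] = []  # episodes whose video is not encoded yet
        self._count = 0

    def stop_episode(self) -> Path:
        ep_dir = self.root / f"episode_{self._count:04d}"
        ep_dir.mkdir(parents=True)
        (ep_dir / "meta.json").write_text("{}")  # placeholder for parquet + meta.json
        self.pending.append(ep_dir)
        self._count += 1
        return ep_dir

    def finalize_episode(self, ep_dir: Path) -> None:
        (ep_dir / "encoded.flag").write_text("ok")  # placeholder for MP4 encoding
        self.pending.remove(ep_dir)

    def finalize_all(self) -> None:
        for ep_dir in list(self.pending):
            self.finalize_episode(ep_dir)


def collect(writer: FakeWriter, outcomes: list[bool]) -> list[Path]:
    """Keep successful episodes, delete failed ones, then flush remaining encodes."""
    kept = []
    for success in outcomes:
        ep_dir = writer.stop_episode()
        if success:
            writer.finalize_episode(ep_dir)
            kept.append(ep_dir)
        else:
            shutil.rmtree(ep_dir)          # failed episode is removed from disk
            writer.pending.remove(ep_dir)  # and dropped from the pending-encode list
    writer.finalize_all()
    return kept


root = Path(tempfile.mkdtemp())
kept_dirs = collect(FakeWriter(root), [True, False, True])
print([p.name for p in kept_dirs])  # ['episode_0000', 'episode_0002']
```

Deleting the failed directory and its pending-encode entry together is what keeps finalize_all() from trying to encode a directory that no longer exists.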

1.2 Sampling positions & diversity

The default run is --episodes=50. Five fixed XY positions are cycled in blocks of RUNS_PER_POSITION=2 (the position index advances every 2 episodes), so 50 episodes yield 10 per XY. The file also contains a deprecated hand-written trajectory _run_episode() (with XY/timing noise) kept only for reference — not used for new captures.

CUBE_POSITIONS_XY = [
    (0.18,  0.00),   # close center
    (0.22,  0.12),   # center-left
    (0.22, -0.12),   # center-right
    (0.28,  0.06),   # far-left
    (0.28, -0.06),   # far-right
]
RUNS_PER_POSITION = 2          # collect_cube_pick.py:70–77
# pos_idx = ep_idx // RUNS_PER_POSITION  → block of 2 episodes per XY
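A quick check of the schedule arithmetic. The sketch assumes the block index wraps around the position list with a modulo, which is what "cycled" implies for 50 episodes over 5 positions; the helper name `position_for_episode` is hypothetical.

```python
from collections import Counter

CUBE_POSITIONS_XY = [
    (0.18, 0.00),   # close center
    (0.22, 0.12),   # center-left
    (0.22, -0.12),  # center-right
    (0.28, 0.06),   # far-left
    (0.28, -0.06),  # far-right
]
RUNS_PER_POSITION = 2


def position_for_episode(ep_idx: int) -> tuple[float, float]:
    """Map an episode index to its cube XY (hypothetical helper)."""
    block = ep_idx // RUNS_PER_POSITION  # advances every 2 episodes
    # Assumption: the block index wraps around the list of positions.
    return CUBE_POSITIONS_XY[block % len(CUBE_POSITIONS_XY)]


schedule = [position_for_episode(i) for i in range(50)]
counts = Counter(schedule)
print(sorted(counts.values()))  # [10, 10, 10, 10, 10]
```

Episodes 0 and 1 share the first position, episodes 2 and 3 the second, and the cycle restarts at episode 10.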

1.3 sim_sync — atomic state/image alignment

The hard part of capture is keeping images and joint state aligned. Simulation uses mode="sim_sync": no background threads — a hook fires on each MuJoCo step and, from the same mjData, renders the cameras and reads the joints together, so the image-to-state time delta is exactly 0.

Command taps (transparent proxies)

ArmTap / GripperTap wrap the real arm/gripper, pass every call straight through, and only append a log entry for move_joints / move_cartesian / open / close. They change no timing or behavior.

recorder.py:55 (ArmTap) · :146 (GripperTap)
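The transparent-proxy idea can be illustrated with a small generic wrapper. This is a sketch under assumptions, not the ArmTap implementation: `LoggingTap` and `DummyArm` are hypothetical names, and the log entry only borrows the `cmd_type` and `t_call_ns` field names from the commands schema.

```python
import time
from typing import Any

LOGGED_CALLS = {"move_joints", "move_cartesian"}


class LoggingTap:
    """Forward every attribute to the wrapped device; log selected method calls."""

    def __init__(self, target: Any, logged: set[str]) -> None:
        self._target = target
        self._logged = logged
        self.log: list[dict[str, Any]] = []

    def __getattr__(self, name: str) -> Any:
        # Only reached for names not found on the tap itself.
        attr = getattr(self._target, name)
        if name not in self._logged or not callable(attr):
            return attr  # pass through untouched

        def wrapper(*args: Any, **kwargs: Any) -> Any:
            self.log.append(
                {"cmd_type": name, "args": args, "t_call_ns": time.monotonic_ns()}
            )
            return attr(*args, **kwargs)  # same call, same return value

        return wrapper


class DummyArm:
    """Hypothetical arm used only for this example."""

    def move_joints(self, q: list[float]) -> bool:
        return True

    def get_joints(self) -> list[float]:
        return [0.0] * 5


tap_arm = LoggingTap(DummyArm(), LOGGED_CALLS)
ok = tap_arm.move_joints([0.1, 0.2, 0.3, 0.4, 0.5])
state = tap_arm.get_joints()  # not in LOGGED_CALLS, so no log entry
print(ok, len(tap_arm.log))   # True 1
```

Because the wrapper returns whatever the wrapped method returns and adds no waiting, callers such as a skill cannot tell the tap from the real device.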

Synthetic timestamps — `_build_sim_hook()`

Headless sim runs faster than real time, so wall-clock timestamps cluster. The hook instead emits a synthetic uniform timestamp t_ns = episode_start + sample_id × period, shared by both state and image frames (delta = 0).

recorder.py:490–548
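The timestamp formula can be checked in a few lines. The sketch assumes an integer period obtained by floor division and an arbitrary start value; the recorder's exact rounding is not stated here.

```python
STATE_HZ = 30
# Assumption: integer nanoseconds per sample, rounded down.
PERIOD_NS = 1_000_000_000 // STATE_HZ


def synthetic_t_ns(episode_start_ns: int, sample_id: int) -> int:
    """t_ns = episode_start + sample_id * period, independent of wall-clock time."""
    return episode_start_ns + sample_id * PERIOD_NS


start = 1_700_000_000_000_000_000  # arbitrary example start time
state_ts = [synthetic_t_ns(start, i) for i in range(4)]
image_ts = list(state_ts)  # images reuse the state timestamp, so the delta is 0
deltas = [b - a for a, b in zip(state_ts, state_ts[1:])]
print(deltas)  # [33333333, 33333333, 33333333]
```

The spacing depends only on sample_id, so it stays uniform no matter how fast the headless simulation actually runs.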

Rules: do not open the dashboard during capture (EGL render contention corrupts data); once recording starts, drive motion through PickSkill + tap_arm/tap_gripper, not hand-written trajectories. (The pre-episode reset/home calls arm.move_joints() on the raw arm directly, outside the recording — that is intentional.)

1.4 Output: raw_episodes/ layout

raw_episodes/
└── episode_0000/
    ├── meta.json              # instruction, spec_version, t_start_ns, sampling_mode, cube_initial_pos …
    ├── commands.parquet       # every move_joints / gripper command + timestamp
    ├── observations.parquet   # 30Hz: 5 joint angles + 1 gripper + t_ns + t_sim_s
    └── videos/
        ├── camera1.mp4        # overhead   (camera1 = OBS_IMAGE_1)
        ├── camera1_meta.parquet
        ├── camera2.mp4        # wrist_cam  (camera2 = OBS_IMAGE_2)
        ├── camera2_meta.parquet
        ├── camera3.mp4        # front      (camera3 = OBS_IMAGE_3)
        └── camera3_meta.parquet

Camera-name mapping: CAMERA_MAP (collect_cube_pick.py:93). MP4 encoded via imageio-ffmpeg (libx264, crf 23).

②Export & Convert — raw → LeRobot v3

Convert raw_episodes/ into a standard LeRobot Dataset v3. This step is task-agnostic; its core job is to resample each episode onto a strict 30Hz timeline and to generate the action labels used for training.

vector-os-nano/scripts/collection/export_to_lerobot.py · vector-os-nano/vector_os_nano/data/exporter.py

2.1 30Hz timeline resampling

A uniform set of timestamps is generated across the observation span; at each timestamp the nearest observation and image are taken. The exporter handles arbitrary raw episode timing in general; note that current sim_sync cube_pick data is already uniformly stamped at 30Hz, so the resampling is close to a pass-through for this data.

t_start, t_end = obs_ts[0], obs_ts[-1]
period_ns = 1e9 / fps                       # fps = 30
n_frames  = max(1, int((t_end - t_start) / period_ns))   # int so range() accepts it
frame_times = [t_start + i*period_ns for i in range(n_frames)]
# per frame_time: nearest observation → observation.state (6,)
#                 nearest image       → observation.images.*
#                                       exporter.py:160–217
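A runnable sketch of the same idea: build the uniform timeline, then pick the nearest raw sample for each frame time. It uses only the standard library; `build_timeline` and `nearest_index` are hypothetical helpers, and the rounding of frame times to integer nanoseconds is an assumption.

```python
from bisect import bisect_left


def build_timeline(obs_ts: list[int], fps: int = 30) -> list[int]:
    """Uniform frame times across the observation span."""
    period_ns = 1_000_000_000 / fps
    n_frames = max(1, int((obs_ts[-1] - obs_ts[0]) / period_ns))
    return [obs_ts[0] + round(i * period_ns) for i in range(n_frames)]


def nearest_index(sorted_ts: list[int], t: int) -> int:
    """Index of the timestamp closest to t (ties go to the earlier sample)."""
    pos = bisect_left(sorted_ts, t)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return pos if (after - t) < (t - before) else pos - 1


# Irregularly spaced raw timestamps covering 0.11 s
obs_ts = [0, 30_000_000, 70_000_000, 110_000_000]
frame_times = build_timeline(obs_ts)
picked = [nearest_index(obs_ts, t) for t in frame_times]
print(frame_times)  # [0, 33333333, 66666667]
print(picked)       # [0, 1, 2]
```

When the raw data is already stamped at exactly 30Hz, each frame time coincides with a raw sample and the lookup returns that sample unchanged.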

2.2 Two timestamp domains

Timestamp	From	Meaning	Used for
`t_ns`	observations.parquet	synthetic uniform timestamp (sim_sync: shared by state & images)	building the 30Hz timeline; nearest-state lookup
`t_host_ns`	cameraX_meta.parquet	per-frame image timestamp	nearest-image alignment + staleness check

Image alignment: for each 30Hz timestamp, find the nearest frame by t_host_ns; if the gap > --image-tolerance-ms (default 50ms) the frame is marked stale and counted in the report. exporter.py:96, 423–446

Caveat on t_host_ns: in sim_sync the recorder writes the same synthetic t_ns into the image meta, so image/state alignment is exact by construction. Only in async/real-hardware capture is t_host_ns a true host clock value, which is when the 50ms staleness check actually does work. recorder.py:541–543
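The staleness rule can be illustrated with a camera that stops delivering frames partway through an episode. This is a simplified sketch: `align_images` is a hypothetical helper using a linear nearest search, not the exporter's code.

```python
def align_images(
    frame_times: list[int], image_ts: list[int], tolerance_ms: float = 50.0
) -> tuple[list[tuple[int, bool]], int]:
    """For each frame time, return (nearest image index, is_stale) and a stale count."""
    tol_ns = int(tolerance_ms * 1_000_000)
    aligned = []
    stale_count = 0
    for t in frame_times:
        idx = min(range(len(image_ts)), key=lambda i: abs(image_ts[i] - t))
        is_stale = abs(image_ts[idx] - t) > tol_ns
        stale_count += int(is_stale)
        aligned.append((idx, is_stale))
    return aligned, stale_count


frame_times = [0, 33_333_333, 66_666_667, 100_000_000, 133_333_333, 166_666_667]
image_ts = [0, 33_333_333, 66_666_667]  # camera stopped after three frames
aligned, stale_count = align_images(frame_times, image_ts)
print(stale_count)  # 2
```

The fourth frame is about 33 ms from the last image and still passes; the fifth and sixth are about 67 ms and 100 ms away and are flagged.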

2.3 Action label — next_state mode

Default --action-mode next_state: the next observed state is used as the current action target.

Semantics

action[t] = observed_state[t+1]
absolute joint positions (not deltas, not control commands)
stored single-step, shape (6,) = 5 joints + 1 gripper
do not pre-chunk: at train time the dataloader stacks chunks via delta_timestamps

Terminal padding

idx = min(i+1, n-1) — the last frame's action points at itself
action_is_pad = 1.0 marks that frame as padding (0.0 otherwise)
padded frames are counted in report.padded_frames

exporter.py:185–193
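The labeling and padding rules above fit in a few lines. This sketch uses plain lists; `next_state_actions` is a hypothetical name and the three 6-dimensional states are made-up values.

```python
def next_state_actions(
    states: list[list[float]],
) -> tuple[list[list[float]], list[float]]:
    """action[t] = state[t+1]; the last frame points at itself and is marked as pad."""
    n = len(states)
    actions = []
    action_is_pad = []
    for i in range(n):
        idx = min(i + 1, n - 1)  # clamp at the final frame
        actions.append(list(states[idx]))
        action_is_pad.append(1.0 if i == n - 1 else 0.0)
    return actions, action_is_pad


# Three frames of (5 joints + 1 gripper); values are illustrative only.
states = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
]
actions, action_is_pad = next_state_actions(states)
print(action_is_pad)  # [0.0, 0.0, 1.0]
```

Each action stays a single absolute 6-vector, which leaves chunk stacking to the training dataloader.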

Legacy mode command_latch: action = last skill command target, forced to horizon=1. New data always uses next_state.

2.4 Output: LeRobot v3 dataset layout

$DATASET_ROOT/                            # = export_to_lerobot.py --output-root
├── meta/
│   ├── info.json            # feature spec (observation.state / action / images)
│   ├── stats.json           # per-dimension stats (used for normalization at train time)
│   ├── tasks.parquet        # instruction text table
│   └── episodes/chunk-000/file-000.parquet
├── data/chunk-000/file-000.parquet      # per-frame state / action / action_is_pad
└── videos/observation.images.<cam>/chunk-000/file-000.mp4

Where does the dataset go? --output-root is used as the final dataset directory exactly as given — export_to_lerobot.py passes root=output_root straight into LeRobotDataset.create and LeRobot does not append the repo-id. If --output-root is omitted, the exporter defaults to DEFAULT_LEROBOT_ROOT/<repo-id> (/home/zekun/vla_data/datasets/<repo-id>). exporter.py:73–77
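The path rule reduces to one branch. The function name `resolve_dataset_dir` is hypothetical, and the example paths other than the default root are made up.

```python
from pathlib import Path
from typing import Optional

DEFAULT_LEROBOT_ROOT = Path("/home/zekun/vla_data/datasets")


def resolve_dataset_dir(output_root: Optional[str], repo_id: str) -> Path:
    """Return the directory the dataset is written to."""
    if output_root is not None:
        return Path(output_root)  # used exactly as given; repo-id is not appended
    return DEFAULT_LEROBOT_ROOT / repo_id


print(resolve_dataset_dir("/data/out", "local/cube_pick"))
print(resolve_dataset_dir(None, "local/cube_pick"))
```

In practice this means --dataset.root at training time should be the same string that was passed as --output-root, with no repo-id suffix.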

Built by LeRobotDataset.create(robot_type="so101_follower"); frames added via dataset.add_frame(), finalized with dataset.save_episode(). Feature spec: _build_features() (exporter.py:246).

Upload to HuggingFace after export (needed for HPC training): huggingface-cli upload <user>/cube_pick <dataset_dir> --repo-type dataset. Train/val splits must be done by episode, never by frame, to avoid leakage.
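An episode-level split can be sketched as below. The 10% validation fraction, the seed, and the helper name `split_episodes` are assumptions for illustration; the source only requires that the split is made by episode.

```python
import random


def split_episodes(
    n_episodes: int, val_fraction: float = 0.1, seed: int = 0
) -> tuple[list[int], list[int]]:
    """Shuffle episode indices and hold out a fraction for validation."""
    ids = list(range(n_episodes))
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(n_episodes * val_fraction))
    return sorted(ids[n_val:]), sorted(ids[:n_val])


train_eps, val_eps = split_episodes(50)

# Frames follow their episode: (episode_index, frame_index) pairs.
frames = [(ep, f) for ep in range(50) for f in range(3)]
val_set = set(val_eps)
train_frames = [fr for fr in frames if fr[0] not in val_set]
val_frames = [fr for fr in frames if fr[0] in val_set]
print(len(train_eps), len(val_eps))  # 45 5
```

Splitting by frame would put near-identical neighbouring frames of the same episode on both sides, which is the leakage the rule guards against.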

③Training — SmolVLA + LoRA

Fine-tune SmolVLA on the LeRobot dataset with lerobot-train. The base model is lerobot/smolvla_base, adapted with LoRA (r=32) and predicting an action chunk of 50 steps per inference. The three run environments share the same hyperparameters.

scripts/run_pipeline.sh · scripts/train_cube_pick.sh · hpc/train_cube_pick.job

3.1 Core training command

lerobot-train \
  --policy.type=smolvla \
  --policy.pretrained_path=lerobot/smolvla_base \
  --policy.push_to_hub=false \
  --dataset.repo_id=local/cube_pick \
  --dataset.root=$DATASET_ROOT \
  --output_dir=$CKPT_DIR \
  --steps=100000  --batch_size=64 \
  --save_freq=10000  --eval_freq=0 \
  --policy.use_amp=true \
  --policy.chunk_size=50 \
  --peft.method_type=LORA  --peft.r=32

3.2 OOM auto-downgrade (shell retry, not a trainer feature)

The script first tries bs=64. If the process fails and the log matches the literal lowercase string "out of memory", it retries at bs=32, then bs=16. Steps scale up in inverse proportion to the batch size, so the total number of training samples seen (steps × batch size) stays constant; the number of gradient updates itself doubles with each halving.

steps = BASE_STEPS * BASE_BS / bs        # BASE_STEPS=100000, BASE_BS=64
  bs=64 → 100000 steps
  bs=32 → 200000 steps   (×2)
  bs=16 → 400000 steps   (×4)
# detect + retry: run_pipeline.sh:110–123 / train_cube_pick.sh:67–82

This is bash-level try/retry, triggered by a case-sensitive grep -q "out of memory" on the training log — not a trainer-internal auto batch reduction. It matches any log line containing that exact lowercase phrase (so "CUDA out of memory" matches), but an error phrased only as "OutOfMemoryError" would not. On OOM, run_pipeline.sh and train_cube_pick.sh delete the half-baked checkpoint dir before retrying; hpc/train_cube_pick.job retries without deleting OUTPUT_DIR.
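The retry logic is implemented in bash; the Python sketch below only mirrors its behavior for illustration. `train_with_retry` and `fake_run` are hypothetical, and aborting on a non-OOM failure is an assumption about what the scripts do in that case.

```python
BASE_STEPS = 100_000
BASE_BS = 64


def steps_for(bs: int) -> int:
    """Scale steps so that steps * batch_size stays constant."""
    return BASE_STEPS * BASE_BS // bs


def is_oom(log_text: str) -> bool:
    """Case-sensitive substring match, like grep -q "out of memory"."""
    return "out of memory" in log_text


def train_with_retry(run, batch_sizes=(64, 32, 16)) -> int:
    """run(bs, steps) -> (ok, log_text); returns the batch size that succeeded."""
    for bs in batch_sizes:
        ok, log_text = run(bs, steps_for(bs))
        if ok:
            return bs
        if not is_oom(log_text):
            # Assumption: a non-OOM failure is not retried.
            raise RuntimeError("training failed for a non-OOM reason")
    raise RuntimeError("out of memory even at the smallest batch size")


def fake_run(bs: int, steps: int):
    """Stand-in for launching lerobot-train: runs out of memory above bs=32."""
    if bs > 32:
        return False, "RuntimeError: CUDA out of memory. Tried to allocate ..."
    return True, f"finished {steps} steps"


print(train_with_retry(fake_run))                 # 32
print([steps_for(bs) for bs in (64, 32, 16)])     # [100000, 200000, 400000]
print(is_oom("torch.OutOfMemoryError"))           # False
```

The last line shows the gap in the string match: an error reported only as "OutOfMemoryError" is not recognized as an OOM.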

3.3 Three run environments

Environment	Entry point	Data source	Notes
Local GPU, one-click	`scripts/run_pipeline.sh`	local export output	collect→export→train→eval chained, with timing + logs
Local train-only	`scripts/train_cube_pick.sh`	local dataset dir	training only, with OOM downgrade
UCL Myriad HPC	`qsub hpc/train_cube_pick.job`	download from HuggingFace	SGE/qsub job (`#$` directives); token in home, big caches on scratch

From the experiment log (EXPERIMENT_LOG.md): UCL Myriad V100 was ~47s/step (too slow), so training moved to a DGX GB10 (Grace Blackwell, 128GB unified memory) run as an nohup background process. See EXP-001 / EXP-002 for the actual runs.

④Evaluation — closed-loop rollout

After training, eval_and_serve.py runs a closed-loop evaluation in simulation: it loads the LoRA checkpoint, rolls out over several seeds, records videos, computes a success rate, and (by default) starts a web viewer.

scripts/eval_and_serve.py

python scripts/eval_and_serve.py \
    --model $CKPT_DIR/checkpoints/<step>/pretrained_model \
    --instruction "pick up the red cube" \
    --seeds 0 1 2 3 4 --max-steps 150 \
    --out-dir $EVAL_DIR

Argument note: eval_and_serve.py accepts --model --instruction --seeds --max-steps --scene --lift-threshold --obj-rest-z --out-dir --port --no-serve --ffmpeg. There is no --task flag — passing it (as the docstring examples and run_pipeline.sh:149 currently do) makes argparse fail. See the note in the project summary.

sim vs real inference

Setting	`n_action_steps`
Simulation	1 — re-infer every step on the freshest observation; smoothest trajectory
Real arm	50 — execute a full chunk, then re-infer (per SmolVLA paper)

Set at load/deploy time via load_finetuned_policy() — not a CLI flag of this script.
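The effect of n_action_steps on how often the policy is queried can be sketched with a simple action queue. This is a toy model, not load_finetuned_policy() or the LeRobot policy class: the predicted chunk is replaced by a list of integers, and `count_inferences` is a hypothetical helper.

```python
def count_inferences(max_steps: int, n_action_steps: int, chunk_size: int = 50) -> int:
    """Count policy queries when n_action_steps actions are executed per predicted chunk."""
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be between 1 and chunk_size")
    queue: list[int] = []
    inferences = 0
    for _ in range(max_steps):
        if not queue:
            chunk = list(range(chunk_size))  # stand-in for a predicted action chunk
            queue = chunk[:n_action_steps]   # keep only the first n_action_steps
            inferences += 1
        queue.pop(0)  # "execute" one action
    return inferences


print(count_inferences(150, n_action_steps=1))   # 150
print(count_inferences(150, n_action_steps=50))  # 3
```

With --max-steps 150, the simulation setting queries the policy on every step, while the real-arm setting queries it three times and runs each chunk open-loop in between.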

Artifacts

results.json — per-seed success/failure + success rate
ep0X_sN_SUCCESS/FAIL.mp4 — per-episode video
combined.mp4 — merged video
web viewer on port 2310 by default; --no-serve records only (the one-click pipeline uses --no-serve)

Browser playback needs H.264. _reencode_h264() transcodes mp4v→H.264; if ffmpeg is missing it falls back to keeping mp4v and prints a warning (no crash). eval_and_serve.py:106

▤Data Structures & Schemas

`commands.parquet` · recorder.py:393

Field	Type
cmd_id	int32
t_call_ns	int64
t_sent_ns	int64 (reserved)
cmd_type	string
joint_target	list<float32>
gripper_target	float32
duration_s	float32

`observations.parquet` · recorder.py:403

Field	Type
sample_id	int32
t_ns	int64
t_sim_s	float64 (null for real-hw)
joint_pos_0 … 4	float32 ×5
gripper_pos	float32

⚙Key Parameters

Parameter	Value	Stage	Meaning
`STATE_HZ`	30	capture	state + image sampling rate (shared in sim_sync)
episodes	50	capture	5 positions cycled in blocks of RUNS_PER_POSITION (=2) → 10 per XY
`--action-mode`	next_state	export	action[t] = observed_state[t+1]
`--image-tolerance-ms`	50.0	export	image-vs-state gap above this is marked stale (effective in async/real)
fps	30	export	output timeline frequency
`chunk_size`	50	train/infer	predict 50 future action steps per inference
`peft.r`	32	train	LoRA rank (EXP-002 used 16)
`batch_size`	64→32→16	train	OOM downgrade; steps scale inversely
`steps`	100000	train	baseline at bs=64; scaled up when bs drops
`n_action_steps`	1 / 50	eval/deploy	sim=1, real=50; set in load_finetuned_policy()
`--port`	2310	eval	web viewer default port

A text version of the full flow with commands lives in the VLA project repo (PIPELINE.md); experiment records in EXPERIMENT_LOG.md.

▶Training Demonstrations

Pick — "pick up the red cube" (ep0040)

Place — "place the red cube on the black area" (ep0040)

▶Trained Model — Evaluation Results

Place task — `eval_place_10k` (step 10k, 5 seeds)

Pick task — SmolVLA finetuned (19 Apr) · can locate, no precise pick

⊕Related Work

RoboClaw Architecture
