π0 VLA Real-Robot Reproduction
Reproduced the π0 vision-language-action model on a real robot arm in 3 months, starting from scratch: built the arm, collected teleoperation data, fine-tuned the model, and deployed it end-to-end.

Background
π0 is a state-of-the-art vision-language-action model for robotic manipulation. I wanted to reproduce it on real hardware — not just run inference on a pre-trained checkpoint, but go through the full pipeline from hardware assembly to deployment.
What I Did
Built the robot arm and gripper setup from scratch, designed the manipulation scene, and collected 100+ teleoperation demonstrations. Converted them into a LeRobot-format dataset, LoRA fine-tuned the π0 model, and, after validating in Isaac Sim, deployed it over a TCP connection. At inference time, voice-to-text commands plus camera input are mapped to ΔJoint outputs (6 arm joints + gripper).
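The deployment step above can be sketched as a simple TCP client loop. Everything concrete here is an assumption for illustration: the host/port, the length-prefixed JSON wire format, and the helper callbacks (`get_image`, `get_instruction`, `apply_delta`) are hypothetical, not the project's actual protocol.

```python
import json
import socket
import struct

# Hypothetical wire format: 4-byte big-endian length prefix + JSON payload.
# Host and port are placeholders, not the project's real endpoint.
HOST, PORT = "192.168.1.42", 5555

def send_msg(sock, payload: dict) -> None:
    """Serialize a dict and send it with a length prefix."""
    data = json.dumps(payload).encode()
    sock.sendall(struct.pack(">I", len(data)) + data)

def recv_msg(sock) -> dict:
    """Read one length-prefixed JSON message."""
    (length,) = struct.unpack(">I", sock.recv(4))
    buf = b""
    while len(buf) < length:
        buf += sock.recv(length - len(buf))
    return json.loads(buf)

def control_loop(get_image, get_instruction, apply_delta):
    """Stream observations to the policy server, apply returned ΔJoint actions."""
    with socket.create_connection((HOST, PORT)) as sock:
        while True:
            send_msg(sock, {
                "image": get_image(),             # e.g. encoded camera frame
                "instruction": get_instruction()  # transcribed voice command
            })
            action = recv_msg(sock)["delta_joints"]  # 7 floats: 6 joints + gripper
            apply_delta(action)
```

The length-prefix framing matters because TCP is a byte stream with no message boundaries; without it, a JSON payload can arrive split across reads.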
Challenges
Learning VLA from zero while simultaneously building hardware, all within 3 months, was the main challenge. The sim-to-real gap and data quality were the biggest technical hurdles: small errors in demonstration data compound quickly during policy rollout, since each slightly off-distribution state pushes the policy further from the states seen in training.
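The compounding-error point can be made concrete with a toy calculation (not data from this project): a small constant bias in one joint's predicted delta accumulates linearly over an open-loop rollout, so an error far below teleoperation noise still drifts the arm well out of the workspace within seconds. The bias magnitude and control rate below are assumed for illustration.

```python
import numpy as np

# Toy illustration: a 0.5-degree per-step bias in one joint's ΔJoint output,
# rolled out open-loop for 10 s at an assumed 20 Hz control rate.
bias_deg = 0.5
steps = 200  # 10 s * 20 Hz

# Cumulative joint-angle drift at each step of the rollout.
drift = np.cumsum(np.full(steps, bias_deg))

print(f"drift after 1 s: {drift[19]:.1f} deg")
print(f"drift after 10 s: {drift[-1]:.1f} deg")
```

This is why closed-loop replanning from fresh camera observations, rather than open-loop action chunks alone, keeps small per-step errors from snowballing.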
Takeaways
Proved that a single person can go from zero to a working VLA system in 3 months. The full pipeline — hardware, data, training, deployment — is now something I can iterate on quickly.