Prototype note: multimodal teleoperation and fine-tuning on a home robotics setup

5 March 2026

This note documents a secondary prototype outside my main research agenda on multimodal retrieval and reasoning over video data. The goal was to test whether a low-cost arm and a desk robot could share visual inputs and execute a simple manipulation task.

The setup combined an SO-101 arm with Reachy Mini. The target task was to pick up a pen with the arm and present it to the robot.

SO-101 arm used in the prototype setup.

The first difficulty was perception. The arm required a wrist camera for local visual feedback and the robot provided its own front camera, which created a two-view multimodal input stream.

The second difficulty was data collection. Standard leader-arm teleoperation was not usable in this setup, so I built a small phone interface to control the follower arm directly.

Reachy Mini next to the Christmas tree — Reachy Mini Lite used with the SO-101 arm.

The phone interface exposed the wrist camera feed and two joysticks for joint control. This made it possible to collect task-specific demonstrations without a separate leader arm or a simulator.

Teleoperation app on phone: wrist camera feed and joystick controls for shoulder, elbow, wrist, and gripper — Phone teleoperation interface: wrist camera view and two joysticks for arm control.

I recorded about 50 teleoperated episodes of the arm picking up a pen and presenting it to Reachy. This produced a small task-specific dataset with synchronized control and visual observations.

I used these episodes to fine-tune SmolVLA. In this setup, full fine-tuning produced better training behavior than LoRA and completed in about 9 hours on a MacBook Pro M3.

Training loss and gradient norm: full finetuning vs LoRA r=16 — Training curves: full fine-tuning (fft) vs LoRA r=16, with lower loss for full fine-tuning.

The resulting policy executed the narrow pick-and-present behavior on the desk setup. The scope remains limited, but the prototype was useful for studying multimodal perception, teleoperation, and small-scale data collection.

Reachy and SO-101 acting together.

Code: https://github.com/YannisTevissen/my-home-bot