HKU Breakthrough: VR-Driven Robot Achieves Human-Like Motion Learning
2026-02-15 09:15:10


This groundbreaking research, jointly conducted by the University of Hong Kong, Shanghai Innovation Research Institute, Beihang University, and Kinetix AI, was published in February 2026, with the arXiv paper ID arXiv:2602.10106v1. Interested readers can access the full paper through this ID.


Imagine this scene: you are doing housework at home while wearing VR glasses, throwing away garbage, organizing items, and carrying things. These ordinary actions, once recorded, can actually teach a 1.3-meter-tall robot to do the same things in a completely different environment. This sounds like a plot from a science fiction movie, but a research team from the University of Hong Kong has really achieved it.


This project, named "EgoHumanoid", is the first to train humanoid robots for complex whole-body motion control using first-person videos of humans. Just as human infants learn to walk by observing adults, robots can now learn how to walk and manipulate objects in the real world by "watching" videos of humans.


The traditional robot training method is akin to a student solely studying in a classroom, never encountering the real world outside. Researchers typically need to use expensive and complex remote control equipment in the laboratory to "teach" the robot every action step-by-step. This approach is not only costly, but the skills the robot acquires are often only applicable in the monotonous environment of the laboratory. Once in real-life homes, stores, or outdoor settings, the robot enters a completely unfamiliar world and often behaves clumsily.


However, human daily life is exactly the opposite. We walk, fetch objects, and carry things in various environments every day, accumulating rich experience. The problem is that the physical structures of humans and robots are quite different: the average height of humans is 1.6 to 1.8 meters, while the experimental Unitree G1 robot is only 1.3 meters tall; humans have flexible fingers, while robots only have simple three-fingered mechanical hands; when humans walk, their bodies naturally sway, while robots need to maintain mechanical balance. This is like trying to put an adult's clothes directly on a child's body, where the size and proportion do not match.


The ingenuity of the research team lies in the development of a "translation system" that can "translate" human actions into commands that robots can understand and execute. This process involves two key steps: perspective alignment and action alignment.


Perspective alignment is like equipping the robot with a pair of "zoom glasses". Since humans are taller than robots, the two see the world from different viewpoints, just as an adult and a child looking at the same table see it differently: the adult looks down at it while the child views it at eye level. The research team used a technique called MoGe to estimate the distance of each pixel in the video, and then "lowered" the high human viewpoint to the robot's lower viewpoint. Where this conversion left blank areas, they used AI image generation to fill in (inpaint) the missing parts, ensuring that the robot sees a complete picture.
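To make the geometry concrete, here is a minimal sketch of the reprojection idea in Python. It assumes a pinhole camera model, shared intrinsics for both viewpoints, and a fixed height offset between the human and robot cameras; the function name, the default offset, and the simplified hole handling are illustrative assumptions, not the paper's actual implementation (which uses MoGe for depth estimation and a generative model for inpainting).

```python
# Sketch: warp a human ego-view image to a virtual camera at robot head height.
import numpy as np

def reproject_to_robot_view(image, depth, fx, fy, cx, cy, height_drop=0.4):
    """Warp a human ego-view image to a camera placed `height_drop` metres lower."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))

    # Unproject every pixel to a 3D point in the human camera frame (pinhole model).
    z = np.maximum(depth, 1e-6)
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy

    # In camera coordinates the y-axis points down, so lowering the camera by
    # `height_drop` shifts every scene point's y-coordinate by the same amount.
    y_new = y - height_drop

    # Reproject into the robot-height camera (same intrinsics assumed).
    u_new = np.round(fx * x / z + cx).astype(int)
    v_new = np.round(fy * y_new / z + cy).astype(int)

    warped = np.zeros_like(image)
    valid = (u_new >= 0) & (u_new < w) & (v_new >= 0) & (v_new < h) & (depth > 0)
    warped[v_new[valid], u_new[valid]] = image[vs[valid], us[valid]]

    # Pixels that were never written stay black: these are the "holes" that the
    # paper fills with a generative image model. Occlusion handling (a z-buffer)
    # is omitted here for brevity.
    return warped
```

The unwritten pixels after the warp correspond to regions the taller camera never saw, which is exactly why the pipeline needs the generative inpainting step described above.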


Action alignment is more like creating a "universal action dictionary". The research team designed a shared action language that both humans and robots can "speak". Upper-body manipulation is described with relative position changes, such as "extend the hand forward by 5 centimeters and turn left by 15 degrees", which avoids the mismatch in absolute positions caused by height differences. Lower-body locomotion is simplified into a handful of basic commands: forward, backward, turn left, turn right, squat, stand, and so on, as simple and clear as the directional buttons on a gamepad.
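A minimal sketch of what such a shared action format could look like is below; the field names, units, and command set are chosen for illustration and are assumptions rather than the paper's actual interface.

```python
# Sketch of a shared "action language": relative end-effector deltas for the
# upper body, plus a small discrete command set for the lower body.
from dataclasses import dataclass
from enum import Enum, auto

class LocomotionCommand(Enum):
    FORWARD = auto()
    BACKWARD = auto()
    TURN_LEFT = auto()
    TURN_RIGHT = auto()
    SQUAT = auto()
    STAND = auto()

@dataclass
class UpperBodyDelta:
    """Relative change of one hand between consecutive timesteps."""
    dx_m: float       # forward/backward displacement in metres
    dy_m: float       # lateral displacement in metres
    dz_m: float       # vertical displacement in metres
    dyaw_deg: float   # rotation about the vertical axis in degrees
    gripper_closed: bool

@dataclass
class WholeBodyAction:
    left_hand: UpperBodyDelta
    right_hand: UpperBodyDelta
    locomotion: LocomotionCommand

# The example from the text, "extend the hand forward by 5 cm and turn left by
# 15 degrees", encodes the same way whether it was demonstrated by a human or
# executed by the robot:
example = WholeBodyAction(
    left_hand=UpperBodyDelta(0.05, 0.0, 0.0, 15.0, gripper_closed=False),
    right_hand=UpperBodyDelta(0.0, 0.0, 0.0, 0.0, gripper_closed=False),
    locomotion=LocomotionCommand.STAND,
)
```

Because both data sources map into the same relative, discretized representation, demonstrations from a 1.7-meter human and a 1.3-meter robot can be mixed in one training set without absolute-position conflicts.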


To collect training data, the research team developed a portable VR recording system. Human volunteers wear a VR headset and body trackers; a camera mounted on the headset records first-person video while the trackers capture full-body movement. The equipment is lightweight and can be taken anywhere, unlike traditional robot teleoperation rigs, which are bulky and complex. Volunteers can naturally perform everyday tasks in real-world environments such as homes, shops, and parks, and all of this data is recorded automatically.
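As a rough illustration of what one recorded demonstration might contain, here is a sketch of an episode record. The article does not specify the data format, so the structure and field names below are assumptions.

```python
# Sketch of a single VR demonstration episode: ego-view frames paired with
# headset and body-tracker poses. All names and shapes are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class DemoFrame:
    timestamp_s: float
    ego_image: np.ndarray            # H x W x 3 first-person RGB frame
    head_pose: np.ndarray            # 4 x 4 headset pose in the world frame
    body_poses: Dict[str, np.ndarray]  # tracker name -> 4 x 4 pose (wrists, waist, ...)

@dataclass
class DemoEpisode:
    task_name: str                   # e.g. "pillow placement"
    environment: str                 # e.g. "home", "shop", "park"
    frames: List[DemoFrame] = field(default_factory=list)
```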


In comparison, teleoperated robot training is akin to taking a "standardized test" in a laboratory. Operators must precisely control every joint of the robot while wearing intricate remote-control equipment, which demands high technical proficiency and can only be done in a laboratory equipped with specialized hardware. According to the paper's statistics, collecting one human demonstration takes an average of 39.7 seconds, whereas a comparable robot teleoperation demonstration takes 62.1 seconds, making human data collection roughly 1.6 times faster.


The research team designed four test tasks to verify the effectiveness of this system. These tasks require the robot to possess both walking and manipulation abilities, just like humans need to walk and do things at the same time in daily life.


The first task is "pillow placement". The robot needs to carry the pillow to the bedside, squat down, and place the pillow in the designated position at the head of the bed. This task tests whether the robot can walk with balance while carrying an item, and accurately place the item on a soft bed surface.


The second task is "waste disposal". The robot needs to carry the waste to the trash can and accurately dispose of it into the opening. This is not a simple act of throwing from above, but requires disposing from the side, which demands precise spatial positioning and throwing skills from the robot.


The third task is "toy transfer". The robot needs to walk to a table, grab the toy with both hands, and then turn around to walk to another table to put the toy down. This task involves a continuous sequence of actions: approaching, grabbing, carrying, and placing, with each step requiring precision.


The fourth task is "shopping cart organization", which is the most complex one. The robot needs to push the shopping cart to the shelf, hold the cart with one hand to maintain stability, use the other hand to take toys off the shelf and put them into the cart, and finally push the cart away. This task requires the robot to have multitask coordination ability.


The experimental results were striking. In the familiar laboratory environment, the system trained solely on robot teleoperation data achieved an average success rate of 59%, while the success rate rose to 78% after incorporating human demonstration data. The real breakthrough, however, appeared in unfamiliar environments: the system trained purely on robot data succeeded only 31% of the time, whereas the system incorporating human data reached 82%, an improvement of 51 percentage points.


What does this imply? Just like comparing a student who only studies in school with one who not only studies in school but also has rich life experience, the latter is more adaptable when facing new situations. Human daily experience provides robots with abundant "common sense of life", enabling them to better handle various unexpected situations.


Further analysis reveals an interesting phenomenon: different types of skills benefit differently from human data. Navigation skills (such as walking, turning, and positioning) can be almost entirely learned from human data, as the basic principles of spatial movement are similar for humans and robots. However, the transfer effect of fine manipulation skills (such as precise grasping and rotating objects) is poor, as human fingers are far more dexterous than robotic hands.


The research team also discovered that the diversity of human data is more crucial than its quantity. They conducted a comparative experiment: using the same amount of human demonstration data, but collecting it in 1, 2, and 3 different scenarios respectively. The results indicated that even with the same total data volume, the more diverse the scenarios, the stronger the robot's generalization ability. This is akin to learning a language; encountering the same vocabulary in different contexts is more conducive to comprehension than repeatedly hearing the same vocabulary in the same environment.


Of course, this system has its limitations. The primary issue is that precise conversion of hand movements remains challenging. Due to the significant structural differences between human and robotic hands, it is difficult for robots to accurately understand the precise rotational movements humans intend to perform. Additionally, this training method demands high data quality, requiring human demonstrators to maintain relatively standard movements, such as not obscuring their hands for too long or swaying their bodies excessively.


Looking ahead, the potential applications of this technology are vast. Household service robots may no longer need to be individually programmed for each new household, but can learn to adapt to new environments by watching videos of their owners' daily lives. Industrial robots may also quickly learn new assembly processes by watching videos of skilled workers operating. More interestingly, with the popularity of VR and AR devices, ordinary people's daily activities themselves may become valuable resources for robot learning.


The true significance of this research lies in opening up a brand-new path for robot training. Previous robot learning was akin to the traditional apprenticeship model, where every move had to be taught step-by-step. However, nowadays, robots have begun to possess the ability to learn through observation, just like human infants acquire basic skills by observing adults. Although this type of learning is not yet perfect, it represents a significant milestone in the development of robot intelligence.


With the continuous improvement of this technology, we may indeed usher in an era where robots can learn new skills just by being demonstrated once with VR glasses. At that time, training robots may become as simple as recording a video tutorial.


Q&A


Q1: How does the EgoHumanoid system specifically enable robots to learn human actions?


A: EgoHumanoid is achieved through two steps: first, perspective alignment, which utilizes AI technology to convert the high perspective of humans into the low perspective of robots; then, action alignment, which transforms complex human actions into simple instructions that robots can understand. It's like creating a universal action dictionary for humans and robots, allowing robots to "translate" human demonstrative actions.


Q2: What are the advantages of using VR glasses to train robots compared to traditional methods?


A: The greatest advantage of VR-based training lies in its portability and efficiency. Traditional methods require complex laboratory equipment to teleoperate the robot, whereas the VR system can collect data anywhere and makes data collection roughly 1.6 times faster. More importantly, humans can naturally demonstrate in diverse real-world environments, providing robots with rich "life experience" and significantly enhancing their adaptability in new environments.


Q3: What is the success rate of this training method?


A: In tests in unfamiliar environments, the success rate of pure robot-data training was only 31%, while the system combined with human demonstration data achieved 82%, an improvement of 51 percentage points. In familiar environments, the success rate also increased from 59% to 78%. This indicates that human daily experience can indeed significantly enhance the learning effectiveness and adaptability of robots.
