Leveraging LLM with Active Imitation Learning of Hierarchical Reinforcement Learning using Emergent Symbolic Representation
Research at ENSTA, IP-Paris, France, 2025
Large Language Models (LLMs) show strong potential for interacting with reinforcement learning agents. The main challenge is to align the world model learned by the agent with a representation compatible with LLMs; such representations should be well structured and capture the full information of the environment. Some hierarchical reinforcement learning (HRL) methods address this challenge by decomposing tasks and producing emergent symbolic representations of long-horizon tasks. However, a central open question remains: how can an agent effectively learn a representation of the environment that aligns with LLMs? We introduce SGIM-STAR, a hybrid framework in which the top-level agent actively chooses between a Q-learning-based Commander and an LLM-based planner using a partition-wise, progress-driven intrinsic rule. Both strategies share a symbolic representation of the space. Experiments demonstrate that SGIM-STAR improves stability over STAR, reduces reliance on costly LLM calls, and achieves higher long-horizon task success.
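To make the selection rule concrete, here is a minimal sketch of one plausible way to implement a partition-wise, progress-driven choice between the two strategies. All names (`StrategySelector`, the success-rate-based progress proxy, the window and epsilon parameters) are illustrative assumptions for exposition, not the paper's exact method.

```python
import random
from collections import defaultdict, deque

class StrategySelector:
    """Hypothetical sketch: per-partition, progress-driven choice between a
    Q-learning Commander and an LLM planner. The progress measure and
    exploration scheme are assumptions, not the published algorithm."""

    def __init__(self, strategies=("commander", "llm_planner"),
                 window=10, epsilon=0.1):
        self.strategies = strategies
        self.window = window      # recent outcomes kept per (partition, strategy)
        self.epsilon = epsilon    # residual exploration between strategies
        self.history = defaultdict(lambda: deque(maxlen=window))

    def progress(self, partition, strategy):
        """Learning-progress proxy: change in success rate between the older
        and newer halves of the recent outcome window for this partition."""
        h = self.history[(partition, strategy)]
        if len(h) < 4:
            return float("inf")   # optimistic: try under-sampled strategies first
        half = len(h) // 2
        old, new = list(h)[:half], list(h)[half:]
        return sum(new) / len(new) - sum(old) / len(old)

    def select(self, partition):
        """Pick the strategy with the highest recent progress in this
        partition, with epsilon-greedy exploration between strategies."""
        if random.random() < self.epsilon:
            return random.choice(self.strategies)
        return max(self.strategies,
                   key=lambda s: self.progress(partition, s))

    def update(self, partition, strategy, success):
        """Record the outcome of an episode run with the chosen strategy."""
        self.history[(partition, strategy)].append(float(success))


# Usage sketch: partitions would index regions of the symbolic goal space.
selector = StrategySelector()
strategy = selector.select(partition=3)
# ... run an episode with the chosen strategy, observe success ...
selector.update(partition=3, strategy=strategy, success=True)
```

Under this reading, the rule naturally reduces reliance on the LLM planner: once the Commander's progress in a partition dominates, the selector calls the LLM only during residual exploration.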
Recommended citation: Ma, Z., Nguyen, S. M., and Xu, P. (2025). Bridging Symbols from Language and Hierarchical Reinforcement Learning with Active Imitation. NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning.
