Research Direction
The Trustworthy Learning under Uncertainty (TLU) team focuses on enabling the safe, reliable deployment of embodied foundation models in human environments. As robots move beyond controlled lab settings into homes, workplaces, and shared spaces, they must operate robustly under uncertainty, adapt to novel situations, and behave in ways that are predictable and safe for people around them.
Our research addresses the challenges that arise when large behavior models (LBMs) are deployed at scale. To quantify performance and robustness from noisy real-world experiments, we develop rigorous statistical evaluation frameworks that yield meaningful, reproducible experimental conclusions. To support safe operation, we study failure prediction and detection, with a focus on identifying out-of-distribution conditions that compromise downstream behavior.
Beyond detection, we design methods for mitigation and recovery, including safe fallback behaviors and mechanisms to return systems to nominal operation after failure. Finally, we develop (inter)active and continual learning approaches that use policy-aware uncertainty to efficiently adapt embodied models to new environments, tasks, and user preferences. Together, these efforts aim to make large-scale robotic deployment in human environments both trustworthy and adaptive.
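As a toy illustration of the statistical evaluation problem above (a sketch for intuition only, not code from any of the papers below): given n independent rollouts of a policy with k observed successes, an exact Clopper-Pearson interval bounds the true success rate with guaranteed coverage. Our work on policy evaluation and comparison develops substantially more sample-efficient methods than this baseline.

from scipy.stats import beta  # standard SciPy dependency

def success_rate_interval(k: int, n: int, alpha: float = 0.05):
    """Exact (conservative) two-sided 1 - alpha confidence interval
    for a policy's true success probability, given k successes in n
    binary rollout outcomes (Clopper-Pearson construction)."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Example: 42 successes in 50 trials gives roughly (0.71, 0.93)
# at 95% confidence, despite an empirical success rate of 0.84.
print(success_rate_interval(42, 50))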
Core Team
Chen Xu, Haruki Nishimura, Masha Itkina
Recent / Representative Papers
Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison
David Snyder, Apurva Badithela, Nikolai Matni, Anirudha Majumdar, Masha Itkina, Haruki Nishimura, George J. Pappas
Preprint, 2026
A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation
Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, Andrew Beaulieu, Jose Barreiros
arXiv, 2026
https://arxiv.org/abs/2602.01067
Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning
Kevin Huang, Rosario Scalise, Cleah Winston, Ayush Agrawal, Yunchu Zhang, Rohan Baijal, Markus Grotz, Byron Boots, Benjamin Burchfiel, Masha Itkina, Paarth Shah, Abhishek Gupta
IEEE International Conference on Robotics and Automation (ICRA), 2026
https://arxiv.org/abs/2510.19495
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, Florian Shkurti
Neural Information Processing Systems (NeurIPS), 2025
https://arxiv.org/abs/2506.09937
STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation
Hossein Goli, Michael Gimelfarb, Nathan Samuel de Lara, Haruki Nishimura, Masha Itkina, Florian Shkurti
Neural Information Processing Systems (NeurIPS), Spotlight, 2025
https://arxiv.org/abs/2505.20781
CUPID: Curating Data Your Robot Loves with Influence Functions
Christopher Agia, Rohan Sinha, Jingyun Yang, Rika Antonova, Marco Pavone, Haruki Nishimura, Masha Itkina, Jeannette Bohg
Conference on Robot Learning (CoRL), 2025
https://arxiv.org/abs/2506.19121
Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping
David Snyder, Asher J. Hancock, Apurva Badithela, Emma Dixon, Patrick Miller, Rares Andrei Ambrus, Anirudha Majumdar, Masha Itkina, Haruki Nishimura
Robotics: Science and Systems (RSS), 2025
https://arxiv.org/abs/2503.10966
Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies
Chen Xu, Tony Khuong Nguyen, Emma Dixon, Christopher Rodriguez, Patrick Miller, Robert Lee, Paarth Shah, Rares Ambrus, Haruki Nishimura, Masha Itkina
Robotics: Science and Systems (RSS), 2025
https://arxiv.org/abs/2503.08558
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
TRI LBM Team, Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, … Haruki Nishimura, … Chen Xu, … Russ Tedrake
arXiv, 2025
https://arxiv.org/abs/2507.05331
GHIL-Glue: Hierarchical Control with Filtered Subgoal Images
Kyle B. Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, Benjamin Burchfiel
IEEE International Conference on Robotics and Automation (ICRA), 2025
https://arxiv.org/abs/2410.20018
How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation
Joseph A. Vincent, Haruki Nishimura, Masha Itkina, Paarth Shah, Mac Schwager, Thomas Kollar
IEEE Robotics and Automation Letters (RA-L), 2024
https://arxiv.org/abs/2405.05439
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, … Masha Itkina, … Sergey Levine, Chelsea Finn
Robotics: Science and Systems (RSS), 2024
https://arxiv.org/abs/2403.12945
Explore Until Confident: Efficient Exploration for Embodied Question Answering
Allen Z. Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, Dorsa Sadigh
Robotics: Science and Systems (RSS), 2024
https://arxiv.org/abs/2403.15941