Research Direction

The Trustworthy Learning under Uncertainty (TLU) team focuses on enabling the safe, reliable deployment of embodied foundation models in human environments. As robots move beyond controlled lab settings into homes, workplaces, and shared spaces, they must operate robustly under uncertainty, adapt to novel situations, and behave in ways that are predictable and safe for people around them.

Our research addresses the challenges that arise when large behavior models (LBMs) are deployed at scale. To quantify performance and robustness from noisy real-world experiments, we develop rigorous statistical evaluation frameworks that yield meaningful, reproducible experimental conclusions. To support safe operation, we study failure prediction and detection, with a focus on identifying out-of-distribution conditions that compromise downstream behavior.

Beyond detection, we design methods for mitigation and recovery, including safe fallback behaviors and mechanisms to return systems to nominal operation after failure. Finally, we develop (inter)active and continual learning approaches that use policy-aware uncertainty to efficiently adapt embodied models to new environments, tasks, and user preferences. Together, these efforts aim to make large-scale robotic deployment in human environments both trustworthy and adaptive.

Core Team

Chen Xu, Haruki Nishimura, Masha Itkina

Recent / Representative Papers

Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

David Snyder, Apurva Badithela, Nikolai Matni, Anirudha Majumdar, Masha Itkina, Haruki Nishimura, George J. Pappas
Pre-print, 2026
 

A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, Andrew Beaulieu, Jose Barreiros
arXiv, 2026
https://arxiv.org/abs/2602.01067
 

Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning

Kevin Huang, Rosario Scalise, Cleah Winston, Ayush Agrawal, Yunchu Zhang, Rohan Baijal, Markus Grotz, Byron Boots, Benjamin Burchfiel, Masha Itkina, Paarth Shah, Abhishek Gupta
IEEE International Conference on Robotics and Automation (ICRA), 2026
https://arxiv.org/abs/2510.19495
 

SAFE: Multitask Failure Detection for Vision-Language-Action Models

Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, Florian Shkurti
Neural Information Processing Systems (NeurIPS), 2025
https://arxiv.org/abs/2506.09937
 

STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Hossein Goli, Michael Gimelfarb, Nathan Samuel de Lara, Haruki Nishimura, Masha Itkina, Florian Shkurti
Neural Information Processing Systems (NeurIPS) Spotlight, 2025
https://arxiv.org/abs/2505.20781
 

CUPID: Curating Data Your Robot Loves with Influence Functions

Christopher Agia, Rohan Sinha, Jingyun Yang, Rika Antonova, Marco Pavone, Haruki Nishimura, Masha Itkina, Jeannette Bohg
Conference on Robot Learning (CoRL), 2025
https://arxiv.org/abs/2506.19121
 

Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping

David Snyder, Asher J. Hancock, Apurva Badithela, Emma Dixon, Patrick Miller, Rares Andrei Ambrus, Anirudha Majumdar, Masha Itkina, Haruki Nishimura
Robotics: Science and Systems (RSS), 2025
https://arxiv.org/abs/2503.10966
 

Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies

Chen Xu, Tony Khuong Nguyen, Emma Dixon, Christopher Rodriguez, Patrick Miller, Robert Lee, Paarth Shah, Rares Ambrus, Haruki Nishimura, Masha Itkina
Robotics: Science and Systems (RSS), 2025
https://arxiv.org/abs/2503.08558
 

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

TRI LBM Team, Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, … Haruki Nishimura, … Chen Xu, … Russ Tedrake
arXiv, 2025
https://arxiv.org/abs/2507.05331
 

GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

Kyle B. Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, Benjamin Burchfiel
IEEE International Conference on Robotics and Automation (ICRA), 2025
https://arxiv.org/abs/2410.20018
 

How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation

Joseph A. Vincent, Haruki Nishimura, Masha Itkina, Paarth Shah, Mac Schwager, Thomas Kollar
IEEE Robotics and Automation Letters (RA-L), 2024
https://arxiv.org/abs/2405.05439
 

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, … Masha Itkina, … Sergey Levine, Chelsea Finn
Robotics: Science and Systems (RSS), 2024
https://arxiv.org/abs/2403.12945
 

Explore Until Confident: Efficient Exploration for Embodied Question Answering

Allen Z. Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, Dorsa Sadigh
Robotics: Science and Systems (RSS), 2024
https://arxiv.org/abs/2403.15941