Robotic manipulation in open-world settings requires not only executing tasks but also detecting and learning from failures during execution. While recent advances in vision-language models (VLMs) and large language models (LLMs) have enhanced robots' spatial reasoning and problem-solving capabilities, these models often struggle to recognize and reason about failures, limiting their effectiveness in real-world applications. In this paper, we introduce AHA, an open-source VLM specifically designed to detect and reason about failures in robotic manipulation through natural language. By framing failure detection as a free-form reasoning task, AHA identifies failures and generates detailed explanations that transfer across robots, tasks, and environments in both simulation and real-world scenarios. To fine-tune AHA, we developed FailGen, a scalable simulation framework that procedurally generates the AHA dataset, the first large-scale dataset of robotic failure trajectories, by perturbing successful demonstrations from the RLBench simulator. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real-world failure datasets, different robotic systems, and unseen tasks. It surpasses the second-best model by 10.3% and exceeds the average performance of all six compared models (five state-of-the-art VLMs and one model employing in-context learning) by 35.3% across multiple metrics and datasets. Moreover, we integrate AHA into three VLM/LLM-assisted manipulation frameworks: its natural language failure feedback enhances error recovery and policy performance by improving reward functions with Eureka reflection, optimizing task and motion planning, and verifying sub-task success in zero-shot robotic manipulation. Our approach achieves an average task success rate 21.4% higher than GPT-4 models. Our contributions are threefold: (1) developing FailGen and curating the AHA dataset, enabling scalable procedural generation of failure demonstrations; (2) instruction-tuning AHA for advanced failure reasoning in manipulation tasks, outperforming existing models; and (3) integrating AHA into downstream robotic systems, demonstrating improved error correction and policy performance.
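To make the data-generation idea concrete, below is a minimal sketch of how waypoints of a successful demonstration could be procedurally perturbed into failure trajectories paired with free-form language explanations, which can then serve as instruction-tuning examples. It is an illustration only, not the authors' FailGen implementation; every name in it (`Waypoint`, `perturb_translation`, `FAILURE_MODES`, and so on) is hypothetical.

```python
# Sketch of procedural failure generation (hypothetical names; not the actual FailGen code).
from dataclasses import dataclass, replace
import random
from typing import List, Tuple

@dataclass
class Waypoint:
    position: Tuple[float, float, float]  # end-effector position (x, y, z) in meters
    gripper_open: bool                    # commanded gripper state at this waypoint

def perturb_translation(wp: Waypoint, magnitude: float = 0.05) -> Waypoint:
    """Offset the target position so a grasp or placement misses the object."""
    dx, dy, dz = (random.uniform(-magnitude, magnitude) for _ in range(3))
    x, y, z = wp.position
    return replace(wp, position=(x + dx, y + dy, z + dz))

def perturb_gripper(wp: Waypoint) -> Waypoint:
    """Flip the gripper command so the object is dropped or never grasped."""
    return replace(wp, gripper_open=not wp.gripper_open)

# Each failure mode pairs a perturbation with a natural language explanation label.
FAILURE_MODES = {
    "missed_grasp": (perturb_translation,
                     "The gripper closed next to the object instead of around it."),
    "premature_release": (perturb_gripper,
                          "The gripper opened before reaching the target, dropping the object."),
}

def generate_failure(success_traj: List[Waypoint], mode: str) -> Tuple[List[Waypoint], str]:
    """Apply one perturbation to a random waypoint of a successful trajectory."""
    perturb_fn, explanation = FAILURE_MODES[mode]
    failed = list(success_traj)
    idx = random.randrange(len(failed))
    failed[idx] = perturb_fn(failed[idx])
    return failed, explanation

if __name__ == "__main__":
    demo = [Waypoint((0.3, 0.0, 0.20), True),
            Waypoint((0.3, 0.0, 0.05), False),
            Waypoint((0.5, 0.2, 0.20), False)]
    traj, why = generate_failure(demo, "missed_grasp")
    # The (trajectory, explanation) pair becomes one free-form reasoning example, e.g.
    # Q: "Did the robot succeed at the task? If not, explain why."  A: why
    print(why)
```

Under these assumptions, each perturbed rollout is rendered in simulation and paired with its explanation to form the question-answer data used for instruction tuning.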
[Figure: AHA integrated into three downstream frameworks: VLM reward function generation (Eureka), VLM task-plan generation (PRoC3S), and VLM sub-task verification (Manipulate Anything). AHA-improved VLM/LLM generations and the generated responses are shown within code blocks.]