MIT’s Improbable AI Lab has developed a multimodal framework called Compositional Foundation Models for Hierarchical Planning (HiP) to help robots produce detailed, feasible plans for multi-step tasks such as household chores. The framework addresses the challenge of giving robots the ability to efficiently plan and execute tasks that involve many steps, much like the everyday chores humans handle.
The HiP framework combines three foundation models, each trained on a different data modality, to capture distinct aspects of the decision-making process: a large language model (LLM), a video diffusion model, and an egocentric action model. The models operate in a hierarchy, allowing the system to plan, reason, and adapt to new information in a transparent way.
The large language model (LLM) serves as the foundation for ideation, capturing symbolic information needed for abstract task planning. It leverages common sense knowledge obtained from the internet to break down complex objectives into sub-goals. For example, a task like “making a cup of tea” is decomposed into sub-goals such as “filling a pot with water” and “boiling the pot.”
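To make the decomposition step concrete, here is a minimal, hypothetical sketch of how an LLM might be prompted to produce sub-goals. The `query_llm` function is a stub standing in for any language-model backend and is not part of HiP’s released code; only the prompting-and-parsing pattern is the point.

```python
# Hypothetical sketch of the sub-goal decomposition step (not HiP's actual code).
# `query_llm` stands in for any text-completion backend; here it is a stub.

def query_llm(prompt: str) -> str:
    """Stand-in for a real language-model call, stubbed for illustration."""
    return "1. fill a pot with water\n2. boil the pot\n3. steep the tea bag"

def decompose_task(task: str) -> list[str]:
    """Ask the LLM to break a high-level task into an ordered list of sub-goals."""
    prompt = (
        f"Break the task '{task}' into short, concrete sub-goals a robot can "
        "execute one at a time. Return a numbered list."
    )
    reply = query_llm(prompt)
    # Parse the numbered list into plain sub-goal strings.
    return [line.split(".", 1)[1].strip() for line in reply.splitlines() if "." in line]

print(decompose_task("make a cup of tea"))
# ['fill a pot with water', 'boil the pot', 'steep the tea bag']
```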
The video diffusion model augments the planning process by providing geometric and physical information about the environment. Trained on internet footage, this model generates an observation trajectory plan, refining the initial plan outlined by the LLM. It enhances the robot’s understanding of the physical world, allowing it to adapt to changes and unforeseen circumstances.
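The sketch below illustrates the general shape of this stage under stated assumptions: a conditional denoising loop that turns random noise into a short sequence of future frames, conditioned on a sub-goal and the robot’s current camera view. The `denoise_step` function is an illustrative stub, not HiP’s trained video diffusion model, and the array shapes are placeholders.

```python
# Schematic sketch of conditional video "planning" via iterative denoising.
# The denoiser is a stub, not HiP's trained model; shapes are illustrative only.
import numpy as np

def denoise_step(frames: np.ndarray, t: int, subgoal: str, obs: np.ndarray) -> np.ndarray:
    """Stand-in for one reverse-diffusion step, conditioned on the sub-goal text
    and the robot's current camera observation."""
    # A real model would predict and subtract noise; here we simply pull the
    # noisy frames toward the conditioning image to show the control flow.
    return 0.9 * frames + 0.1 * obs

def plan_observation_trajectory(subgoal: str, obs: np.ndarray,
                                horizon: int = 8, steps: int = 50) -> np.ndarray:
    """Generate a short sequence of future frames (an observation trajectory)
    meant to visually realise the sub-goal, starting from pure noise."""
    frames = np.random.randn(horizon, *obs.shape)   # noisy initial trajectory
    for t in reversed(range(steps)):                # iterative denoising loop
        frames = denoise_step(frames, t, subgoal, obs)
    return frames

current_view = np.zeros((64, 64, 3))                # placeholder camera image
trajectory = plan_observation_trajectory("fill a pot with water", current_view)
print(trajectory.shape)                             # (8, 64, 64, 3)
```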
Completing the hierarchy is the egocentric action model, which infers which actions the robot should take from sequences of first-person images of its surroundings. This model helps the robot execute each sub-goal within the long-horizon plan by mapping out the space visible to the robot and determining the appropriate actions.
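A rough sketch of this execution step, assuming an inverse-dynamics-style mapping from pairs of frames to actions, might look as follows; the linear “model” here is untrained and purely illustrative, not HiP’s action model.

```python
# Illustrative sketch of action inference: given the current first-person frame
# and the next frame of the planned trajectory, predict the action that should
# move the robot between them. The linear "model" is an untrained placeholder.
import numpy as np

def infer_action(current_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Placeholder inverse-dynamics-style model: frame pair -> action vector."""
    diff = (next_frame - current_frame).reshape(-1)
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((7, diff.size)) * 0.01   # e.g. a 7-DoF arm command
    return weights @ diff

def execute_plan(trajectory: np.ndarray, current_frame: np.ndarray) -> list[np.ndarray]:
    """Step through the planned observation trajectory, one inferred action per frame."""
    actions = []
    for next_frame in trajectory:
        actions.append(infer_action(current_frame, next_frame))
        current_frame = next_frame   # assume the action reaches the planned frame
    return actions

frames = np.zeros((8, 64, 64, 3))                 # stand-in for a planned trajectory
print(len(execute_plan(frames, np.zeros((64, 64, 3)))))   # 8 actions, one per frame
```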
The HiP framework is designed to address the limitations of existing models that require paired vision, language, and action data, which can be challenging to obtain. HiP’s multimodal approach enables the integration of linguistic, physical, and environmental intelligence into the robot’s decision-making process.
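Putting the pieces together, the control flow of such a hierarchy can be summarized in a few lines. The callables below are toy, hypothetical stand-ins for the LLM, the video planner, and the action model; only the top-down composition is meant to reflect the spirit of HiP.

```python
# Hypothetical top-down composition of the three stages; only the control flow
# is the point. The callables are toy stand-ins for the LLM, the video planner,
# and the action model, not HiP's actual components.
from typing import Callable, List

def hierarchical_plan(task: str,
                      decompose: Callable[[str], List[str]],
                      plan_frames: Callable[[str], List[str]],
                      act: Callable[[str], str]) -> List[str]:
    """Plan top-down: task -> symbolic sub-goals -> visual plans -> actions."""
    executed = []
    for subgoal in decompose(task):            # 1) language model: sub-goals
        for frame in plan_frames(subgoal):     # 2) video model: visual trajectory
            executed.append(act(frame))        # 3) action model: motor command
    return executed

log = hierarchical_plan(
    "make a cup of tea",
    decompose=lambda t: ["fill a pot with water", "boil the pot"],
    plan_frames=lambda g: [f"{g} / frame {i}" for i in range(3)],
    act=lambda f: f"action for ({f})",
)
print(len(log))   # 6 actions: 2 sub-goals x 3 frames each
```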
The researchers evaluated HiP on three manipulation tasks, where it outperformed comparable frameworks in adaptability and reasoning. The tasks involved stacking blocks of different colors, arranging objects in a box while ignoring others, and completing kitchen sub-goals such as opening a microwave and turning on a light.
The team envisions that HiP could find applications in various domains, including household chores, construction, and manufacturing tasks. It represents a step towards creating more capable and adaptable robots that can navigate complex real-world scenarios.
The HiP framework’s ability to leverage pre-trained models from different data modalities showcases the potential for combining existing expertise to enhance robotic decision-making. The researchers believe that HiP could be extended to incorporate additional modalities, such as touch and sound, to further improve its planning capabilities.
While the current version of HiP relies on relatively small video models, the researchers note that higher-quality video foundation models could further improve visual sequence prediction and robot action generation. They also highlight HiP’s cost-effectiveness and transparency, which make it a promising approach for robotic planning tasks.
Overall, the development of the HiP framework represents a significant contribution to advancing the field of robotics, particularly in the domain of long-horizon planning and decision-making. As robots become more integrated into various aspects of daily life, innovations like HiP pave the way for more intelligent, adaptive, and capable robotic systems.