Physical Intelligence Unveils π0.7: An AI Model That Lets Robots Perform Tasks They Were Never Trained On

April 21, 2026

Physical Intelligence, a startup founded by former Google engineers, released π0.7 — a Vision-Language-Action model that demonstrates compositional generalization, enabling robots to combine learned skills for untrained tasks.

CoinJP Editorial

Physical Intelligence, a startup founded by former Google engineers, has released its latest model π0.7. The team claims the system represents a "qualitative leap" in AI's ability to generalize skills, allowing robots to tackle tasks they were never explicitly trained to perform.

«Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model!» — Physical Intelligence (@physical_int), original post

Why This Matters

Historically, AI models in robotics have required fine-tuning for every new task and every new robot platform — much like early language models needed task-specific training. Each new scenario demanded fresh datasets and separate training runs. The π0.7 model aims to break this pattern by working out of the box and adapting to unfamiliar situations through text and visual prompts. If the claimed capabilities hold up in real-world deployments, this could mark a meaningful step toward general-purpose robots that aren't locked into narrow use cases.

Compositional Generalization and Cross-Robot Transfer

The π0.7 model belongs to the Vision-Language-Action (VLA) class, combining visual perception, natural language understanding, and physical action generation to control robots. Its standout feature is compositional generalization — the ability to combine previously learned skills to solve entirely new problems.

During testing, π0.7 exhibited several unexpected capabilities. The model successfully controlled a UR5e robot to fold t-shirts, despite the training data containing no laundry-folding examples for that particular platform.

«Compositional generalization is a key capability of large models like LLMs, but it has been elusive in robotics. Another emergent ability we found is to control a new robot (UR5e) to fold t-shirts, even though we didn't have any laundry folding data on this robot.» — Physical Intelligence (@physical_int), original post

The developers reported that π0.7's folding performance on the UR5e was comparable to that of human operators with hundreds of hours of teleoperation experience. The model also partially completed a task that involved cooking a sweet potato in an air fryer, a scenario entirely absent from the training set. According to the team, this was possible because the model merged disparate skills, much as large language models combine knowledge across domains.

Multimodal Control: Not Just "What" But "How"

A key innovation in π0.7 is its expanded control interface. The model accepts not only commands specifying what to do, but also guidance on how to do it. Three types of input are supported:

Natural language text instructions;
Metadata such as execution speed and quality parameters;
Visual subgoals — images depicting the expected outcome at each step.

«π0.7 handles diverse prompts that don't just say what to do, but also how to do it, including rich language and multimodal information, such as visual subgoal images. At test time, these images can be produced by a lightweight world model.» — Physical Intelligence (@physical_int), original post

At test time, some of these visual subgoal images can be generated by a lightweight world model, which lets the robot adjust its behavior on the fly without retraining. The architecture also brings data from multiple sources, including video recordings, robot telemetry, and autonomously collected episodes, into a single training pipeline.

Early Signs of General-Purpose Robots

Physical Intelligence emphasized that compositional generalization, a key capability of large language models, has long been elusive in robotics. With π0.7, the company aims to close that gap with a model that works immediately upon deployment and adapts through language-based prompts.

The developers acknowledge limitations: without step-by-step instructions, the model doesn't always succeed at complex tasks. However, when given sequential guidance, execution quality improves significantly. In the future, this approach could form the foundation for training more autonomous machines capable of operating without constant human oversight.

Physical Intelligence views π0.7 as showing the first signs of a transition toward general-purpose robots: systems that adapt to new environments without manual configuration for each individual task.

Frequently Asked Questions

What is Physical Intelligence π0.7?

π0.7 is a Vision-Language-Action (VLA) AI model designed to control robots. Its key feature is compositional generalization — the ability to combine previously learned skills to perform tasks the system was never directly trained on.

Who founded Physical Intelligence?

Physical Intelligence was founded by former Google engineers. The startup focuses on developing AI models for robotics applications.

Can π0.7 control robots it wasn't trained on?

Yes. During testing, π0.7 controlled a UR5e robot to fold t-shirts even though it had no laundry-folding data for that platform. This cross-robot transfer is one of the model's emergent capabilities.

What is compositional generalization in robotics?

Compositional generalization refers to a model's ability to combine previously learned skills to solve new, unfamiliar tasks. It is a key capability of large language models but has long been elusive in robotics.

How does π0.7 accept instructions?

The model accepts three types of input: natural language text instructions, metadata like speed and quality parameters, and visual subgoal images showing expected outcomes at each step. Some subgoals can be generated automatically at runtime by a lightweight world model.
