
DeepFleet Is the Warehouse OS: How Amazon's Multi-Agent Foundation Model Changes Warehouse Economics

DeepFleet · MAPF · Amazon · warehouse · foundation-model

Amazon's DeepFleet paper marks a turning point where robot coordination shifts from "solver performance" to "a compounding operational asset." For warehouse owners, this means the birth of a software layer that determines throughput per square meter, service levels, and the economics of buildings themselves.

Introduction: One Million Robots and the Intelligence on Top

In July 2025, Amazon announced that robots deployed across its global fulfillment centers (FCs) surpassed one million units. The milestone unit was deployed to a Japanese FC, joining a network spanning over 300 facilities worldwide.

But the real shock isn't the number. It's the generative AI model "DeepFleet" announced simultaneously, and the technical paper DEEPFLEET: Multi-Agent Foundation Models for Mobile Robots published on arXiv in August 2025.

Amazon's official announcement positions DeepFleet as an "intelligent traffic management system," claiming it improved robotic-fleet travel time by approximately 10%. The Japanese press release specifically cites "reduced energy usage" as a concrete outcome. Critically, Amazon explicitly states that this model "learns and improves over time."

This article thoroughly dissects the DeepFleet paper's technical content, translates what "10% travel time improvement" concretely means for warehouse economics using flow physics and Little's Law, and presents what warehouse owners should change across design, procurement, and contracts in the DeepFleet era.

What DeepFleet Actually Is — Not a "MAPF Solver" but a "Warehouse OS"

What Amazon Officially Claims

Amazon's official announcement contains two claims: (1) the robot network expanded to 1 million units across 300+ facilities, and (2) DeepFleet, a generative AI foundation model, improved fleet travel time by approximately 10%.

Notably, Amazon explicitly identifies "congestion and travel-time efficiency" as the KPI that DeepFleet moves. This means Amazon itself recognizes that fleet-wide traffic quality, not individual robot performance, is the scalable lever.

Furthermore, Amazon mentions business levers directly relevant to owners: DeepFleet enables "storing more products closer to customers" (network design reconfiguration) and "faster delivery and cost reduction." This suggests that robot traffic optimization extends beyond floor movement into inventory placement and network design.

What the Paper Technically Defines

The DEEPFLEET paper defines the model as "a suite of multi-agent foundation models trained on position, destination, and interaction data from hundreds of thousands of robots."

Two design choices are particularly important for practitioners.

First, the warehouse is modeled as a graph. The paper formalizes the warehouse layout as a directed graph G = (V, E), where vertices V represent discrete locations and edges E represent admissible moves. Robots traverse this graph to execute tasks.

This means layout design, one-way constraints, chokepoints, and staging design become "first-class objects" in the model — not afterthought parameters, but the model's input structure itself.
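
As a minimal illustration (not the paper's code), the formalization can be as simple as a directed graph whose nodes are floor markers and whose edge set encodes one-way constraints; the toy layout below uses networkx and is entirely invented:

```python
import networkx as nx

# Toy illustration of the G = (V, E) formalization: vertices are discrete
# floor locations, directed edges are admissible moves, and a one-way aisle
# is simply a row where the reverse edge is omitted. Layout is invented.
G = nx.DiGraph()
cells = [(r, c) for r in range(2) for c in range(3)]   # a 2x3 patch of floor markers
G.add_nodes_from(cells)

for r, c in cells:
    if c + 1 < 3:
        G.add_edge((r, c), (r, c + 1))            # eastbound allowed everywhere
        if r == 0:
            G.add_edge((r, c + 1), (r, c))        # westbound only in row 0: row 1 is one-way
    if r + 1 < 2:
        G.add_edge((r, c), (r + 1, c))            # cross-aisle moves in both directions
        G.add_edge((r + 1, c), (r, c))

# Admissible next moves for a robot standing at vertex (1, 0)
print(sorted(G.successors((1, 0))))               # [(0, 0), (1, 1)]
```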

Second, the learning target is "forecasting with operational intent." The paper positions multi-agent forecasting as a pretraining objective, with learned representations serving as foundations for downstream tasks such as congestion forecasting, adaptive routing, and proactive rescheduling.

Why "OS" Rather Than "MAPF Solver"

Multi-Agent Path Finding (MAPF) is defined as simultaneously planning collision-free paths for multiple agents. Search-based solvers like CBS (Conflict-Based Search), PIBT (Priority Inheritance with Backtracking), and LaCAM have advanced this field.

The DeepFleet paper explicitly positions itself in this MAPF context but pursues a fundamentally different center of gravity: foundation modeling using massive real-operational datasets, with forecasting as the core pretraining task and downstream optimization as the application domain.

Amazon's own Science explainer makes the strategic reason explicit: simulating the interactions of thousands of robots faster than real time is "prohibitively resource intensive," whereas a learned model can infer traffic patterns quickly. The team frames location prediction as a pretraining objective analogous to next-word prediction, the simple task that yields general language competence.

For warehouse owners, this distinction fundamentally changes the nature of the investment.

In classic deployments, "coordination quality" was bounded by vendor solver quality + site-specific tuning cycles. In a foundation-model deployment, "coordination quality" becomes a function of data volume, data diversity, and the training-deployment iteration loop — which can compound over time.

Four Architectures Reveal "What Actually Works"

Systematic Design Space Exploration

DeepFleet isn't one model but four architectures with deliberately different inductive biases, designed and compared systematically.

Robot-Centric (RC) Model — 97M Parameters

Event-based, asynchronous. Predicts each robot's next action from an ego-centric view with local context (nearest 30 robots, 100 markers, 100 objects). Coordinates are normalized for translational and rotational invariance. Trained on approximately 5 million robot-hours.

The operationally critical element is the rollout rule: when applying predicted actions, each robot reserves the set of vertices needed for movement. If needed vertices are already reserved, the robot switches to a wait action.

With shared weights, computation scales linearly with fleet size.
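
A simplified sketch of that rollout rule, with all names and data structures our own rather than the paper's interface:

```python
from dataclasses import dataclass

# Simplified sketch of the RC rollout rule described above: a robot's predicted
# move is applied only if the vertex it needs is not already reserved;
# otherwise the robot falls back to a wait action. Names are illustrative.

@dataclass
class Robot:
    rid: int
    vertex: tuple          # current floor vertex
    next_vertex: tuple     # move proposed by the learned policy

def rollout_step(robots):
    reserved = {r.vertex for r in robots}    # every robot occupies its current vertex
    actions = {}
    for r in robots:                         # shared weights: cost scales linearly with fleet size
        if r.next_vertex in reserved:
            actions[r.rid] = "WAIT"          # needed vertex already reserved: wait
        else:
            reserved.add(r.next_vertex)      # reserve the target vertex
            actions[r.rid] = ("MOVE", r.next_vertex)
    return actions

robots = [Robot(0, (0, 0), (0, 1)), Robot(1, (1, 1), (0, 1))]
print(rollout_step(robots))   # robot 0 moves; robot 1 waits on the contested vertex
```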

Robot-Floor (RF) Model — 840M Parameters

Fixed time-interval, synchronous. Cross-attention between robot tokens and floor tokens allows each robot to "see" the entire floor. Trained on approximately 700,000 robot-hours.

Image-Floor (IF) Model — 900M Parameters

Fixed time-interval, synchronous. Represents the warehouse floor as a multi-channel image, using CNN and Transformer video prediction to forecast the next frame (floor state). Trained on approximately 3 million robot-hours.

Graph-Floor (GF) Model — 13M Parameters

Event-based, asynchronous. Represents the warehouse as a spatiotemporal graph with message passing and edge-conditioned self-attention. Inference uses deterministic floor dynamics updates and collision arbitration (when two robots claim the same vertex, the higher-confidence claim wins; the other rolls back). Trained on approximately 2 million robot-hours.

From an owner's perspective, GF's significance is that "topology is the product." Layouts creating chokepoints and brittle graph structures will systematically limit achievable performance regardless of robot count.

Performance Comparison: Parameter Efficiency Tells the Story

The paper evaluates 60-second rollout predictions on test data from 7 warehouse floors not used in training, over 7 days.

| Model | Parameters | DTW Position (m) | DTW State | DTW Timing | CDE Congestion (%) |
| --- | --- | --- | --- | --- | --- |
| RC (Robot-Centric) | 97M | 8.68 | 0.11 | 14.91 | 3.40 |
| RF (Robot-Floor) | 840M | 16.11 | 0.23 | 6.53 | 9.60 |
| IF (Image-Floor) | 900M | 25.02 | 1.58 | 48.29 | 186.56 |
| GF (Graph-Floor) | 13M | 10.75 | 0.75 | 21.35 | 14.22 |

DTW (Dynamic Time Warping) distance measures the average distance between predicted and actual trajectories after optimal time alignment. Position DTW units are in meters.
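
For intuition, here is a minimal textbook DTW implementation for short position traces; the paper's exact alignment and averaging choices may differ:

```python
import numpy as np

# Minimal dynamic-time-warping distance between a predicted and an actual
# trajectory (2-D positions in meters), using the standard DP recursion.
def dtw_distance(pred, actual):
    n, m = len(pred), len(actual)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(pred[i - 1]) - np.asarray(actual[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / max(n, m)   # crude length normalization, for illustration only

pred   = [(0, 0), (1, 0), (2, 0), (3, 0)]
actual = [(0, 0), (0.8, 0), (2.1, 0), (3, 0)]
print(round(dtw_distance(pred, actual), 3))
```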

CDE (Congestion Delay Error) is a paper-specific metric of particular practical importance. It is built on the congestion-delay fraction, the proportion of travel time during which robots are delayed by other robots:

\text{congestion delay} = \frac{t_{\text{total}} - t_{\text{free-flow}}}{t_{\text{total}}}

where t_total is the actual travel time and t_free-flow is the counterfactual travel time if no other robots existed. CDE reports the relative error between the predicted and observed values of this fraction. As a worked example, a mission that takes 120 seconds on a busy floor but would take 100 seconds on an empty one has a congestion-delay fraction of (120 − 100) / 120 ≈ 17%.

Four Critical Findings

Finding 1: Propagating local interactions beats providing global context

The 97M-parameter RC dominates the 840M and 900M models. Providing complete spatial context to every robot (RF/IF style) is an inefficient use of parameters.

Finding 2: Image-based representations are unsuitable for robot coordination

The 900M-parameter image model produced a catastrophic CDE of 186.56%. Convolutions that treat each robot as a single pixel cannot capture robot-to-robot interactions.

Finding 3: Graph structure yields remarkable parameter efficiency

With only 13M parameters, the GF model achieves competitive performance, second only to RC. Directly encoding the warehouse's physical topology drastically reduces what needs to be learned.

Finding 4: Event-based × action prediction × deterministic arbitration is the winning pattern

The two winning models (RC, GF) share three design choices:

  1. Event-based asynchronous updates (real robots don't synchronize)
  2. Predicting "next action" and updating state with a deterministic environment model (guaranteeing physical consistency)
  3. Vertex reservation or confidence-based collision arbitration (bridging learned policy and safe execution)

Scaling Laws: "Improving with Data" Is Not Wishful Thinking

The paper experimentally demonstrates that scaling laws hold for robotics too.

For GF, clear power laws over two orders of magnitude are derived from isoFLOP curves, predicting that at a compute budget of 10^{22} FLOPs, the optimal configuration is approximately 1 billion parameters trained on approximately 6.6 million floor episodes.
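
To make "clear power laws" concrete, the sketch below fits a power law to synthetic (model size, loss) points with a log-log linear fit, which is how scaling exponents are typically extracted; the numbers are invented for illustration and are not the paper's:

```python
import numpy as np

# Illustrative only: fit  loss ≈ a * N^b  to synthetic (parameter count, loss)
# pairs. A straight line in log-log space over ~2 orders of magnitude is what
# "clear power laws" means in practice. Numbers are made up, not the paper's.
params = np.array([1e6, 1e7, 1e8, 1e9])        # model sizes
loss   = np.array([0.90, 0.62, 0.43, 0.30])    # synthetic losses

b, log_a = np.polyfit(np.log(params), np.log(loss), 1)   # slope = exponent
print(f"fitted exponent: {b:.3f}, prefactor: {np.exp(log_a):.3f}")
```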

This is the technical backbone behind Amazon's public message that the system "learns and improves over time." It's not wishful thinking — it's a prediction backed by power laws.

Translating "10% Travel Time Improvement" into Warehouse Economics

The Right Mental Model: Flow Physics, Not Robot Count

The biggest misconception warehouse owners fall into is assuming "more robots = linearly more throughput."

Reality says otherwise. In real automated material-handling systems, adding vehicles doesn't guarantee linear throughput improvement; the system can saturate or even deteriorate under interference and congestion.

The DeepFleet paper explains this phenomenon: "coupling among hundreds of agents produces emergent phenomena — congestion, deadlocks, traffic waves — that delay robot missions." The "traffic layer" is a primary determinant of the saturation point.

Economic Translation via Little's Law

Translating "10% travel time improvement" into concrete business numbers doesn't require sophisticated financial models. Little's Law from queueing theory suffices:

L = \lambda \times W

In steady state, average items-in-system (L) equals throughput (λ) times average time-in-system (W).

For warehouse owners: With constant WIP, reducing effective travel/congestion delay (a component of WW) supports higher throughput (λ\lambda) with the same WIP. Conversely, the same throughput can be maintained with fewer robots and less floor space.
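
A back-of-the-envelope sketch with assumed numbers makes the trade-off explicit:

```python
# Back-of-the-envelope only: Little's Law  L = lambda x W  with assumed numbers,
# showing the two ways a 10% cut in time-in-system (W) can be banked.
wip_robots     = 1000     # L: robots concurrently carrying work (assumed)
time_in_system = 600.0    # W: seconds per mission, congestion included (assumed)

throughput_before = wip_robots / time_in_system           # lambda = L / W
throughput_after  = wip_robots / (0.9 * time_in_system)   # same WIP, 10% less W
print(f"throughput: {throughput_before:.2f} -> {throughput_after:.2f} missions/s "
      f"(+{throughput_after / throughput_before - 1:.1%})")

# Or hold throughput fixed and take the gain as less WIP (fewer robots in flight):
wip_after = throughput_before * (0.9 * time_in_system)
print(f"robots in flight at unchanged throughput: {wip_after:.0f}")   # 900
```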

Thinking in Concrete Numbers

In warehouse operations, order picking accounts for up to 55% of total operating costs, and travel accounts for more than half of picking time.

For a warehouse operating 1,000 robots:

| Item | Value |
| --- | --- |
| Daily operating cost per robot | $35 |
| Monthly operating cost | ~$1M |
| Same-throughput scenario: 10% reduction | Equivalent to ~100 fewer robots (~$1.3M/year savings) |
| Same-fleet scenario: 10% reduction | Peak capacity increase → service-level improvement |

At Amazon's scale, the impact is orders of magnitude larger. The 10% improvement across 1 million robots must be understood in the context of Amazon's projected $12.6 billion in automation savings from 2025-2027.

Technical Deep Dive — Design Requirements for Production-Grade Coordination

The "Safety Bridge": Running Learned Policies in the Real World

Often overlooked but operationally critical in the DeepFleet paper: the deterministic arbitration layer.

  • RC Model: Each robot reserves vertices along its predicted movement path. Robots attempting to access already-reserved vertices switch to waiting
  • GF Model: When two robots claim the same destination vertex, the higher-confidence claim wins and the other is rolled back

No matter how accurate the foundation model's predictions, without collision safety guarantees, real warehouses cannot adopt it. DeepFleet's design explicitly incorporates this bridge.
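
A hedged sketch of the GF-style confidence arbitration, with illustrative data shapes rather than the paper's interface:

```python
# Sketch of confidence-based arbitration as described for the GF model:
# when two robots claim the same vertex in the same step, the claim with
# higher model confidence wins and the loser is rolled back to a wait.
# The claim format and names are illustrative, not the paper's interface.

def arbitrate(claims):
    """claims: list of (robot_id, target_vertex, confidence)."""
    winners, losers = {}, []
    for rid, vertex, conf in sorted(claims, key=lambda c: -c[2]):
        if vertex in winners.values():
            losers.append(rid)          # vertex already granted to a higher-confidence claim
        else:
            winners[rid] = vertex
    return winners, losers

claims = [(0, ("aisle3", 7), 0.91), (1, ("aisle3", 7), 0.74), (2, ("aisle5", 2), 0.88)]
print(arbitrate(claims))   # robots 0 and 2 move; robot 1 is rolled back and waits
```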

Asynchronous vs Synchronous: Real Robots Don't Synchronize

Top-performing RC and GF use event-based asynchronous updates; lower-performing RF and IF use fixed time-interval synchronous snapshots.

This isn't coincidental. In real warehouses, some robots are moving while others are loading/unloading. Not all robots take their next step simultaneously.

Action Prediction + Environment Model: Correct Causal Decomposition

The "predict action, update state with deterministic environment model" approach used by RC and GF has two clear advantages over IF's "directly predict floor state" approach:

  1. Simplified learning target: Robot actions are discrete with clear constraints; resulting state changes can be deterministically computed from physics
  2. Long-horizon rollout stability: State prediction accumulates prediction errors into non-physical states; action prediction lets the environment model always guarantee physical consistency

Graph Model's "Topology Is the Product" Message

The fact that the GF model achieves competitive performance with just 13M parameters is a direct message to warehouse owners:

The warehouse's graph structure (topology) itself defines the ceiling of achievable performance.

Layouts creating chokepoints, brittle one-way constraints, and staging design errors structurally suppress coordination performance ceilings regardless of robot count or AI sophistication.

What Warehouse Owners Should Demand in the DeepFleet Era

Treat "Robot Traffic" as Infrastructure, Not a Feature

Requirement 1: A standard congestion metric separating "free-flow time" from "interference time"

Without separating "time robots move unimpeded" from "time lost to congestion and interference," the improvement lever can't be measured and ROI can't be evaluated.
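
In practice this can start as simple telemetry arithmetic; the field names and the free-flow estimate below are assumptions about your own data, not a standard:

```python
# Sketch of the baseline metric argued for above: split each mission's time
# into a free-flow component (what the route would take with no other robots)
# and an interference component (everything on top of that).

missions = [
    {"mission_id": "m1", "actual_s": 132.0, "free_flow_s": 110.0},
    {"mission_id": "m2", "actual_s":  95.0, "free_flow_s":  94.0},
    {"mission_id": "m3", "actual_s": 201.0, "free_flow_s": 150.0},
]

total_actual = sum(m["actual_s"] for m in missions)
total_free   = sum(m["free_flow_s"] for m in missions)

interference_share = (total_actual - total_free) / total_actual
print(f"interference share of travel time: {interference_share:.1%}")  # the lever a traffic layer moves
```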

Requirement 2: Demonstrated handling of emergent congestion at target density

Emergent congestion patterns and deadlocks that only manifest at higher robot densities are the primary failure modes causing manual resets and throughput loss. Demand stress test results from vendors at target operational density.

Requirement 3: A credible update loop

Amazon explicitly emphasizes "learning and improving over time." "Set-and-forget traffic rules" cannot remain competitive on the same axis.

Shift Design and CapEx Priorities to "Data Quality" and "Topology Quality"

Topology Quality: Chokepoints, staging design, and one-way constraints are graph constraints that shape congestion formation. Conduct graph analysis (identifying bottleneck nodes, evaluating alternative path redundancy) at the design stage.
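
As one possible starting point (assuming a networkx graph of the layout), betweenness centrality flags candidate chokepoints and node connectivity between zones gives a crude redundancy score; the toy layout is invented:

```python
import networkx as nx

# Illustrative design-stage analysis on a toy layout graph: betweenness
# centrality surfaces candidate chokepoint vertices; node connectivity between
# two zones counts independent routes (a crude redundancy score).
G = nx.grid_2d_graph(4, 6)        # stand-in for a floor layout graph
G.remove_node((2, 3))             # a blocked cell narrows one passage

chokepoints = sorted(nx.betweenness_centrality(G).items(), key=lambda kv: -kv[1])[:3]
print("highest-betweenness vertices:", chokepoints)

inbound, outbound = (0, 0), (3, 5)            # e.g. inbound dock vs. pack-out zone
print("independent routes between zones:", nx.node_connectivity(G, inbound, outbound))
```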

Data Quality: Before investing in robot hardware, invest in infrastructure that structurally accumulates robot movement data — this is the CapEx priority of the DeepFleet era.

Evolve Contracts from Static SLAs to Improvement Clauses

  • Baseline + measured improvement cadence: Build quarterly "congestion-delay reduction targets" into contract terms
  • Clear fault domain definitions: Define who owns the delay between "traffic coordination layer," "WMS/WES," and "robot OEM controllers"

Japan Market: Special Conditions and Opportunity

  • Deepening labor shortage: Following the "2024 Problem" (the overtime cap on truck drivers that took effect in April 2024), logistics labor shortages are worsening. Throughput improvement through existing fleet traffic quality is one of the few means of capacity expansion without additional headcount
  • High-density operations: Japanese warehouses have higher processing density per area than Western counterparts, meaning structurally higher congestion and deadlock probability — and larger improvement potential
  • Multi-vendor reality: Mixed-vendor robot environments are common, maximizing the value of vendor-agnostic coordination layers

What Is Proven and What Remains Unverified

Publicly Proven

  • Architecture comparison insights: Systematic comparison of 4 architectures demonstrating that local interaction structure + event-based updates + action prediction is most effective
  • Scaling behavior: Power-law scaling experimentally confirmed over 2 orders of magnitude for GF
  • Forecasting accuracy: DTW and CDE metrics on 60-second rollouts across 7 unseen warehouse floors

Publicly Unproven

  • Closed-loop control benchmark: No quantitative evaluation of the full "predict → control → execute → improve" feedback loop on real facilities
  • Causal decomposition of "10% improvement": No detailed breakdown of which DeepFleet elements contribute what portion
  • Generalization limits: Performance on non-Amazon layouts and non-Amazon robot types is untested
  • Long-term scaling effects: No published data showing improvement curves over time to back "improves over time" claims

Rovnou: Building "Air Traffic Control" for Warehouse Robot Fleets

The Market Gap DeepFleet Illuminates

Amazon builds foundation models from 1 million robots and hundreds of warehouses to optimize its own fleet. But roughly 80% of the world's warehouses have no automation at all, and the vast majority of automated warehouses aren't Amazon's.

Warehouses with mixed-vendor robots, insufficient operational data, and no AI specialists — these are the warehouses that need vendor-agnostic coordination layers most.

Rovnou is building "air traffic control for warehouse robot fleets." With MAPF algorithms at our core, we deliver deadlock prevention, throughput optimization, and vendor-agnostic robot coordination.

Design Principles Learned from DeepFleet

  • Local information propagation > direct global context: Robots deciding from neighborhood situations and propagating information is more efficient
  • Event-driven > fixed timestep: Naturally representing asynchronous robot movement
  • Action prediction + deterministic environment model: Balancing prediction stability and physical consistency
  • First-class graph structure utilization: Directly encoding topology into the model
  • Explicit safety bridge design: Bridging the gap between learned policies and collision avoidance with deterministic arbitration

Conclusion: "Robot Traffic" Is the New Warehouse Infrastructure

The most fundamental message from the DeepFleet paper, in one sentence:

"It is not the number or type of robots, but the traffic quality between them that determines a warehouse's effective capacity. And that traffic quality is a software layer that can be continuously improved with the right design and data."

Actions you can take today:

  1. Measure current free-flow time vs interference time (establish your baseline CDE-equivalent metric)
  2. Conduct graph analysis of your layout (visualize chokepoints and path redundancy)
  3. Build your telemetry infrastructure (data foundation for future learning-based coordination)
  4. Incorporate improvement clauses into vendor contracts (shift from static SLAs to dynamic improvement cadence)

DeepFleet heralds the dawn of the "foundation model era" in warehouse robotics. The first to benefit will be warehouse owners who start preparing today.