The Last Mile Is the Job in the AI Era

Most conversations about AI agents focus on how much work the agent can do. Can it write the code, generate the deck, analyze the dataset, triage the support inbox, or automate the outreach sequence?

That framing is useful, but it misses the deeper shift. The point is not that AI agents are failing. It is that in the AI era, the highest-value part of many jobs increasingly lives in the last mile: judgment, correction, integration, accountability, and the ambition to raise the bar once basic execution becomes cheaper.

The most important question is not how much of the task an agent can start. It is whether the system can reliably finish the job.

In practice, many agents can already do 80 to 90 percent of a workflow. That is real progress. But the remaining 10 to 20 percent often contains a disproportionate share of the value and the risk. This is the stretch where someone still has to review the output, catch subtle mistakes, decide whether the work is actually good enough, fit it into a larger system, and take accountability for shipping it.

That final stretch is the last mile. And I increasingly think it is not just a leftover problem. It is becoming the job.

The dangerous illusion of 90 percent automation

A lot of AI demos look incredible because they show the part of the workflow that is easiest to appreciate visually. The agent reasons through a problem, opens tools, generates an answer, and appears to complete the task end to end. In a demo, this feels close enough to autonomy.

But production is not a demo.

A coding agent can write a lot of code and still leave the hardest part undone: code review, security review, production readiness, observability, rollback planning, and ownership after deployment.

A research agent can summarize papers and draft a literature review, but someone still has to verify whether the citations are real, whether the novelty claim holds up, whether the framing is honest, and whether the argument survives external scrutiny.

A sales agent can automate outreach and qualification, but closing the loop still requires customer judgment, trust building, negotiation, and context that rarely fits neatly into a prompt.

This is why partial automation gets overvalued. Doing 90 percent of a task is worth little if the remaining 10 percent determines whether the result can actually be used.

Why the last mile is the hardest part

The last mile is not just the leftover work. It is usually the highest-consequence work.

This is where judgment shows up. This is where taste matters. This is where the system encounters the edge cases that were invisible in the happy path. This is where a low-quality output can quietly become an expensive problem.

In other words, the last mile is where teams have to answer the questions that models still cannot answer reliably:

Is this actually correct?
Is it safe enough to ship?
Does it fit the broader context?
Did we miss a failure mode?
Is this good, or does it just look finished?

The people who can answer those questions create the real value. Their job shifts from manual execution toward adjudication: reviewing, correcting, integrating, escalating, and deciding when something is done.

That is also why so many arguments about AI replacing jobs miss the point. The issue is not whether a model can do most of the visible work. The issue is whether anyone can trust the system to complete the entire job at the level that reality demands.

Better AI does not eliminate the last mile. It creates a new one.

There is a common assumption hiding underneath a lot of automation discourse: once agents go from 80 percent to 95 percent to 99 percent task completion, the remaining human work will disappear.

I do not think that is how this plays out in real organizations.

When tools get better, expectations rise with them.

If an engineering team can produce more code, the bar for product quality, reliability, testing, and iteration speed goes up. If an analytics team can generate dashboards faster, the organization asks for deeper analysis, tighter decision loops, and more customized insights. If a design workflow gets partially automated, customers do not reward you for doing less design work. They expect better products.

This has been true across waves of software tooling. Better tools rarely make the work simpler in aggregate. They expand the frontier of what counts as acceptable work.

So today’s 99 percent solution often becomes tomorrow’s 50 percent solution, because the target moved.

That is why AI should not only be understood as an efficiency engine. It is also an ambition engine. When execution gets cheaper, the rational response is not just to defend the old scope of work. It is to ask what becomes possible now that used to feel out of reach.

AI raises the ambition frontier

This is where Garry Tan’s “boil the ocean” framing feels relevant.

In normal times, ambition gets constrained by labor, coordination cost, and execution bottlenecks. Teams are told not to boil the ocean because the ocean is simply too large for the available bandwidth.

But AI changes that equation.

When a team can explore ten product directions instead of two, inspect far more customer feedback, automate large parts of research and implementation, or compress weeks of execution into days, the right response is not to do the same work a little more cheaply. The right response is to raise ambition.

That is the deeper meaning behind the idea that today’s 99 percent solution becomes tomorrow’s 50 percent solution. Once the floor rises, the ceiling moves too. Organizations no longer compete on whether they can produce the old output more efficiently. They compete on whether they can use these new tools to build something significantly better.

This is the part that many cost-cutting narratives miss. AI does not just shrink the labor required for known work. It expands the set of projects, products, and standards that become rational to pursue. It pushes teams toward more ambitious builds, more advanced systems, richer customer experiences, and faster loops between idea and execution.

And that makes the last mile more important, not less. The more ambitious the system, the more important judgment becomes. Bigger scope, faster execution, and more powerful agents increase the need for strong review, strong governance, strong taste, and strong mechanisms for deciding what deserves to ship.

The strategic implication: the moat moves from generation to adjudication

If this is true, then the next wave of AI agent infrastructure will not be defined only by who has the most capable model or the longest autonomous workflow demo.

It will be defined by who is best at closing the gap between generated work and accepted outcomes.

That gap is where the real infrastructure lives:

verification systems that check whether outputs are trustworthy
routing systems that decide when to pass, retry, escalate, or stop
evidence capture that explains why the system believes an output is acceptable
audit trails that make the process legible after the fact
deployment and integration gates that turn artifacts into operational outcomes
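
To make the routing and audit pieces concrete, here is a minimal sketch of what such a layer might look like. Everything in it is an assumption for illustration: the names (Decision, CheckResult, AuditRecord, route, decide_and_record), the retry limit, and the rule that high-risk work always goes to a human are hypothetical, not a description of any particular harness.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PASS = "pass"
    RETRY = "retry"
    ESCALATE = "escalate"
    STOP = "stop"


@dataclass
class CheckResult:
    name: str          # which verification check ran
    passed: bool
    detail: str = ""   # evidence: why the check passed or failed


@dataclass
class AuditRecord:
    attempt: int
    decision: Decision
    evidence: list[CheckResult] = field(default_factory=list)


def route(checks, attempt, max_attempts=3, high_risk=False):
    """Hypothetical routing policy; the thresholds are illustrative only."""
    if not checks:
        return Decision.STOP  # no evidence at all, so nothing can be accepted
    if all(c.passed for c in checks):
        # High-risk work still goes to a human even when every check passes.
        return Decision.ESCALATE if high_risk else Decision.PASS
    if attempt < max_attempts:
        return Decision.RETRY
    return Decision.ESCALATE  # retries exhausted, hand off to a reviewer


def decide_and_record(audit_log, checks, attempt, **policy):
    """Route one attempt and keep the evidence so the decision is legible later."""
    decision = route(checks, attempt, **policy)
    audit_log.append(AuditRecord(attempt, decision, list(checks)))
    return decision
```

The design choice worth noticing is that the decision and its evidence are written together. An accepted output with no record of why it was accepted is exactly the gap this layer is meant to close.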

This is why I keep coming back to harness engineering.

The harness is not just a wrapper around the model. It is the operational layer that determines whether an agent can be trusted in real work. It is where completion gets governed.

A strong model can generate a plausible answer. A strong harness determines whether that answer becomes production reality.

Why this matters even more for headless agents

The last mile matters most for headless agents in data and ML workflows.

A chat assistant has a natural escape hatch. If the answer looks suspicious, the human can ask a follow-up question, redirect the conversation, or simply ignore the output.

A headless agent does not have that luxury. It often operates against live systems, background jobs, production pipelines, schemas, dashboards, data contracts, or model evaluation workflows. There is no conversational cushion between generation and consequence.

That means the last mile gets sharper.

A data agent can produce a SQL query that looks plausible but is subtly wrong. An ML agent can recommend an evaluation change that seems reasonable but quietly breaks comparability across experiments. An analytics agent can generate an insight that sounds compelling but rests on a flawed join or missing cohort assumption.
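
A small, self-contained illustration of the flawed-join case, using Python's built-in sqlite3 module. The tables, values, and the join_fans_out helper are invented for the example. The query looks reasonable, but the join duplicates an order that shipped in two parcels, so the total is silently inflated. A cheap row-count check catches it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: order 1 shipped in two parcels, so it has two shipment rows.
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Plausible-looking agent query: "revenue for shipped orders".
# The join repeats order 1 once per shipment, so its amount is counted twice.
joined_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN shipments s ON s.order_id = o.order_id"
).fetchone()[0]

# Reference value computed without the join.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


def join_fans_out(connection):
    """Return True if the join yields more rows than distinct orders."""
    rows, distinct_orders = connection.execute(
        "SELECT COUNT(*), COUNT(DISTINCT o.order_id) "
        "FROM orders o JOIN shipments s ON s.order_id = o.order_id"
    ).fetchone()
    return rows > distinct_orders


print(joined_total, true_total, join_fans_out(conn))
```

Nothing about the inflated number looks wrong on a dashboard. The check is what turns a plausible answer into one that can be trusted or rejected.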

We see the same pattern in ClaimMind. AI can help structure claim data, suggest coding directions, and reduce a large amount of manual review work, but the output still needs to be matched against hospital rules, payer constraints, and operational policy. The system can narrow the search space and surface likely ICD paths, but a human reviewer still makes the final decision on which claim interpretation and ICD code should actually be used. That final judgment is exactly where organizational accountability lives.

What is interesting is that this boundary is not fixed forever. As the system learns from organizational behavior, review patterns, escalation history, and accepted outcomes, more of that final layer may become automatable. But it only becomes safe to automate because the organization first created the review loop, the rules, and the evidence trail.

In all of those cases, the value of the system depends less on whether it generated something and more on whether the surrounding system can verify, constrain, and appropriately route the result before it is accepted.

This is why headless AI systems should be evaluated on accepted, auditable completion, not just task coverage.

What teams should build now

If the last mile is the real bottleneck, then teams building AI agents should care less about theatrical autonomy and more about completion systems.

A few things become much more important:

1. Verification before celebration

Do not confuse a fluent output with a correct one. Cheap deterministic checks should run wherever possible, and stronger semantic verification should be layered where risk is high.
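
One way to picture the layering, as a rough sketch rather than a recipe: run the cheap deterministic checks first, and only pay for a semantic verifier when those pass and the task is high risk. The function names, the JSON output format, and the semantic_check callable are all assumptions made for this example. In a real system the semantic step might be a second model, a rules engine, or a human reviewer.

```python
import json


def cheap_checks(raw_output, required_keys):
    """Deterministic checks. Returns a list of failure reasons (empty means OK)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = set(required_keys) - set(data)
    return [f"missing key: {key}" for key in sorted(missing)]


def verify(raw_output, required_keys, high_risk, semantic_check=None):
    """Layered verification: cheap checks always, semantic check only if warranted.

    semantic_check is a hypothetical callable that takes the raw output and
    returns a list of failure reasons.
    """
    failures = cheap_checks(raw_output, required_keys)
    if failures:
        return failures  # fail fast and skip the expensive verifier
    if high_risk and semantic_check is not None:
        failures = semantic_check(raw_output)
    return failures
```

A usage example: verify('{"id": 1}', {"id", "amount"}, high_risk=True) returns ["missing key: amount"] without ever calling the semantic verifier.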

2. Explicit escalation paths

Not every task should be forced through autonomy. Teams need clear rules for when an agent should stop, ask for review, or hand off to a human.

3. Evidence, not vibes

If an output is accepted, the system should be able to say why. Confidence without evidence is not enough when real workflows are at stake.

4. Completion-oriented metrics

Measure what actually matters: accepted outcomes, retry success, escalation quality, false accepts, audit readiness, and downstream usability.
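
As a sketch of how a few of these could be computed from a log of finished tasks: the TaskOutcome fields and metric definitions below are my own assumptions, and they cover only the countable ones. Escalation quality, audit readiness, and downstream usability need human judgment or domain-specific signals, so the sketch reports a plain escalation rate instead.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    accepted: bool            # the harness accepted the output
    later_found_wrong: bool   # a downstream review later found it was wrong
    retries: int              # retries used before the final decision
    escalated: bool           # the task was handed to a human


def completion_metrics(outcomes):
    """Summarize a list of TaskOutcome records. Returns {} for an empty list."""
    total = len(outcomes)
    if total == 0:
        return {}
    accepted = [o for o in outcomes if o.accepted]
    retried = [o for o in outcomes if o.retries > 0]
    false_accepts = [o for o in accepted if o.later_found_wrong]
    return {
        # Share of all tasks whose output was accepted.
        "accepted_rate": len(accepted) / total,
        # Share of accepted outputs that later turned out to be wrong.
        "false_accept_rate": len(false_accepts) / len(accepted) if accepted else 0.0,
        # Share of retried tasks that ended in an accepted output.
        "retry_success_rate": (
            sum(o.accepted for o in retried) / len(retried) if retried else 0.0
        ),
        # Share of all tasks that were escalated to a human.
        "escalation_rate": sum(o.escalated for o in outcomes) / total,
    }
```

The false accept rate is the number I would watch most closely. It is the one that measures whether the system's confidence was earned.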

5. Ambition-aware workflows

Do not use AI only to do the old work faster. Use it to explore a larger design space, attempt more valuable outcomes, and raise the standard of what the team is trying to ship.

6. Domain-specific judgment surfaces

The last mile is rarely generic. It shows up differently in engineering, finance, support, healthcare, and analytics. The harness has to reflect the domain, not just the model.

My take

AI shifts human labor from execution to adjudication.

But that is only half the story. AI also shifts the organization from local optimization toward frontier expansion, if leadership is willing to use the capability that way.

That is why I think the highest-leverage teams will not just be the ones that can get an agent to produce an artifact. They will be the ones that can raise ambition and still decide, reliably and at scale, whether an artifact deserves to become an outcome.

That is a different problem from prompting. It is a different problem from model benchmarking. And it is a much more interesting problem than most of the current discourse admits.

The biggest mistake teams can make right now is optimizing for the appearance of autonomy instead of the reality of completion, or optimizing only for efficiency instead of ambition.

The real battle is in the last mile.

Closing line

The last mile is not a bug in AI agents. It is increasingly the job in the AI era.

Sources

Aaron Levie, “The never-ending last mile of work” (X thread)
Garry Tan, “Boil the Ocean”