Where the AI Community is Missing the Mark on Automated Knowledge Work
In the fast-evolving world of artificial intelligence, a crucial debate is taking center stage: should AI systems serve as fully autonomous agents or as assistive copilots? The topic has far-reaching implications for how the world will change over the next five years, but the discussion has been muddied by speculation that misses the nuances of implementing AI-based products in practice.
As an active builder in the AI space, and a veteran of the self-driving industry that experienced similar hype and confusion, I’d like to contribute some insights to help make sense of the AI product rollout. The tools we developed to frame automated driving can help us understand whether AI knowledge workers will look more like Copilots or Agents.
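I’ll focus on three insights in this post, saving some others for future discussion: every automated system has a failure rate, and the cost of a mistake determines its ROI; autonomy is adopted first inside a well-defined Operational Design Domain (ODD); and agents need to reason about uncertainty and ask for help.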
Let’s unpack each of these.
Every automated system, no matter how mature, will have a failure rate. Acknowledging this principle is a vital step to building a good System of Automation, but it’s surprisingly under-discussed in the public conversation about AI.
In safety-critical systems like self-driving vehicles, the cost of a mistake can be fatally high. The world witnessed a proliferation of self-driving startups through the 2010s, but few players remain in the hunt for full self-driving today because it’s very difficult to fully automate when the stakes are so high. It’s relatively easy to build a demo vehicle that steers itself down the street, and much, much harder to approve fully autonomous commercial operations. This is why we’ve had driving copilot features (lane-keeping, emergency braking systems, etc.) for years, but have had to be patient for full self-driving.
The knowledge-working AI agent may not have human lives in its hands when it’s filling out a form or writing a line of code, but its reliability and the cost of its mistakes will be the difference between positive and negative ROI. At Vooma, one of the skills we first built into our AI agent was data-entry for freight. We operate under the assumption that making a mistake in entering critical truckload data has a cost that is 10-100x the magnitude of the task itself. It takes much more effort to track down and fix an error in an invoicing process than it does to enter the data in the first place. We put significant effort into the integrity of our solution because we know that if we don’t, we jeopardize the ROI of the automation we offer.
Simple tasks like data entry are relatively easy to evaluate and track. What happens when the work is less rote, and when mistakes are costly? If a $5Bn lawsuit hangs in the balance, your AI lawyer would need to be at least 99.9% as accurate as a human for it to be a cheaper solution (by expected value) than even a really expensive human lawyer. The cost of a mistake in this example will tend to favor the copilot model, where human productivity is augmented without requiring complete trust in the autonomous system.
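To make the lawsuit arithmetic concrete, here is a minimal sketch of the expected-value comparison. The $5M fee for the human lawyer and the roughly free AI lawyer are illustrative assumptions, not figures from any real case, and the function name is hypothetical.

```python
def max_affordable_accuracy_gap(human_fee, ai_fee, stakes):
    """Largest drop in success probability the cheaper option can have
    before its expected cost exceeds the human's.

    Switching saves (human_fee - ai_fee) upfront, but every point of lost
    accuracy costs `stakes` in expectation, so break-even is their ratio.
    """
    return (human_fee - ai_fee) / stakes


# Assumed numbers: $5Bn at stake, a $5M human lawyer, a roughly free AI lawyer.
gap = max_affordable_accuracy_gap(
    human_fee=5_000_000, ai_fee=0, stakes=5_000_000_000
)
print(f"Affordable accuracy gap: {gap:.2%}")
```

Under these assumptions the AI lawyer can give up only about a tenth of a percentage point of accuracy before the human is the cheaper choice, which is where a figure like 99.9% comes from.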
In the case of the AI software engineer, a mistake could mean a really tricky bug that doesn’t surface for weeks. When it does, it may take a human software engineer 3x longer to orient in the code and root-cause than they would have taken to write the feature itself. Software is often maintained by teams today because having context is efficient. The degree to which software engineers and companies prefer copilots vs agentic tools will reflect their underlying assessments of ROI, which are shaped mainly by the rate of failures and the cost of fixing those mistakes.
There’s no question that AI coding competence is going to improve quickly, likely following an S-shaped curve. We’re about to ride the steep slope up the back of the ‘S’. The last 10% will likely be quite hard. One framework for predicting the shift from copilots to agents as this improvement plays out will be to estimate the point at which the burden saved upfront outweighs the cost of fixing errors.
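As a rough sketch of that break-even point, let $t$ be the time a human would spend doing the task, $p$ the agent's failure rate, and $k$ the factor by which fixing a failure is more expensive than doing the task. These symbols, and the simplification that review time and the human's own error rate are ignored, are assumptions made for illustration.

```latex
\underbrace{t}_{\text{burden saved upfront}}
\;>\;
\underbrace{p \, k \, t}_{\text{expected cost of fixing errors}}
\quad\Longleftrightarrow\quad
p \;<\; \frac{1}{k}
```

With the 3x figure from the software example, the agent breaks even below a failure rate of roughly 33%. With the 10-100x figure from freight data entry, it has to fail less than 10% to 1% of the time.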
Devin can reportedly handle about 13% of software development tasks correctly on its own. Do you know any hiring managers who would extend an offer to a software engineer who can only complete 13% of tasks? How about 50%? Is 90% good enough? The answer, of course, is “it depends”.
It depends on the degree to which errors are correlated. If Devin randomly makes mistakes independent of the task, and you can’t distinguish any patterns that explain when these failures tend to occur, it will be useless for quite a while. For the nerds reading this, I’m referring to the possibility that mistakes may be Independent and Identically Distributed (IID). If, on the other hand, you can start to see patterns and say “it’s really good at X but not Y”, then a useful contour appears around which we can start to trust the AI agent to take on work.
This is the concept of an Operational Design Domain. In self-driving, the ODD comprises the set of conditions under which we trust the robot to function fully autonomously. These may be specific locales and geographies, like the Waymo service zones. They may be weather conditions or times of day or night. ODDs have emerged as vital constructs in the pursuit of autonomous driving, because without them, we’d have no idea where and when the technology can be profitably and safely used.
We will see the same thing with autonomous knowledge workers. Fully autonomous agents will first commercialize in operating conditions that can be easily recognized and systematized. For autonomous software engineers, we may find that this means laying out frontend components, or writing unit tests, or bug-fixes that span a limited number of files. After articulating some patterns that define an ODD, you can intersect them with the activities that are most burdensome and/or valuable, and identify opportunities for PMF. You can also start to refine guardrails and fallback mechanisms within the ODD to harden your agent against the variation that you expect to see.
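One way to start articulating such patterns is to log the agent's outcomes by task category and keep only the categories where it is reliably good. The sketch below is one possible approach, not a description of any particular product. The category names and counts are hypothetical, and the Wilson score lower bound is swapped in here as a conservative estimate so that a category with only a handful of lucky successes is not trusted too early.

```python
import math


def wilson_lower_bound(successes, trials, z=1.96):
    """Conservative (lower) estimate of the true success rate.

    Uses the Wilson score interval, so categories with few trials
    score low even if every trial so far succeeded.
    """
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    z2 = z * z
    centre = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    )
    return (centre - margin) / (1 + z2 / trials)


def propose_odd(outcomes, min_success_rate):
    """Return the task categories whose lower-bound success rate clears the bar."""
    return sorted(
        category
        for category, (successes, trials) in outcomes.items()
        if wilson_lower_bound(successes, trials) >= min_success_rate
    )


# Hypothetical evaluation log: category -> (successes, trials).
outcomes = {
    "unit_tests": (95, 100),
    "frontend_layout": (90, 100),
    "cross_service_refactor": (5, 40),
    "small_bug_fix": (3, 3),  # perfect so far, but too few trials to trust
}
odd = propose_odd(outcomes, min_success_rate=0.80)
print(odd)
```

With these made-up numbers, only `frontend_layout` and `unit_tests` land inside the proposed ODD. `small_bug_fix` is excluded despite its perfect record because three trials are not enough evidence. If failures were truly IID across categories, no category would stand out and this procedure would return nothing useful, which is the point made above.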
AI agents will be adopted most readily when it’s easy to say what they are good at. As engineers and entrepreneurs, this means designing around ODDs.
When mistakes are costly, you don’t want to act the same way whether you’re confident about a decision or not. If you can’t see around a curve, you slow down. If you’re 75% sure that the shadow you see on the road ahead is not a pedestrian, you proceed with caution. If you’re an autonomous vehicle and you have no idea what the object in front of you is, you wait in place while you ask a human to help. Fully automated systems must reason about the trustworthiness of their data and the confidence of their hypotheses.
One of the coolest things you learn if you study State Estimation is that every observation reduces uncertainty. Even a data point from a very noisy sensor reduces your overall uncertainty, as long as your model for that noise is characterized. The lesson here is that if you’re building an autonomous system (or agent), you should never throw away data, and you should reason thoughtfully about your uncertainties.
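For the simplest case, two independent Gaussian estimates of the same quantity, this can be seen in a few lines. The sketch below uses standard inverse-variance weighting (the one-dimensional Kalman update). The function name and the numbers are made up for illustration.

```python
def fuse(mean_a, var_a, mean_b, var_b):
    """Combine two independent Gaussian estimates of the same quantity.

    Precisions (1 / variance) add, so the fused variance is always
    smaller than either input variance, however noisy the second source.
    """
    precision = 1.0 / var_a + 1.0 / var_b
    mean = (mean_a / var_a + mean_b / var_b) / precision
    return mean, 1.0 / precision


# A good estimate (variance 1) fused with a very noisy sensor (variance 100).
mean, var = fuse(mean_a=10.0, var_a=1.0, mean_b=14.0, var_b=100.0)
print(f"fused mean={mean:.3f}, fused variance={var:.3f}")
```

The noisy reading barely moves the estimate, but the fused variance still drops below 1. This only holds when the noise variance you plug in is an honest description of the sensor, which is the caveat about a characterized noise model.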
I’ve seen and made the mistake a number of times of building a system that thresholds a confidence value early and then makes decisions based on downstream transformations of that thresholded value. In almost all cases, propagating the confidence itself downstream has been critical to breaking through an accuracy plateau.
If you propagate and reason thoughtfully about uncertainties, you afford your agent the option of saying “I don’t know” and calling in a human to help. When mistakes are costly, this is almost always better than guessing. It also helps you expand your ODD because you have defined a safe outcome for many of the instances where you may have made a costly mistake.
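A minimal sketch of what this can look like: each step in a pipeline reports a confidence alongside its value, and the final decision weighs the expected cost of a mistake against the cost of asking a human. All names, costs, and confidences here are hypothetical. The sketch also assumes the confidences are calibrated probabilities and that the steps fail independently, which real systems would need to verify.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    value: str
    confidence: float  # assumed calibrated probability that `value` is correct


def decide(steps, mistake_cost, review_cost):
    """Act autonomously only if the expected cost of a mistake is lower
    than the cost of a human review; otherwise say "I don't know"."""
    # Carry every step's confidence to the decision instead of
    # thresholding each step on its own (independence assumed).
    p_correct = math.prod(step.confidence for step in steps)
    expected_mistake_cost = (1.0 - p_correct) * mistake_cost
    action = "act" if expected_mistake_cost < review_cost else "ask_human"
    return action, p_correct


# Two steps that would each pass a per-step threshold of 0.8 ...
shaky = [Estimate("pickup: Dallas, TX", 0.9), Estimate("rate: $1,850", 0.9)]
# ... versus two steps the system is genuinely sure about.
solid = [Estimate("pickup: Dallas, TX", 0.99), Estimate("rate: $1,850", 0.99)]

print(decide(shaky, mistake_cost=100.0, review_cost=5.0))
print(decide(solid, mistake_cost=100.0, review_cost=5.0))
```

The first case escalates to a human even though each step looked acceptable in isolation, because the joint confidence is only about 0.81. The second case proceeds. Thresholding each step at 0.8 and discarding the confidences would have treated both cases identically.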
Agents that reason about uncertainty and ask for help when they need it will engender trust and find useful employment much sooner than ones that don’t. As an aside: Foundation model developers, please invest in more calibrated uncertainty tools! Returning log-probs in API responses is a great start, but I think we can do even better!
Interestingly, this final insight doesn’t support one side over the other in the debate between copilots and agents. It reveals that there actually is no hard line between the two approaches. Agents will need to reason about uncertainty and ask for help, which in a funny way turns them back into copilots.
Automating knowledge work like software engineering will be subject to many of the same risk and ROI considerations that have shaped the development of self-driving vehicles. Eventually we will get to an extremely high level of agentic automation, with ROI determined fundamentally by the cost of a mistake. Improvement will be S-shaped, and the tasks that get automated first will be the ones where the cost of mistakes is lowest. ODDs will emerge to describe the conditions in which agentic automation makes sense over copilot assistance. Agents that take on tasks with costly mistakes will need to be aware of their uncertainties and will be more likely to be copilots for longer.
What we'll likely find, however, is that the distinction between copilots and agents is less meaningful than it seems at first glance, because an agent that asks for help is a type of copilot itself.
If you're interested in learning more about how we're building the AI Agent for logistics, drop us a line!