After my last article on parallels between working with AI agents and managing teams, new power-user patterns for using agentic tools like Claude Code, Codex and Amp have emerged. I like Andrej Karpathy's phrase "some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it" and agree that a helpful framing is the recognition of a higher "layer of abstraction to master".

In particular, users of AI agents (most commonly software engineers, but increasingly any digital knowledge worker) are most effective when they think beyond situation-specific prompting and consider the bigger project-level picture, asking how the 'harness' (i.e. the inputs, outputs, feedback loops and control mechanisms) which wraps around the LLM can be optimised over time to deliver the biggest productivity boost. Over the last couple of months, the most powerful AI agents have been the ones equipped to control downstream services and to operate within a feedback loop which compounds their knowledge over time.

Note: I expect this article will rapidly go stale (and it's already a bit behind thanks to the Christmas break) so what follows is my perspective on interesting themes. I'll focus on the aspects of agentic workflows which seem most relevant for knowledge workers (but naturally will use software engineering examples).

Not everyone knows how to write a good prompt, but that's ok

As I previously wrote, we should treat AI agents roughly as we would treat a member of our team. Since communication is the main interface between us and them, developing intuition for how to effectively prompt is still essentially a prerequisite. When training people on how to prompt, I encourage 'metaprompting' with tricks like "give me 5 potential answers" or "summarise our chat so far and suggest 3 next steps". But while thinking creatively about prompts can elicit valuable responses, the IQ Bell Curve meme (and the xkcd over-engineering comic) applies here: the most important thing is to set a clear direction and execute without getting lost in extensive prompt-writing.
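The 'metaprompting' tricks above amount to reusable templates rather than bespoke prose. A minimal sketch (the template names and wording here are illustrative, not from any particular tool):

```python
# Minimal sketch of metaprompting: wrap a short task description in a
# reusable template instead of hand-crafting every prompt from scratch.
# The template names and wording are illustrative.
METAPROMPTS = {
    "options": "Give me 5 potential answers to the following, then recommend one:\n{task}",
    "recap": "Summarise our chat so far and suggest 3 next steps.",
}

def metaprompt(kind: str, task: str = "") -> str:
    """Expand a short task description into a fuller prompt."""
    return METAPROMPTS[kind].format(task=task)

print(metaprompt("options", "How should we structure the data pipeline?"))
```

The point is not the templates themselves but that the effort stays bounded: set direction, execute, and resist the urge to polish the wrapper.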

This is especially true if you're bullish on LLMs (or other architectures) and buy into the idea that models will get smarter and better able to deduce user intentions. In that scenario, it becomes even less important to meticulously craft a domain-specific prompt (or to prompt at all), and more important to control what actions the agent can take to complete the task in question.

Skilled agents allow users to focus on higher-level tasks, but limitations exist

When it comes to equipping agents with tools to make them more effective at their work, Skills seem to be winning as the stickiest concept. As 'lazy-loaded prompt engineering', they are much cheaper (in terms of tokens) than MCP servers, and easier to create.
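The token economics of 'lazy-loaded prompt engineering' can be sketched as follows: only a one-line description of each skill sits permanently in the agent's context, and the full instructions are read from disk when the agent decides a skill is relevant. The directory layout and file names below are hypothetical, not any vendor's exact spec:

```python
from pathlib import Path
import tempfile

# Illustrative sketch of lazy-loaded skills. Only one line per skill lives
# in the system prompt; full instructions load on demand. The directory
# layout and file names are hypothetical.
SKILLS_DIR = Path(tempfile.mkdtemp()) / "skills"

# Set up a demo skill on disk.
demo = SKILLS_DIR / "fine-tune"
demo.mkdir(parents=True)
(demo / "DESCRIPTION.txt").write_text("Fine-tune small open models on a dataset")
(demo / "SKILL.md").write_text("# Fine-tune\n1. Pick a base model...\n")

def skill_index() -> str:
    """Cheap: one line per skill, included in every system prompt."""
    return "\n".join(
        f"- {s.name}: {(s / 'DESCRIPTION.txt').read_text().strip()}"
        for s in sorted(SKILLS_DIR.iterdir())
    )

def load_skill(name: str) -> str:
    """Expensive: the full instructions enter the context only on demand."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()

print(skill_index())
```

Compared with an MCP server, there is no long-running process and no per-tool schema sitting in the context - just files, which is also why Skills are so easy to create.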

Hugging Face Skills are a really cool example. They give LLM agents the ability to perform AI engineering tasks (e.g. "Fine-tune Qwen3-0.6B on the open-r1/codeforces-cots dataset for instruction following"). This is a clear example of using agents to climb the ladder of abstraction and focus on direction rather than details. This specific example also hints at Recursive Self-Improvement (discussion of which is beyond the scope of this article).

There are interesting commercial implications when we consider agents that are becoming ever more capable at consuming existing technology solutions, or developing new ones. Hugging Face is built as an ecosystem, and Hugging Face Skills are complementary to driving engagement within it. But making AI agents generally capable at software development points to a collapse in demand for SaaS tools in favour of ephemeral software, if "many things I'd think to find a freemium or paid service for I can get an agent to often solve in a few minutes". It has been argued that a lot of commercial SaaS tools are basically CRUD apps with a sprinkling of "simple domain logic"; could standardisation of this category into a "compact universal representation" enable not just the on-demand generation of software tools, but also their integrations, migrations, documentation and marketing? The trend towards ephemeral software correlates with the rise of uv (and its ephemeral environments) in the Python world - which has itself been packaged up into an agentic Skill like the HF example.
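The ephemeral-software idea is easiest to see with uv's support for inline script metadata (PEP 723): an agent can emit a single self-contained file and `uv run` it in a throwaway environment. The script below is a hypothetical example of such a one-off tool; it happens to need only the standard library, but any dependencies listed in the header would be installed on the fly:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = []  # e.g. add "httpx" here and `uv run` installs it on the fly
# ///
"""A throwaway, agent-generated CSV tool: `uv run summarise.py data.csv price`."""
import csv
import statistics
import sys

def summarise(rows: list[dict], column: str) -> dict:
    """Count and average the numeric values in one column."""
    values = [float(r[column]) for r in rows if r.get(column)]
    return {"count": len(values), "mean": statistics.mean(values)}

if __name__ == "__main__" and len(sys.argv) > 2:
    with open(sys.argv[1], newline="") as f:
        rows = list(csv.DictReader(f))
    print(summarise(rows, sys.argv[2]))
```

No project scaffolding, no virtualenv to manage, nothing to maintain afterwards - exactly the category of job one might previously have reached for a freemium SaaS tool to do.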

If empowering an agent with a skillset/toolset is the way to unlock higher-level focus for the user, then we find ourselves in a scenario where the size and complexity of the toolset becomes a bottleneck. In other words, toolset management - how to select, install, review and optimise the choice of tools - is an area ripe for improvement in agentic software.
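One plausible shape for toolset management is a registry that scores each installed tool against the current task and exposes only the top few, keeping the context window small. Everything here - the tool names and the keyword-overlap heuristic - is illustrative; a real harness would more likely use embeddings or an LLM call:

```python
# Illustrative sketch of toolset management: expose only the tools most
# relevant to the current task, instead of all of them. The tools and the
# keyword-overlap scoring are stand-ins for a real relevance model.
TOOLS = {
    "run_tests": "execute the project's test suite and report failures",
    "search_docs": "search internal documentation for a query",
    "deploy": "deploy the current branch to the staging environment",
}

def select_tools(task: str, limit: int = 2) -> list[str]:
    """Rank tools by word overlap between the task and each tool's name/description."""
    task_words = set(task.lower().split())

    def score(name: str) -> int:
        tool_words = set(TOOLS[name].split()) | set(name.split("_"))
        return len(task_words & tool_words)

    return sorted(TOOLS, key=score, reverse=True)[:limit]

print(select_tools("run the failing tests"))
```

The selection step is itself part of the harness, which is exactly why toolset management feels like the next layer to optimise.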

Harnesses are more valuable when customised in terms of skills and UX

The first wave of LLM products (ChatGPT etc) had no real features on which to differentiate beyond the raw capability of their underlying models. But with this growing focus on the skill/tool layer, product offerings are becoming more curated. "A great way to win today is to take that broad stack and narrow it with your opinions". It is clear that understanding how to operate a tailored "AI IDE" (see Stanford CS146S Week 3) is a crucial skill for future users of AI agents.

For example, the coding harness Amp likely picks which LLM to use based on the user's intent and toolset, relieving "decision paralysis" in an "Apple-like" (i.e. curated) experience. It also includes new UX for developers who are rebalancing how much code they write vs. how much code they review. Similarly, LangChain's LangSmith adds UX for debugging the work of complex agents. Oh My OpenCode is another harness customisation layer that I've seen recently.

Of course, the same bitter lesson of prompt engineering applies to harness engineering: it's almost inevitable that users will go down rabbit holes of customisation without getting real value from their systems. But it's also true that there is still lots of potential value to be gained from UX optimisation of tools applied to knowledge- and project-management domains. If companies can curate a slick UX, they will be able to drive adoption of traditionally developer-targeted products in non-traditional (non-technical) user segments.

Harnesses don't need to feel like linear conversations with LLMs

As agentic AI gets adopted, a range of possible workflows has appeared. The workflow can be a multi-phase, iterative conversation in which an agent is directed to 'do the work' (i.e. write the code); but it can also be more useful to treat the AI as a design partner, with the user 'holding the pen'. The AI might even be embedded more tightly into the same UI as the user, triggered on demand (like a canvas copilot).

Other deployment patterns break even further from the user-triggers-LLM-via-computer-screen paradigm.

With great power comes great responsibility

It's worth the reminder: making AI agents more capable, connected and available necessitates controlling how much damage they are able to do. The simplest requirement is managing tool access, but there is also a lot of discussion about managing (or simulating) the filesystem that an agent is operating on.

The LlamaIndex docs put it well: "One way around this problem is to frequently use human-in-the-loop: while this is a high-success strategy (most people can recognize dangerous actions and block them before they happen), it breaks the autonomy that a coding agent should provide. [...] The second way around this is, counterintuitively, to ban the agent from accessing your actual file system, and make it work in a virtualized copy."

Not everyone is convinced that filesystem virtualisation is the right approach: for example, is it better for an agent to make 'working copies' at the service level?
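A minimal version of the virtualized-copy idea can be sketched as: copy the project into a temporary directory, let the agent loose there, then report what changed so a human (or a policy) can review before anything merges back. The `agent` callback below is a placeholder for whatever actually edits the files:

```python
import shutil
import tempfile
from pathlib import Path

# Sketch of running an agent on a virtualized copy of the project rather
# than the real filesystem. `agent` is a placeholder for the real agent;
# here we only demonstrate the copy-and-diff plumbing.
def run_in_sandbox(project: Path, agent) -> list[str]:
    sandbox = Path(tempfile.mkdtemp()) / project.name
    shutil.copytree(project, sandbox)
    agent(sandbox)  # the agent edits only the copy
    # Report which files were changed or added, relative to the original.
    changed = []
    for path in sandbox.rglob("*"):
        if path.is_file():
            original = project / path.relative_to(sandbox)
            if not original.exists() or original.read_bytes() != path.read_bytes():
                changed.append(str(path.relative_to(sandbox)))
    return changed

# Demo with a toy project and a toy "agent" that rewrites one file.
project = Path(tempfile.mkdtemp()) / "proj"
project.mkdir()
(project / "app.py").write_text("print('v1')\n")
changed = run_in_sandbox(project, lambda p: (p / "app.py").write_text("print('v2')\n"))
print(changed)
```

The original tree is untouched no matter what the agent does, which is the property the LlamaIndex quote is after - autonomy without handing over the real filesystem.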

What is clear, however, is that agents enable long-term success when they are able to read and write plans and documentation.

Note: I also really liked Simon Willison's "mise en place" metaphor for ensuring that an agent is sufficiently prepared for its task (e.g. plan developed, success metrics set, guardrails in place).

The best systems compound over time

Much of the recent attention paid to optimising agentic harnesses has been on making them "become more ideal for you and the task you are pursuing" over time. The Every blog (somewhat pretentiously) claims that delivery difficulty consistently increases during a traditional software engineering project because of insufficient focus on "compound engineering", i.e. "helping the whole system learn from successes and failures". The same dynamic regularly occurs in and after consulting projects, where there is never enough time for rigorous retrospectives and the codification of reusable assets (templates, processes, automation, etc.). So it makes a lot of sense to want our systems to "learn" automatically; in other words, "the future of (coding) agents is memory". The aforementioned Hugging Face Skills demonstrate an example of this, where "everything gets captured. Everything compounds".

Basic implementations of a memory system include an ever-growing folder of tutorials/notes about a given project (which can be viewed as a more project-specific flavour of generic Skills) or the regular distillation of user preferences from conversation logs (similar to what LLM chatbot providers do with their Memory functionalities).
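The 'ever-growing folder of notes' variant is simple enough to sketch: after each session, append a distilled lesson; before each session, replay the accumulated notes into the context. The file layout is illustrative, and in practice an LLM - not a hard-coded string - would distil the lesson from the session transcript:

```python
from pathlib import Path
import tempfile

# Sketch of a basic compounding-memory loop: lessons are appended after
# each session and replayed into the next session's context. The file
# layout is illustrative; a real system would have an LLM write the lesson.
MEMORY = Path(tempfile.mkdtemp()) / "memory.md"

def remember(lesson: str) -> None:
    """Append a distilled lesson so future sessions start smarter."""
    with MEMORY.open("a") as f:
        f.write(f"- {lesson}\n")

def recall() -> str:
    """Prepend accumulated lessons to the next session's system prompt."""
    notes = MEMORY.read_text() if MEMORY.exists() else ""
    return f"Project notes learned so far:\n{notes}"

remember("The staging deploy needs VPN access; run tests locally first.")
print(recall())
```

Even this naive loop exhibits the compounding property: every session starts from the sum of everything the previous sessions learned, rather than from zero.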

But two other implementations of memory systems (in the sense of writing status updates and plans) have gone viral recently: beads and Agent Mail.

beads

Agent Mail

So what?

AI agents are coming for all knowledge work. Skills and integrations are enabling the application of ostensibly coding agents (Claude Code etc.) to business admin, communications, content creation, home automation tinkering and much more; and deep research agents are being adopted as "thinking partners". Certain aspects of knowledge work - design decisions, stakeholder alignment, etc. - may be a harder unlock than code generation, but they will not remain the exclusive domain of humans for long. Indeed, "managers have been vibe coding forever": as long as task execution happens, the human/managerial domains do not need to be perfect for a project to progress.

If we get the balance wrong, then knowledge workers may become the "reverse centaur" consigned to permanent human-in-the-loop review of AI-generated outputs. But in the meantime, understanding how to manage these alien abstractions is both useful and enjoyable.