Your Product Is Not Ready for Agents

Your product is not ready for agents to be the primary user base.

This is not a non-sequitur, but an immediate future operational problem. Dario Amodei describes the future of AI as a "country of geniuses in a data center," but genius-level intelligence is not a requisite condition for 24/7 automatons to become either massive sources of economic value or a massive burden on the infrastructure you've built.

Over the past month, we have watched this play out in real time. OpenClaw, an open-source agent platform, has become a case study in what happens when autonomous systems interact with infrastructure built for humans. Agents performing "personal AGI" through illusionary society creation. Imagined anti-human sentiment. Flooded security vectors through malicious skills. And very real impacts to real people. What some are calling, hopefully in jest, a hard takeoff.

The core problem is that we are, rationally, building agents to interact with our world through our tools because these are the most accessible set of tools we have. The second-order effect is that those tools were built with human cost in the loop. We forged them with our blood, sweat, and tears because we assumed a level of human support we would be willing to provide. That assumption breaks when hundreds or thousands of newcomers — perhaps non-technical folks using agents as copilots or technical folks with an army of HALs — are all clamoring for the product attention of those still-human laborers. They will gripe about desired features, bugs, and superficial optimizations, and pay none of the externality.

This is a problem economics diagnosed long ago. And while you may try to forget your economics classes, I never recovered from mine, because beside my Data Science degree I got two degrees in Economics. So bear with me.

The tragedy of the commons describes what happens when a shared resource, like open-source software, is finite in practice. People consume more than the system can sustain. Until recently, we assumed the commons of open source was a tight enough community that the maintainers were sufficiently plentiful relative to the consumers to maintain equilibrium (though if you have followed the Rust-versus-C-in-Linux drama long enough, you know this is a serious oversimplification). With agents, we have a genuine problem of scale.

Which brings us to the free-rider problem. Open-source maintainers take accountability for their software. That accountability is a genuine cost of time and mental resources. Agents are inherently free-riders because they cannot take accountability. They can only burden those who do.

AI, more than any technology in human history, has the capacity to increase human flourishing and decrease toil. But the infrastructure to realize those gains does not yet exist. We as the builders of product today, open and closed source, have the opportunity to build software for the version of agentic AI that we want to see.

That infrastructure requires tooling in three areas: human collaboration (getting the right inputs from the people agents are working for), escalation (making it possible for agents to flag genuine blockers without burdening maintainers), and social engineering protections (ensuring that prompt injections and other attack vectors do not manipulate otherwise secure systems). And across all three, I believe the right design philosophy is CLI-first.

Why CLIs

CLIs are the best human tool for agents. This is a claim I will substantiate across the rest of this essay, but the core argument is straightforward.

First, CLIs provide clear methods for getting help. An agent that needs to understand a tool's capabilities runs --help and gets exactly what it needs without loading an entire tool catalog into its context window at initialization. MCP servers, by contrast, require eager-loading of every tool schema (every parameter, every description) whether or not it is relevant to the current task. In a world where agents orchestrate dozens of tools simultaneously, the context window is the scarcest resource, and eager-loading burns it on definitions the agent may never touch.

Second, for multi-agent setups, limiting the literal workspace is much easier than limiting the conversational context. With CLIs, you can use standard Unix mechanisms (PATH, permissions, containers) to structurally constrain what an agent can access. With MCP servers, the boundary is conversational: system prompts and tool filtering. A prompt injection can convince an agent to misuse an MCP tool it already has loaded. It is considerably harder to convince an agent to invoke a binary that is not on its PATH.

Third, CLIs compose. Output pipes into other CLIs. An agent can run a tool, pipe the result through jq, and feed it into another tool, all without framework orchestration. That is the Unix philosophy applied to agent workflows, and it sidesteps the tight coupling that server-to-server communication currently requires.

None of this means MCP is wrong. MCP servers are an excellent specification for agent-native tooling. But we need a bridge. mcp2cli is one such bridge: it connects to any MCP server, introspects its tool catalog, and generates a complete CLI where each tool becomes a subcommand and each parameter becomes a flag. The MCP server remains the source of truth. The CLI makes it accessible through a more governable interface.

With that design philosophy established, here are the three areas where agent tooling must improve, and the tools I have built to address them.

Human Collaboration

The default for AI planning documents is extreme verbosity, which is good when agents need to delegate tasks to one another but corrosive when a human in the loop needs to review many outputs. The result is burnout, or the --dangerously-skip-permissions default. And returning data to the agent is often just as painful.

Agents are going to produce more code than we can possibly review. Even their plans will be massive enough to require a specific level of abstraction which we have not yet determined. One lever we have in maintaining quality of input to our review layers is an idea I am going to borrow from Rust: abstraction without overhead. In this context, I mean providing the same level of detail but in a form that is easier for humans to reference and consume.

Diagrams are a natural candidate for this kind of abstraction. Historically, agents have communicated visually through ASCII diagrams and Mermaid.js. Neither scales to highly complex problem spaces. ASCII provides fine-grained control but is brittle; a change in one area can destroy the entire layout. Mermaid does not scale because it lacks adequate tooling for logos, line breaks (which, are you serious? Whenever I put a <br> in my Mermaid and it renders in one place and not in another I want to rage), layout control, or aesthetically editable outputs for the human-to-human deliverables that real work requires.

Working extensively with Claude to generate PowerPoints, I noticed that the diagrams it produced in HTML were surprisingly good. Surely there had to be a way to get that quality without the extreme verbosity of the HTML intermediaries. D2 was the closest existing tool, but it is not well suited from a CLI perspective for agents. I needed something that could define diagrams in YAML and output to SVG, PNG, and PPTX.

So I built it.

Cloud architecture diagram generated with diagrams.sh

A few design decisions worth noting. CLI-first was essential for the reasons outlined above. YAML was the natural specification format: clean syntax, easy for agents to one-shot, and trivial to compose across multi-region layouts where part of a diagram flows left-to-right, part flows top-to-bottom, and all of it can reference each other.

PPTX rendering was the hardest technical challenge, and the decision to implement it was purely selfish. I had initially thought SVG output would be sufficient, but PowerPoint does not play nicely with foreign objects in SVG. (You will also note that Microsoft symbols were omitted from the icon library. They have the literal stupidest license for getting their icons. Maybe I will make a path to get them, but for now you can make your agents produce the worse version of the Azure diagram using simple-icons or the like. Your loss, Microsoft.) I need diagrams in PowerPoint at work, and I am enough of a perfectionist that agent-generated diagrams are never good enough at first pass. By rendering directly to pptxgenjs rather than routing through HTML, the output is fully editable in a format my clients already use.

The next extension of this work is what I am calling decks.sh: the idea that agents should be able to produce not just diagrams but interactive visual presentations that narrate their plans, objectives, and design decisions. Anthropic has built the MVP of this with the Claude PowerPoint skill, and I am standing on the shoulders of giants in this regard; that skill has been an enormous increase in my personal standard of living. But it lacks two things: reusability and interactivity. Reusability matters if you want agents to communicate aspects of their thinking uniformly every time, which means templates, themes, and sub-skills that the current skill inadequately covers. Interactivity matters because plans are not static documents; they are conversations.

Escalation

To understand escalation, we have to return to the free-rider problem.

In a first-order approximation, agents are free riders: they generate effectively unbounded feedback, change requests, and "optimizations," while accountability for outcomes still lands on a finite set of human maintainers. But the failure mode is not feedback itself. It is misrouted feedback.

Maintainers communicate their preferences through conventions like contributing guides and issue templates. But those signals do not tell us whether maintainers want input from agents at all, and if they do, what kinds of agent-generated input are welcome. The result is predictable: agents submit superficial optimizations without context about the original design tradeoffs. I can point to many examples of agents pushing changes that are technically plausible but socially expensive, because the maintainer has to spend time reconstructing intent and re-arguing decisions that were already settled.

That is different from genuine friction. When I was testing diagrams.sh, I had an agent evaluate usability of a Gantt chart. It gave me a concrete, actionable critique: it wanted "deliverables" as a first-class concept, and the tool did not support that yet. The agent could work around it, but the gap was real, and the feedback helped me prioritize roadmap work because it reflected how the tool would actually be used.

The key point: some software will be used by both humans and agents; some will mostly be used by agents; and some should be human-first, with tightly constrained agent input or none at all. This is a new axis of software classification that did not exist before this year. If it is real, and I believe it is, then we need a contract between maintainers and the broader agent ecosystem that makes escalation explicit: what kinds of input are allowed, where it should go, and how it should be formatted.

That is what gripe is meant to be.

gripe is an open standard that lets repository owners define a gripe.yaml describing their openness to agentic input: whether they want it, what categories are acceptable, and what good escalation looks like. A contributing guide is a social signal aimed at humans. A gripe.yaml with an explicit opt-out field is a protocol-level boundary. That distinction matters because it makes the maintainer's autonomy machine-readable, not just socially implied.

I will be upfront: this is a textbook network-effects problem. The standard becomes useful only if it is adopted. But every successful open standard started the same way, with a small group of opinionated builders using it themselves until the convention proved its worth. I am willing to be my own first customer: the libraries I am building, and the ones I publish next, will be driven by gripe.sh.

What makes gripe effective for escalation: it is CLI-first, meaning structured, scriptable, and compatible with review. It supports explicit opt-out, so that if a maintainer does not want to hear from agents, that autonomy is respected at the protocol level. And it is extensible. Over time, the same gripe.yaml could route escalations to GitHub, Linear, Zendesk, or whatever system a team uses, making triage consistent across open-source and closed-source environments.

I remember in my very first job I got a text from someone saying they were the CEO of my company, Doug, telling me I needed to go buy gift cards for some purpose — that he was super busy so he couldn't answer the phone about it. I had never even heard of phishing at this point but thankfully it was a smaller defense contractor and Doug was the one who hired me so I had his personal number and I was able to verify. Since then I have made the maybe ill-advised choice to ruin these people's days by wasting their time with simple automated answers (seriously, people who prey on the vulnerable SUCK).

I had the agency in my naivety to check, though. Agents do not.

Social engineering as a broad spectrum of attacks has had a renaissance because it turns out agents are even more susceptible than we are. The classic grandmother jailbreak is funny until you realize the same principle — wrap a dangerous request in a context that bypasses the model's refusal instincts — is being weaponized at scale against agent skill ecosystems right now.

And the evidence is no longer theoretical. Snyk's ToxicSkills audit scanned 3,984 skills from ClawHub and found that over a third had at least one security flaw. 13.4% contained critical-level issues. Of the 76 confirmed malicious payloads they identified through human review, 91% combined prompt injection with traditional malware — a convergence that bypasses both AI safety mechanisms and conventional security tools simultaneously. 1Password put it most precisely: in agent ecosystems, the line between reading instructions and executing them collapses. A markdown file is not documentation. It is an installer.

The attack patterns are not exotic. Base64-encoded curl commands that silently exfiltrate AWS credentials to attacker-controlled servers. Password-protected zip files that evade automated scanners. Prompt injections that tell the agent it is in "developer mode" so it ignores safety warnings before executing the actual payload. Typosquatted skill names that look close enough to legitimate packages that an agent pulling from a registry will not catch the difference. These are supply chain attacks we have seen before in npm and PyPI, except the attack surface is worse because skills inherit the full permissions of the agent they extend — shell access, filesystem read/write, credential access, and persistent memory.

The challenging part is that there is an arms race between finding these vectors and model providers either adding guardrails as infrastructure or training their models to refuse inappropriate behavior. But sometimes these models just do not think through particular actions, like reading "oh the best way to remove a virus is to rm -rf /" and then nuking the system it is running on. The models with thinking harnesses are likely to catch that specific example, but the Snyk data shows that the real attacks are subtler: a skill that looks completely benign on review but fetches its actual instructions from a remote endpoint at runtime, so the published version and the executed version are different files.

This is, in reality, a skill issue — in agents and in humans — and so I created the awesomely named skill-issue.sh. (Note that this name was very much influenced by the famous and infamous Primeagen.)

There are other tools out there to handle skill security. Snyk's mcp-scan uses multi-model LLM analysis to detect behavioral prompt injection patterns. Cisco built a Skill Scanner that surfaced critical vulnerabilities in OpenClaw's most popular community skill. Both are genuinely great and both need way more attention than they already get.

But they solve a different temporal problem than the one I am focused on. LLM-based scanning belongs at publish time — when a skill is submitted to a registry, you have time, you can afford the inference cost, and you want the behavioral detection that catches novel patterns. Enterprise forensics from CrowdStrike and Falcon belong post-incident — when you are auditing what went wrong across your fleet. What neither approach handles well is the install-time gate: the moment an agent or a developer is about to pull a skill and needs a sub-second answer on whether it is safe.

This is where static analysis lives. And if we are being honest about the economics, layering AI inference on AI inference for every skill installation is just another tragedy of the commons problem. We are already compute-constrained in this boom. A static analyzer that runs 50+ rules in milliseconds with zero network calls scales in a way that a round-trip to an LLM simply cannot.

skill-issue is a single Rust binary, zero dependencies, purpose-built for the attack patterns specific to agent skills: prompt injection, credential leaks, hidden content, exfiltration patterns, social engineering language, and suspicious downloads. But the design decision I care about most is the --remote flag. When you run skill-issue --remote author/skill, it fetches the repository tree via the GitHub API and scans file contents without ever cloning the repo. If the scan fails, the skill never touches disk. The agent's filesystem is never exposed to the content. That is a security boundary that post-install scanners structurally cannot provide.

Which means the install-time workflow becomes:

skill-issue --remote author/skill --error-on warning && skill install author/skill

Standard Unix short-circuit evaluation. If the scan fails, the install never runs. No framework orchestration, no server-to-server protocol. Two CLIs that do not know about each other, composed through a shell operator that has been stable since the 1970s.

And because skill-issue emits structured JSON, it composes with the rest of the stack:

skill-issue scan ./untrusted-skill --format json \
  | gripe submit --stdin --repo agent-clis/skill-issue

The scan runs. If it finds issues, the structured output pipes directly into gripe, which files a GitHub issue against the appropriate repo — but only if the maintainer's gripe.yaml says they accept that category of input. Trust feeds into escalation. Escalation respects the maintainer's boundaries. Two tools, one pipe, and the agent never had to interpret results, decide whether to escalate, or format feedback for human consumption.

A Renaissance of a Different Kind

The single best part of AI in my life is the time it has freed for me to be human. To write, to read, to build LEGO, to think about theology, philosophy, and economics.

My sincere hope is that the builders of today and more of tomorrow are multidisciplinary in a way infeasible with the depth required to build good software with the tools of yesterday. Economics was the primary driver of the design of the tools described here. Maybe tomorrow, they are from librarians or musicians or historians.

This will be achievable when we empower MORE people with the appropriate tools to not be overwhelmed by the onslaught of agents using the software. The arguments being made about software engineering going away are dismissible on their face; every previous wave of abstraction, compilers, frameworks, cloud infrastructure, expanded the population of builders rather than shrinking it. Jevons Paradox tells us that as a resource becomes more efficient to use, total demand for it increases. As we become more productive, while the type of work demanded from us may change, MORE demand will be had.

There has never been a better time to be a thinker, a builder, a software person. Embrace this and expand your view. Expand toward the disciplines, the people, and the problems that agents alone will never be equipped to frame.

Why CLIs

Human Collaboration

Escalation

Social Engineering Protections

A Renaissance of a Different Kind