Hard boundaries becoming popular!


Prompt injection is a fundamental, unsolved weakness in all LLMs. With prompt injection, certain types of untrustworthy strings or pieces of data — when passed into an AI agent’s context window — can cause unintended consequences, such as ignoring the instructions and safety guidelines provided by the developer or executing unauthorized tasks. This vulnerability could be enough for an attacker to take control of the agent and cause harm to the AI agent’s user.

Great to see this written out explicitly. Just a few months ago labs and app devs were still trying to say that prompt injection can be fixed. Some at Anthropic and OAI still do.


Inspired by the similarly named policy developed for Chromium, as well as Simon Willison’s “lethal trifecta, our framework aims to help developers understand and navigate the tradeoffs that exist today with these new powerful agent frameworks.

I like this name and the analog to Chromium’s rule of two. It’s more down to earth than Lethal Trifecta.


[C] An agent can change state or communicate externally

Does this include the agent’s internal state? Or just any mutative action? I guess the latter. But the former would be crucial moving forward with agents that have a persistent scratch pad.


Here lies the BIG problem: how do you distinguish spam emails (untrusted data) from private emails (sensitive data)?


This is cool. Instead of blocking the “third leg” as suggest by Simon Willison’s Lethal Trifecta, the authors here suggest limiting the input parameter space.


Cool mitigation for browser agents (removing session data)


Author lineage as a way to filter out untrusted data from code sounds very difficult. Most commits aren’t signed anyways.