OAI Q&A on Security From AI
This is part 3 of my coverage of OpenAI's Security Research Conference. Here are part 1 and part 2.
As soon as they opened up the room for questions I raised my hand. I was prepared. I had also primed a member of their technical staff in advance, jokingly asking if we could ask "real questions", to which he replied: what is a real question? People asked very real questions and got very real answers; kudos to the OpenAI team for their openness to debate.
I ended up asking two questions (thank you Ian). Here is an imperfect summary of a few questions and answers I found interesting, including my own. These are my recollections after more than 48 hours and 24 hours in an airplane, so please take them with a grain of salt.
Question: LLMs were a black box from the get-go, and are only getting more obscure with reasoning models. How can we trust them if we can't figure out what they are doing?
Answer: He doesn't have high hopes for mechanistic interpretability; many of the results have been overstated. They do have other promising ideas. He believes that hallucination will be solved.
Question: Content moderation is pushing offensive security researchers to use weaker models (not OpenAI). Would you consider a program where they could get unfiltered access to models?
Answer: Yes, we are thinking about it. We want the good guys to have a head start.
Question: What security problems do you think the community should focus on, besides prompt injection?
Answer: Privacy. Attackers getting the model to regenerate training data, thereby gaining access to information they shouldn't have ([MB] e.g., another user's data used for training).
Question: Given that, as you stated, prompt injection is still a big problem, and that getting to 99.999% reliability wouldn't prevent attackers from getting their way, how should people think about deploying agents that now have tools that can do real harm?
Answer: People should not deploy agents that can do real harm. He believes that some of the research they are working on could solve prompt injection 100% of the time.
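The arithmetic behind the question's "99.999% is not enough" framing is worth spelling out: an attacker who can retry freely needs only one success, so even a tiny per-attempt failure rate compounds at scale. A quick sketch (the attempt count is an arbitrary assumption for illustration):

```python
# Why a 99.999%-effective injection defense still fails against a
# patient attacker: the chance of at least one successful injection
# grows with the number of attempts.
defense_rate = 0.99999          # defense blocks 99.999% of attempts
attempts = 1_000_000            # assumed volume of automated attempts

p_breach = 1 - defense_rate ** attempts
print(f"Probability of at least one successful injection: {p_breach:.6f}")
```

With a million automated attempts, a breach is all but certain, which is why the answer above draws the line at agents whose tools can do real harm rather than at any particular defense percentage.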