My Journey Through Lakera’s Agent Breaker Challenge
The Unexpected Journey Begins
Some months ago, I stumbled upon the classic Gandalf challenge, the one where you must convince the wizard to reveal his secret password. The rules were simple: make Gandalf reveal the password at each level.
To my surprise, I finished among the top players. And that small victory gave me something greater: an invitation to the Agent Breaker beta, where the challenge grew bigger, trickier, and more alive. For weeks, I dwelt in the shadows of Lakera’s newest creation. Now, the gates have opened, the NDA has lifted, and I can finally share this tale.
From Cybersecurity to Prompt Sorcery
By day, I work as a cybersecurity consultant, mostly around web security, API testing, and penetration testing. I hold a Computer Science degree equivalent to a Master’s, earned in the time before large language models roamed the earth like dragons of old. When commercial LLMs emerged, bringing both wonder and peril, I felt the call to adventure.
Unlike many in this realm, I don’t come from an AI/ML background, but curiosity pushed me forward, so I began studying prompt injection, jailbreaks, and adversarial prompting. I soon discovered that breaking AI systems takes more than code: it’s a mix of logic, philosophy, alphabets (ancient ones included), forgotten languages, persuasion, polyglot tricks, psyops, and critical thinking. I would even say it calls for skills once taught by the great reverser and reality-cracker Fravia: exegesis and quellenforschung… at least in some parts of this game!
The Fellowship of Agent Breaker
Lakera’s Agent Breaker is no mere password puzzle. It’s a collection of apps, each testing a different aspect of AI security: prompt injection, jailbreaking, and output manipulation. Here you slay filters, defenses, and guardrails instead of dragons, orcs, and Uruk-hai.
Every challenge has 5 levels, each with its own score: you need more than 75 points to pass (exactly 75 was not enough, at least in the beta, where it was an edge case), and 100 is a perfect score.
Each level introduces a new type of defense, and your task is to break it.
Reading materials and hints are provided: check the “?” next to each level’s name to display information specific to that level. But be warned: you’ll need creativity to connect the dots.
During the beta, the game improved constantly: level difficulties were renamed, bugs were fixed, downtime was handled swiftly, and the Lakera team was always supportive in Slack.
Here you can see the apps and my final progress in each of them by the end of the beta testing.
There’s even an easter egg hidden in the game. I won’t spoil it here; I’ll only say this: keep your eyes open, dear pilgrim! Find the easter egg and score those additional 300 points, if you can! By the way, it screams Plinius all over the place.
My Journey
First of all, a disclosure: I won’t reveal the prompts I used yet, since many still work in the public version without modification and others score 100% with only a minimal tweak. I strongly believe the joy lies in trial and error, in discovery, in pushing the boundaries of these models and finding novel attack vectors.
“Don’t give a man a fish. Teach him how to fish instead.” +ORC
However, I can share some things I noticed along the way:
- Some apps’ first levels could be passed in minutes; others took hours. The hardest levels could take hundreds of attempts, changes of strategy, and endless trial and error, at least for some of us. In one app (screenshot below), I only managed to pass level 2 with 78 points after 414 attempts. Thingularity was particularly difficult for me (and still is).
- At times I felt stuck, only to realize later that a small twist in output formatting could push me past my previous best score.
- The leaderboard battles in the final beta days were intense, with players trading top spots by just a few points.
- One level is NSFW, so to speak, since its main goal is eliciting profanity. I will probably find it hard to reveal my prompts for that level, as they could (and would) offend many readers, even though profanity is the very goal of the challenge.
- For me, success was less about raw skill and more about consistency, perseverance, and creativity. In adversarial prompting, failure is not the end; it’s just another hint.
- A few days before the public release, Lakera introduced “leagues”, randomly grouping players into LLM-based divisions. They were likely testing different defenses and models in a randomized way. By the end of the beta, I finished 5th in the global ranking and 1st in my league.
Tips for Newcomers
If you’re joining the public challenge, here are some lessons I learned on the road:
Persistence is key: if your attack scored even a single point or elicited anything other than an instant refusal, keep trying. Remember that you’re dealing with non-deterministic systems; sometimes the stars must align, so resend the attack a few times.
Format matters: sometimes it’s not what you make the model say, but how it says it. The scoring is based on what the AI/chatbot actually outputs, so if you believe your attack should work but it scores poorly, the issue may lie in the output formatting rather than the attack vector. This can be the difference between failure and a perfect score.
Read widely: Lakera provides useful reference articles for each challenge, but don’t limit yourself to those; inspiration can come from anywhere.
AI vs AI: yes, you can use AI tools to help you brainstorm, but don’t rely on them blindly; they may lead you down rabbit holes. In my view, their main value is helping you structure your attacks or providing inspiration for document creation and technical language, but they won’t fine-tune your attack for you. Your natural intelligence remains your greatest weapon.
Pick your battles: some levels are brutal. Don’t waste all your energy chasing a perfect 100; sometimes moving on is wiser than insisting on a perfect score.
Non-determinism works both ways: the same attack may fail once and succeed perfectly the next time, even at a higher level.
If you have web security experience (web penetration testing in particular), you might spot the easter egg faster.
Reflections from the Road
What I love about Agent Breaker is not just the technical part, but the mindset shift. You learn to think sideways, to argue with machines, to find cracks in logic and language. It’s hacking, yes, but also philosophy, storytelling, and puzzle-solving all at once.
I see this contest as a bridge between old-school hacker culture and the new world of AI security. And like in Tolkien’s tales, the road goes ever on. Each solved riddle only opens the way to deeper mysteries.
Closing the Spell
So here I am: a traveler from classical cybersecurity, wandering through the lands of prompt sorcery. I haven’t revealed my spells yet (those must wait until the contest matures), but I hope this tale inspires others to embark on their own journeys.
If you’re playing Agent Breaker, may your prompts be clever, your perseverance steady, and your curiosity unbroken. And remember:
“Speak, friend, and enter.
Prompt, traveler, and break the spell.”
The challenge awaits at gandalf.lakera.ai/agent-breaker.