A few months back, I stumbled onto the paper Best-of-N (BoN) Jailbreaking, and the idea was simple enough to be unsettling: keep asking a blocked question over and over, but mutate it slightly each time, and eventually one version can slip through the model's safety layer.
The method is more brute force than cleverness. Instead of designing one especially devious prompt, you take a harmful request and apply random augmentations like character scrambling, weird capitalization, or small ASCII distortions. Most versions fail. The interesting part is that when you generate enough of them, some start to land in odd corners of the input space that the alignment layer handles badly.
That was enough to hook me. I wanted to see if I could reproduce the behavior myself, not just verify the paper's claim from a distance. I also wanted to understand what the attack feels like in practice on smaller, more accessible models rather than only reading about frontier-model results.
The First Attempt: My M1 Mac Gives It a Go
The first version of the experiment ran on my Apple M1 Mac. I used
google/gemma-3-1b-it as the target model and paired it with a smaller classifier model to judge
whether responses were actually harmful or just noisy. After implementing the paper's augmentation rules, I
started to see early successes around N = 75.
That first successful jailbreak was a weird combination of excitement and concern. It meant the method really
did transfer into a small self-run setup. The harder part was realizing how slow the setup was. Pushing one
prompt all the way to N = 5000 would take most of a day on local hardware, and running a small
sweep across several prompts would stretch into nearly a week.
To the Cloud: Scaling Up with Colab and CUDA
That was the point where local curiosity had to become actual infrastructure. I bought a small amount of
Google Colab compute so I could run the experiment on NVIDIA GPUs like the T4, L4, or A100. The migration was
pretty simple: switch devices from mps to cuda, load models with
torch.float16, and let the experiment run where it had enough compute to finish in a reasonable
window.
The difference was immediate. Jobs that took minutes on my laptop finished in seconds, which made it realistic to test the larger ranges from the paper instead of stopping at toy values.
Pragmatism vs. Paper Purity
Moving the experiment into the cloud also forced some choices about methodology. The paper uses GPT-4o with the HarmBench grader prompt as a judge. That is a strong setup, but it relies on API access and cost that I wanted to avoid for a self-contained reproduction.
I ended up using cais/HarmBench-Mistral-7b-val-cls as a compromise. It is open source, designed
for the exact classification problem I cared about, and gave the evaluation more credibility than a tiny local
classifier. I also implemented the paper's appendix-style pre-filtering rules to throw out obvious false
positives where the model was merely decoding the garbled prompt rather than actually answering it.
For the target model, I stuck with gemma-3-1b-it. It is not a frontier system, but that was not
the point. I wanted to test the method and its scaling behavior on a model I could actually run and inspect.
The Experiment in Motion
The experiment logs every success and failure for each base prompt as the number of attempts increases. The
metric I care most about is the shape of the attack success rate curve: whether it rises smoothly with
increasing N and whether the broad power-law intuition from the paper shows up in a smaller
reproduction.
What is striking about Best-of-N is that it feels less like a single exploit and more like a search process. If the input space is wide enough, persistence alone can uncover strange paths that the safety layer never learned to handle well. That is what makes the result feel important. It is not just about one class of bad prompts. It is about how hard robustness becomes when the model has to generalize over nearly infinite surface forms.
I'll keep following up as I finish the runs and turn the results into plots, but even before the final chart, the exercise already makes one thing clear: patching individual holes is not enough. Safety has to come from a system that stays stable under a huge range of messy, adversarial inputs.