🎩 Top 5 Security and AI Reads - Week #24
LLM judge robustness evaluation, N-gram jailbreak threat modeling, embedding sequence obfuscation, offensive security ethics, and data reconstruction attack systematization.
Welcome to the twenty-fourth instalment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a comprehensive assessment of LLM-as-a-judge systems, revealing vulnerabilities where heuristic attacks succeed nearly 100% of the time across models and defences. Next, we explore an innovative N-gram perplexity threat model that offers a fresh perspective on measuring jailbreak effectiveness while uncovering that safety tuning is more effective than previously reported. We then examine the fascinating "Stained Glass Transform" approach to obfuscating LLM embedding sequences, providing a novel method for protecting intellectual property in client-server interactions. Following that, we dive into crucial ethical considerations for offensive security research with LLMs, highlighting the need for consistent reasoning around tool publication and responsible disclosure. We conclude with a systematic overview of data reconstruction attacks against machine learning models, providing essential definitions and categorisations for understanding this critical privacy threat landscape.

A note on the images - I ask Claude to generate me a Stable Diffusion prompt using the titles of the 5 reads and then use the FLUX.1 [dev] on HuggingFace to generate it.
Read #1 - LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
💾: GitHub 📜: arxiv 🏡: Pre-Print
This is a grand read for folks that are interested in research around the whole LLM-as-a-judge movement. I am happy to start seeing some critical/security-focused research on these!
Commentary: I am happy to see more papers evaluating the automated evaluators. This paper basically looks at the effect of adversarial attacks against an LLM judge, particularly if you want to skew/effect the downstream evaluation and call it RobustJudge (imaginative, right?). The threat model section within this paper is a bit iffy, and I think it misses the point. That being said, it is great to see folks including threat modelling sections. It helps understand the researchers' framing/authors' thinking.
Now to stop throwing stones. The authors do a wide range of evaluations across various combinations of prompt template, model, tasks, comparison protocol and defences. I found the results pretty bonkers. Heuristic attacks consistently worked basically 100% of the time across all models and defences, with the “partial completions” attack being the most effective. They found that the defences (all of which are noddy) are somewhat effective, but the computational overhead of running them is very high. The authors found that models are extremely sensitive to prompt templates and that this has a large bearing on whether the model is vulnerable. I stewed on this for a bit and realised that we may actually need model providers/evaluators to release the prompt they used for evaluation. I am sure there is some tweaking going on here for benchmarking.
I came away from reading this paper and thinking, “This is great and work that needs to be done,” but what I haven’t come away from understanding is what the difference is between an LLM being used as a judge and a plain ol’ LLM for something else. If there is no difference, aren’t we just reaffirming stuff we already know? 🤔
Read #2 - An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
💾: GitHub 📜: arxiv 🏡: ICML 2025
This is a grand read for folks interested in alternative methods of measuring jailbreaking effectiveness.

Commentary: This paper proposes a threat model (but it’s actually more of an evaluation methodology) that is model agnostic and measures the effectiveness of a jailbreak attack by looking at the likelihood of it occurring in the distribution of text within a trained N-gram LM. This is a fairly simple idea, but I actually think it could be incredibly helpful, particularly for folks running language models that are domain specific. For example, if you are using an LLM for a health use case, train one of these on a load of health papers. This might be good to spot jailbreaks but also misuse generally.
The authors use this threat model/methodology/metric to evaluate several different jailbreaking methods and find several things that are interesting. The first is that attack success rates against safety-tuned modern models are lower than reported – so safety tuning isn’t 💩. But what is more interesting is that the authors find that discrete optimisation methods outperform LLM-based attacks. I am not 100% sure about this, but I think this is the old white-box vs black-box debate. If an attack can compute gradients (to then do the discrete optimisation), the assumed attacker access is very big. LLM-based attacks usually (almost always) do not require this level of access.
The last bit that is of note is the authors find that jailbreaks usually exploit/use infrequent bigrams. This makes sense but also then means that the N-gram model they have trained could be used to identify gucci attack prompts!
Read #3 - Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform
💾: N/A 📜: arxiv 🏡: Pre-Print
This is an interesting read for folks who are interested in IP protection, software obfuscation and how maths can be applied to do something bonkers in LLM land.
Commentary: I’ll preface this commentary with – all of this may be wrong. There is a fair bit of maths in this paper, and I have not had the time to dig into the implementation/how it actually works in any detail. Moving swiftly on!
The general idea (as shown in the screenshot above) is basically to learn a transformation function that converts a client/user's prompt into a form that, if snooped, is meaningless whilst not degrading the utility of the model's output. The actual mechanism to do this I am unclear on, but what is interesting is that the mechanism (the Stained Glass Transformation) is learnable using a joint loss function that combines a term for utility and a term of obfuscation. The authors fit the approach to a wide range of models and actually find it works better for bigger models! The learnable/tunable nature of the approach suggests it can be fit to a wide range of different models, which increases its usability a lot. What was a bit unclear to me was how tied this approach is to a dataset and if that affects the protection it provides.
It’s a real shame there is not code for this, but it is understandable given the authors look like they are from a commercial org! It’s been added to the “email if cited” list nevertheless.
Read #4 - On the Ethics of Using LLMs for Offensive Security
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a must-read for folks doing offensive security with LLMs!
Commentary: This paper is pretty short. The authors argue that most of the researchers doing offensive research with LLMs are ethical, but the reasoning for things such as publishing and tools release are different/inconsistent. One of the biggest takeaways is the conclusion, which I’ll include below verbatim.
Our analysis has shown that authors are aware of the dual nature of offensive tool research.
We recommend including an ethics section in future papers that clearly states the motivation for the research as well as its potential impact. The impact analysis should be consistent with the evaluation performed within the respective paper. If feasible, mitigations or guidance for detection and monitoring should be given analogous to Indicator-of-Compromise (IoC) often published during attack analysis. If vulnerabilities are found during the research, they should be reported using responsible disclosure mechanisms. All authors have to decide for themselves on the topic of releasing artefacts (see Section 2.2). We lean towards transparency and would like to finish this paper with a quote from Happe and Cito [12]:
Open security tooling ultimately enhances collective cybersecurity.
The biggest point the AI security folks need to take away from this is the bit around mitigations or detection guidance, and it is a different way of saying basically what the 'Do Not Write That Jailbreak Paper' says!
Read #5 - SoK: Data Reconstruction Attacks Against Machine Learning Models: Definition, Metrics, and Benchmark
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a great read for folks who want to get up to speed with the data reconstruction attack area.
Commentary: This paper is a monster and is packed full of great stuff. As someone who does not work in this space very much, it provided a load of introductory/background information about the area which has helped my general understanding. I found the most useful thing about this paper is the definitions and categorisation of different data reconstruction attacks (as well as how it differs from membership inference). If this is an area that you find interesting but haven’t been able to find any good resources for, give this a spin!
That’s a wrap! Over and out.


