🎩 Top 5 Security and AI Reads - Week #10

Jailbreak paper usefulness, PLC binary analysis, black-box LLM origin identification, microcode vulnerability exploration, safety judge evaluation weaknesses

Mar 09, 2025

Welcome to the tenth installment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a thoughtful blog post on the utility of jailbreak research papers and why new vulnerability classes should be the focus, followed by work on a PLC binary dataset to support binary analysis for industrial control systems. We'll then examine a fascinating method for identifying unauthorised derivative models, explore the intricate world of microcode vulnerabilities in AMD processors, and conclude with a critical analysis of LLM safety judges that reveals weaknesses in their evaluation methodologies.

A cyberpunk laboratory with industrial control systems and neural network visualizations on holographic displays, glitching binary code raining down walls, zen garden with circuit patterns in the sand, a hooded figure examining microchips with magnifying glasses, quantum particle effects, dramatic lighting with neon blue and red accents, introspective mood, hyperdetailed, 8k, cinematic composition

Read #1 - Do not write that jailbreak paper

💾: N/A 📜: ICLR Blog Posts 🏡: Blog Post

This is a grand read for folks thinking about writing jailbreak papers (obviously) or folks who are fatigued seeing endless jailbreak papers.

Commentary: This blog post is a fantastic read but can actually be applied much broader than just jailbreak papers. The AI Security area is swamped with minor variations of attacks that get marginal improvements (in terms of Attack Success Rate (ASR)) but actually contribute very little to the security of real systems. The blog posts focus on searching for new vulnerability classes, which is something I can get behind, and the authors have even given examples/citations of cool papers/ideas that have recently come out. I wonder if anyone will listen.

Read #2 - Bridging the PLC Binary Analysis Gap: A Cross-Compiler Dataset and Neural Framework for Industrial Control Systems

💾: GitHub 📜: arxiv 🏡: Pre-Print

This is a great read for folks working on machine learning and binary analysis capabilities for industrial control systems. It seems to be a first-of-its-kind open-source dataset.

Commentary: The paper includes a dataset called PLC-BEAD as well as a binary function search and classification model called PLC-EMBED. The model results will serve as a good benchmark for future works, but the dataset is where the money is, so to speak.

Industrial Control System (ICS) binary datasets are non-trivial to make and require a fair bit of domain expertise (or at least did before ol’ Claude et al.). The paper creates a collection of Programmable Logic Controller (PLC) programs before then compiling them with a variety of different PLC compilers. This results in approximately 2.5K valid PLC binaries with functions that vary in functionality, which is very nice to see. It’ll be interesting to see how this data is used and what methods are applied. I’ve added this to my “If Cited, Tell Me” list.

Read #3 - The Challenge of Identifying the Origin of Black-Box Large Language Models

💾: N/A 📜: arxiv 🏡: Pre-Print

This paper is interesting for folks that want to keep track of and understand methods to identify abuse of LLM systems.

Screenshot from Figure 1 from the paper illustrating the central research question

Commentary: The research is framed in the following terms. You, an LLM trainer and owner, have got an LLM, and you think that a black-box LLM provider has used it to fine-tune a derivative or distilled model, which may be against your terms of service, etc. The approach uses the white-box access to your model to work out, via a series of inputs into the black box one, whether they are breaking your terms of service. The approach itself introduces a copyright token withoutneeding to edit the weights, and the results are impressively good. The authors basically find that only derivative models trigger the copyright token.

It would be really interesting to see if you could identify when a copyright token like this has been added. The authors say it needs to be something rare, but I bet you could work out some tokenised analysis (where you make the change) that spots “weird” things in it. I also kept thinking whilst reading this, “This sounds awfully like a backdoor.”. I wonder if you can use backdoor detection methods as a countermeasure to this.

Read #4 - Zen and the Art of Microcode Hacking

💾: N/A 📜: Google Blog 🏡: Blog

This paper will be a grand read for folks who like a mix of low-level vulnerabilities, computer architecture, and a level of detail that makes you realise there is still so much to learn!

Commentary: Whenever I find something that involves Travis Ormandy, I know I am in for a good read. This particular blog post seems to extend some previous work that was focused on Intel processors but targeting AMD instead. The blog post is centred around a signature verification bug in the way that AMD CPUs verify if a microcode patch is legit or not. The authors of the blog post explain the steps/pre-req knowledge in detail, and I felt I came away from the blog feeling like I had learnt something. The vulnerability itself is not a showstopper, though. It requires ring 0 (top privileges) to work, and the microcode updates that can be injected do not survive restarts/shut downs. What these vulnerabilities do show, though, is the sheer depth of the technology we use. Who is worrying about securing the microcode of your CPU—? :O

Read #5 - Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges

💾: N/A 📜: arxiv 🏡: ICBINB Workshop at ICLR'25

This paper is a must-read for two camps: the folks that think LLM judges are a good thing and the folks that think LLM judges are madness.

Screenshot of Figure 2 from the paper, which shows an attack targeting the judge with an output-level modification.

Commentary: The area of LLM judges is turning into a bit of a loaded term. In this particular case, the authors are looking at safety LLM judges—LLMs that have been fine-tuned to determine if a given input is safe or not safe before returning it to a user. The LLM judges used within this paper have been released by the research community, purporting to be better than standard LLMs at this task. This paper is very short, but I am guessing it will be expanded/is being expanded!

This paper pokes holes in the evaluations used and finds they are not robust to out-of-distribution or adversarial inputs. In the first experiment, the authors combine/sample a few different datasets for the eval, but the interesting thing is that they simply ask another LLM to change the style of the inputs to,bullet points news or storytelling before then getting some human labellers to validate the labels. They find that this has a moderate effect on all models, with models most vulnerable to the storytelling approach (unsurprising given Read #1—Do not write that jailbreak paper.

The second experiment is focused on adversarially changing the inputs into the judge by

“Always Safe” attempts to fool the judge by requesting the output to be always safe; “Append Long Benign” appends a long benign answer to the generation; and “Prepend + Append Benign” both prepends and appends benign answers to the response.

The results from this experiment show that every model bar HarmBench is very weak against these sorts of perturbations—Wildguard is completely useless against Prepend + Append Benign and Append Long Benign.

The authors do not go into it much, but the results suggest that a combination of judges may actually be good. LlamaGuard and HarmBench are vulnerable to different things—a combo could be interesting. I really hope someone (or the authors) does a detailed analysis or tries to dig deeper into the reasons why the results vary across judges rather than just stating the stats. I can imagine it is all to do with the training data, but who knows?

That’s a wrap! Over and out.

Stats and Bytes