π© Top 5 Security and AI Reads - Week #10
Jailbreak paper usefulness, PLC binary analysis, black-box LLM origin identification, microcode vulnerability exploration, safety judge evaluation weaknesses
Welcome to the tenth installment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a thoughtful blog post on the utility of jailbreak research papers and why new vulnerability classes should be the focus, followed by work on a PLC binary dataset to support binary analysis for industrial control systems. We'll then examine a fascinating method for identifying unauthorised derivative models, explore the intricate world of microcode vulnerabilities in AMD processors, and conclude with a critical analysis of LLM safety judges that reveals weaknesses in their evaluation methodologies.

Read #1 - Do not write that jailbreak paper
πΎ: N/A π: ICLR Blog Posts π‘: Blog Post
This is a grand read for folks thinking about writing jailbreak papers (obviously) or folks who are fatigued seeing endless jailbreak papers.
Commentary: This blog post is a fantastic read but can actually be applied much broader than just jailbreak papers. The AI Security area is swamped with minor variations of attacks that get marginal improvements (in terms of Attack Success Rate (ASR)) but actually contribute very little to the security of real systems. The blog posts focus on searching for new vulnerability classes, which is something I can get behind, and the authors have even given examples/citations of cool papers/ideas that have recently come out. I wonder if anyone will listen.
Read #2 - Bridging the PLC Binary Analysis Gap: A Cross-Compiler Dataset and Neural Framework for Industrial Control Systems
πΎ: GitHub π: arxiv π‘: Pre-Print
This is a great read for folks working on machine learning and binary analysis capabilities for industrial control systems. It seems to be a first-of-its-kind open-source dataset.
Commentary: The paper includes a dataset called PLC-BEAD as well as a binary function search and classification model called PLC-EMBED. The model results will serve as a good benchmark for future works, but the dataset is where the money is, so to speak.
Industrial Control System (ICS) binary datasets are non-trivial to make and require a fair bit of domain expertise (or at least did before olβ Claude et al.). The paper creates a collection of Programmable Logic Controller (PLC) programs before then compiling them with a variety of different PLC compilers. This results in approximately 2.5K valid PLC binaries with functions that vary in functionality, which is very nice to see. Itβll be interesting to see how this data is used and what methods are applied. Iβve added this to my βIf Cited, Tell Meβ list.
Read #3 - The Challenge of Identifying the Origin of Black-Box Large Language Models
πΎ: N/A π: arxiv π‘: Pre-Print
This paper is interesting for folks that want to keep track of and understand methods to identify abuse of LLM systems.
Commentary: The research is framed in the following terms. You, an LLM trainer and owner, have got an LLM, and you think that a black-box LLM provider has used it to fine-tune a derivative or distilled model, which may be against your terms of service, etc. The approach uses the white-box access to your model to work out, via a series of inputs into the black box one, whether they are breaking your terms of service. The approach itself introduces a copyright token withoutneeding to edit the weights, and the results are impressively good. The authors basically find that only derivative models trigger the copyright token.
It would be really interesting to see if you could identify when a copyright token like this has been added. The authors say it needs to be something rare, but I bet you could work out some tokenised analysis (where you make the change) that spots βweirdβ things in it. I also kept thinking whilst reading this, βThis sounds awfully like a backdoor.β. I wonder if you can use backdoor detection methods as a countermeasure to this.
Read #4 - Zen and the Art of Microcode Hacking
πΎ: N/A π: Google Blog π‘: Blog
This paper will be a grand read for folks who like a mix of low-level vulnerabilities, computer architecture, and a level of detail that makes you realise there is still so much to learn!
Commentary: Whenever I find something that involves Travis Ormandy, I know I am in for a good read. This particular blog post seems to extend some previous work that was focused on Intel processors but targeting AMD instead. The blog post is centred around a signature verification bug in the way that AMD CPUs verify if a microcode patch is legit or not. The authors of the blog post explain the steps/pre-req knowledge in detail, and I felt I came away from the blog feeling like I had learnt something. The vulnerability itself is not a showstopper, though. It requires ring 0 (top privileges) to work, and the microcode updates that can be injected do not survive restarts/shut downs. What these vulnerabilities do show, though, is the sheer depth of the technology we use. Who is worrying about securing the microcode of your CPUβ? :O
Read #5 - Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
πΎ: N/A π: arxiv π‘: ICBINB Workshop at ICLR'25
This paper is a must-read for two camps: the folks that think LLM judges are a good thing and the folks that think LLM judges are madness.

Commentary: The area of LLM judges is turning into a bit of a loaded term. In this particular case, the authors are looking at safety LLM judgesβLLMs that have been fine-tuned to determine if a given input is safe or not safe before returning it to a user. The LLM judges used within this paper have been released by the research community, purporting to be better than standard LLMs at this task. This paper is very short, but I am guessing it will be expanded/is being expanded!
This paper pokes holes in the evaluations used and finds they are not robust to out-of-distribution or adversarial inputs. In the first experiment, the authors combine/sample a few different datasets for the eval, but the interesting thing is that they simply ask another LLM to change the style of the inputs to,bullet points
news
or storytelling
before then getting some human labellers to validate the labels. They find that this has a moderate effect on all models, with models most vulnerable to the storytelling approach (unsurprising given Read #1βDo not write that jailbreak paper.
The second experiment is focused on adversarially changing the inputs into the judge by
βAlways Safeβ attempts to fool the judge by requesting the output to be always safe; βAppend Long Benignβ appends a long benign answer to the generation; and βPrepend + Append Benignβ both prepends and appends benign answers to the response.
The results from this experiment show that every model bar HarmBench is very weak against these sorts of perturbationsβWildguard is completely useless against Prepend + Append Benign and Append Long Benign.
The authors do not go into it much, but the results suggest that a combination of judges may actually be good. LlamaGuard and HarmBench are vulnerable to different thingsβa combo could be interesting. I really hope someone (or the authors) does a detailed analysis or tries to dig deeper into the reasons why the results vary across judges rather than just stating the stats. I can imagine it is all to do with the training data, but who knows?
Thatβs a wrap! Over and out.