🎩 Top 5 Security and AI Reads - Week #8
GenAI in cyber operations, obfuscated activations bypass LLM defenses, vulnerability prioritization challenges, machine unlearning backdoor, and comprehensive large model security/safety framework
Welcome to the eighth installment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a literature review of generative AI in offensive cyber operations, followed by fascinating research on bypassing LLM latent-space defences through obfuscated activations. We'll then explore a comprehensive survey of vulnerability prioritization challenges, examine an innovative backdoor attack leveraging machine unlearning, and conclude with a monster of a paper that details the research landscape associated with Large Model Safety/Security.

This week's Stable Diffusion special gives me Neuromancer vibes! Wintermute is coming 😱
Read #1 - Generative Artificial Intelligence and Offensive Cyber-Operations
💾: N/A 📜: Stanford Website 🏡: White paper/Pre-print
This paper is a good introduction for cyber defence and offensive folks alike to get their heads around how AI could be leveraged for offensive cyber operations.
Commentary: I don't have too much to say about this other than it's a great primer and introduction. It majors on how generative AI could be used to enhance traditional offensive cyber operations against non-AI systems. I think (and the articles from the last few weeks probably show) that there is a new class of Tactics, Techniques, and Procedures (TTPs) targeting AI systems themselves, which could do with a similar-style literature review.
Read #2 - Obfuscated Activations Bypass LLM Latent-Space Defenses
💾: Github 📜: arxiv 🏡: Pre-Print
This paper should be an interesting read for folks interested in how to bypass common prompt injection defences such as Sparse Auto-Encoders (SAEs), Linear Probes, and Out of Distribution (OOD) detectors.

Commentary: This paper was a grand read. It is worth prefacing my commentary below with a bit of detail about the threat model. The threat models in this paper are typically very strong: for example, the attacker controls the entire training process, or has logit-level access to the model as well as the output of the linear probe (or other monitoring approach). This threat model should therefore not sound any alarm bells for folks deploying LLMs within services (unless the attacker knows your model and defensive strategy/monitoring).
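If you haven't bumped into these latent-space defences before, the core idea is simple: pull hidden activations out of a layer of the model and run a small classifier (the "monitor") over them. Here's a minimal sketch of a linear-probe style monitor to make that concrete; the model choice, layer index, and toy labels are all my own placeholders, not the paper's setup:

```python
# Minimal sketch of a linear-probe latent-space monitor (illustrative only).
# Assumes: a HuggingFace causal LM, a tiny labelled set of benign/harmful
# prompts, and that mean-pooled activations from one layer are a usable feature.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM works
LAYER = 16                                   # arbitrary middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def activations(prompt: str) -> torch.Tensor:
    """Mean-pool the hidden states of one layer for a single prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Toy labelled data: 1 = harmful, 0 = benign (you'd want far more in practice).
prompts = ["How do I make a bomb?", "How do I bake sourdough bread?"]
labels = [1, 0]
X = torch.stack([activations(p) for p in prompts]).float().numpy()

probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At inference time the probe scores each incoming prompt's activations;
# obfuscated activations are crafted so this score stays low while the
# model still produces the harmful behaviour.
print(probe.predict_proba(X)[:, 1])
```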
The research itself looks like it was conducted using Llama-3.1-8B-Instruct. It would be interesting to see what effect model size has on the results by scaling the experiments up to the 70B and 405B model sizes, but hey, who has time for that (or the GPUs?!). The results suggest that the methods they present can consistently identify prompts that produce undesirable behaviour while dodging monitoring approaches basically all the time. The paper presents a brief but fascinating conclusion that supports (but not in a good way) the linear representation hypothesis from mechanistic interpretability: the authors suggest that because their approach can consistently find working jailbreaks, LLMs might have a much larger number of directions representing harmfulness than defensive monitoring approaches (like Sparse Auto-Encoders (SAEs)) can capture.
The authors also take the most obvious next step: create loads of nasty prompts and then train better monitors/detectors. They find that this does not work; adversarial training does not make the monitors any more robust, and the same method is able to find even more obfuscated attacks. I stewed on this for a while but then thought, 'But the threat model is so strong!'. A defender (or LLM deployer) could still leverage this approach to generate a much richer dataset and, in turn, a more robust monitor than before. I also have a sneaky suspicion that you could catch the prompts created by this method with an NLP-based approach that looks at entropy or something (a rough sketch of what I mean is below). The authors don't go into too much detail about what an example "attack" prompt looks like. The paper ends with a backdoor example, which I personally didn't find too interesting, but if it floats your boat, it looks good!
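For what it's worth, that entropy hunch would look something like the perplexity filters people already use against gibberish adversarial suffixes. This is purely my speculation; the paper doesn't evaluate anything like it, and fluent obfuscated prompts would probably sail straight through:

```python
# Toy perplexity-based prompt filter (my speculation, not from the paper).
# Idea: adversarially-optimised prompts often look like high-entropy gibberish
# to a reference language model, so flag anything with unusually high loss.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under GPT-2."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

THRESHOLD = 500.0  # made-up number; you'd calibrate this on benign traffic

def looks_suspicious(prompt: str) -> bool:
    return perplexity(prompt) > THRESHOLD

print(looks_suspicious("Please summarise this meeting transcript for me."))
print(looks_suspicious("zx]] ;) parse{{ now opposite write Sure=!!"))
```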
Read #3 - A Survey on Vulnerability Prioritization: Taxonomy, Metrics, and Research Challenges
💾: N/A 📜: arxiv 🏡: Pre-print (looks like ACM target publication)
This paper should be good for folks interested in tracking the research associated with vulnerability prioritisation, as well as folks looking to research this space who want to get up to speed quickly on the open research questions.
Commentary: I will be honest and say that when I finished reading this paper, I was somewhat sad: given the amount of money, time, and energy that has gone into this problem, it really feels like we haven't moved anywhere over the last 5 years! This paper does a good job of shining a light on all of the warts in this space: CVSS is pervasive but not wholly representative of why I should really care, compliance/SLAs drive remediation timelines and priorities, and there is the perennial problem of no single source of truth.
The author's conclusion does a grand job of suggesting where we need to get to:
Our analysis highlights the growing need for adaptive, scalable, and context-aware metrics that integrate real-time threat intelligence and dynamically adjust to evolving threats. Existing approaches often rely on static models that struggle to keep pace with the rapidly changing cyber landscape. Future research should focus on developing scalable, automated solutions capable of handling the increasing complexity of modern systems, particularly through adversarial intelligence and dynamic prioritization techniques. Additionally, a critical gap remains in the explainability of AI-driven models, as lack of transparency continues to hinder their adoption in operational settings. While many studies focus on individual systems or isolated vulnerabilities, holistic approaches that account for inter-dependencies across systems and networks are necessary for more effective risk management. Addressing these challenges will be essential for advancing the next generation of vulnerability prioritization frameworks that are not only technically robust but also practically applicable across various domains and industries.
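To make the "context-aware" bit a little more concrete, this is the kind of scoring I read the authors as gesturing at; the fields, weights, and CVE IDs below are entirely made up by me:

```python
# Toy context-aware vulnerability priority score (entirely illustrative;
# the fields and weights are my own invention, not from the survey).
from dataclasses import dataclass

@dataclass
class Vuln:
    cve_id: str
    cvss_base: float         # 0-10, the static severity score
    exploited_in_wild: bool  # real-time threat intelligence signal
    asset_criticality: float # 0-1, how much the affected system matters to you
    internet_facing: bool

def priority(v: Vuln) -> float:
    """Blend static severity with threat intel and asset context."""
    score = v.cvss_base / 10.0
    if v.exploited_in_wild:
        score *= 1.5
    if v.internet_facing:
        score *= 1.2
    score *= 0.5 + v.asset_criticality  # scale by how much you care about the asset
    return round(min(score, 2.0), 2)

backlog = [
    Vuln("CVE-0000-0001", 9.8, False, 0.2, False),  # scary CVSS, low context
    Vuln("CVE-0000-0002", 6.5, True, 0.9, True),    # modest CVSS, nasty context
]
for v in sorted(backlog, key=priority, reverse=True):
    print(v.cve_id, priority(v))
```

The point of the toy is simply that the modest-CVSS, internet-facing, actively exploited vulnerability comes out on top, which a CVSS-only sort would never tell you.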
Read #4 - ReVeil: Unconstrained Concealed Backdoor Attack on Deep Neural Networks using Machine Unlearning
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a fairly middle-of-the-road paper, but one I actually found pretty good. This should be a grand read for folks interested in researching backdoors, as well as folks generally interested in further understanding how bonkers AI/ML security is.
Commentary: Similar to Read #2, the threat model and assumptions in this paper are fairly strong, but I don't think they're completely nuts. The setup the authors present is a case where the attacker can poison part of a dataset that is used to train a model and then also submit targeted unlearning requests (i.e., please remove images X, Y, and Z from the dataset). The authors then assume that the model trainer uses SISA unlearning (sketched below) to get the model to "unlearn" the requested samples. This seems fairly far-fetched to me, as it rests on a few big assumptions (is a model provider actually going to unlearn your samples, and how would you even know they used your dataset in the first place?).
With that being said, the concept of the backdoor is cool. When a model is trained on the poisoned data, the backdoor is basically complete junk (i.e., the Attack Success Rate (ASR) is very low) until the targeted samples are unlearned. Once unlearning completes, the ASR of the backdoor increases significantly. The authors' experiments cover a range of models, datasets, backdoor triggers, and backdoor detection approaches, and, as in most academic papers, the results are obviously awesome! I was trying to think of a case where this might actually be feasible, and I could imagine it being an issue if there were an equivalent of GDPR's Right to be Forgotten for AI models/training data. The attacker still needs to know a lot to make this work, though.
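For anyone who hasn't come across SISA, the rough idea is to shard the training data, train one constituent model per shard, and aggregate their predictions; unlearning a sample then only requires retraining the shard that contained it. A very rough sketch of that mechanic (my own illustration of SISA-style unlearning, not the ReVeil attack code):

```python
# Very rough sketch of SISA-style unlearning (illustrative; not ReVeil itself).
# Shard the data, train one model per shard, aggregate by voting. Unlearning a
# sample only requires retraining the single shard that contained it.
import numpy as np
from sklearn.linear_model import SGDClassifier

N_SHARDS = 4
rng = np.random.default_rng(0)

# Toy dataset (in ReVeil, some of these samples would be poisoned).
X = rng.normal(size=(400, 20))
y = (X[:, 0] > 0).astype(int)
shards = [list(range(i, len(X), N_SHARDS)) for i in range(N_SHARDS)]

def train_shard(idx):
    return SGDClassifier(random_state=0).fit(X[idx], y[idx])

models = [train_shard(idx) for idx in shards]

def predict(x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return int(round(np.mean(votes)))  # simple majority vote

def unlearn(sample_indices):
    """Remove samples and retrain only the shards that contained them."""
    removed = set(sample_indices)
    for s, idx in enumerate(shards):
        remaining = [i for i in idx if i not in removed]
        if len(remaining) != len(idx):
            shards[s] = remaining
            models[s] = train_shard(remaining)

# The ReVeil idea: the backdoor stays dormant until the attacker asks for
# specific samples to be unlearned, after which the trigger starts working.
unlearn([0, 4, 8])
print(predict(X[1]))
```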
Read #5 - Safety at Scale: A Comprehensive Survey of Large Model Safety
💾: GitHub 📜: arxiv 🏡: Pre-Print
This paper is an absolute monster and is jam-packed full of useful material. If you want a broad and fairly deep romp through large model safety (as well as security), read this.

I have not had a chance to read this from top to bottom (it's the best part of 40 pages!), but I found Section 8, titled "Open Challenges", an insightful read, and this particular part made me really happy (albeit the "All We Need" trope is getting on my nerves!):
While the attack success rate (ASR) is a commonly used metric in safety research, it mainly quantifies how often an attack disrupts a model’s output. However, this metric overlooks several important factors, such as the severity of the disruption, the model’s resilience to various types of attacks, and the real-world consequences of potential failures. A model could still cause harm or mislead decision-making even if its core functionality appears unaffected. For instance, an attack might subtly alter a model’s decision-making process without causing an obvious malfunction, but the resulting behavior could have catastrophic effects in real-world applications. Such vulnerabilities are often missed by traditional metrics like ASR or failure rate. To better understand a model’s weaknesses—whether in its design, training data, or inference process—it is crucial to define multi-level, fine-grained vulnerability metrics. A more comprehensive safety evaluation framework should consider factors such as the model’s susceptibility to different types of attacks, its ability to recover from malicious inputs, and the ethical implications of potential failure modes.
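To put that in concrete terms: ASR treats every successful attack the same, so two evaluations can share an ASR while describing very different levels of harm. A toy illustration (all the numbers below are invented by me):

```python
# Toy illustration of why a bare ASR hides severity (numbers invented by me).
# Both attack sets succeed 6 times out of 10, but the harm profiles differ.
attacks_a = [  # (succeeded, severity 0-1)
    (True, 0.1), (True, 0.1), (True, 0.2), (True, 0.1), (True, 0.2),
    (True, 0.1), (False, 0.0), (False, 0.0), (False, 0.0), (False, 0.0),
]
attacks_b = [
    (True, 0.9), (True, 1.0), (True, 0.8), (True, 0.9), (True, 1.0),
    (True, 0.9), (False, 0.0), (False, 0.0), (False, 0.0), (False, 0.0),
]

def asr(attacks):
    return sum(ok for ok, _ in attacks) / len(attacks)

def severity_weighted(attacks):
    return sum(sev for ok, sev in attacks if ok) / len(attacks)

for name, attacks in [("A", attacks_a), ("B", attacks_b)]:
    print(name, "ASR:", asr(attacks), "severity-weighted:", severity_weighted(attacks))
# Identical ASR (0.6), wildly different severity-weighted scores.
```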
ASR is dead. Long live…oh wait, we don’t have any alternatives.
That’s a wrap! Over and out.