Top 5 Security and AI Reads - Week #32
Attention-based AI forensic analysis, research software supply chain vulnerabilities, autonomous AI red-teaming frameworks, LLM PII redaction capabilities, and academic peer review identity theft.
Welcome to the thirty-second instalment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a romp through AttnTrace, an attention-based approach for post-attack forensic analysis that identifies which parts of the LLM context contributed to malicious outputs, offering promising capabilities for AI incident response. Next, we examine a critical assessment of software supply chain security practices in research repositories, revealing concerning gaps when evaluated against OpenSSF Scorecard metrics and providing actionable recommendations for improvement. We then jump into ASTRA, an ambitious autonomous red-teaming framework that maps AI vulnerability spaces across spatial and temporal dimensions using knowledge graphs and Monte Carlo methods to generate comprehensive evaluations and datasets. Following that, we explore research on LLM capabilities for PII redaction, presenting both comprehensive performance evaluations and open-source fine-tuned models that balance redaction effectiveness with semantic preservation. We wrap up with a brief but eye-opening investigation into identity theft within AI conference peer review systems, highlighting a concerning trend that likely represents just the tip of the iceberg!
A note on the images - I ask Claude to generate a Stable Diffusion prompt from the titles of the 5 reads and then use FLUX.1 [dev] on Hugging Face to generate the image.
Read #1 - AttnTrace: Attention-based Context Traceback for Long-Context LLMs
Code: GitHub | Paper: arXiv | Status: Pre-print
This should be a good read for folks interested in techniques for AI attack incident response.
Commentary: I am not sure if I have been under a rock or if this area of context traceback is really new. Either way, this paper was the first time I had come across it, and I think it's awesome. The general idea is to give a model another model's (or its own) output plus the context it was given, and work out which parts of that context caused or contributed to the output. The authors frame this as "post-attack forensic analysis", and I think it is a genuinely interesting framing, especially if you want to do this at scale without humans reading everything.
The threat model is apparently defined similarly to previous studies, whereby the attacker can add arbitrary bad stuff into the context (I am assuming via prompt injection), and the defender is logging all the inputs and has some sort of signal to identify malicious outputs (e.g., rules or user reports). The goal of the defender is to identify the malicious part of the input that caused a given output.
The approach works by stitching together the instruction (I think they mean the system prompt), the context and the output from an LLM, and feeding them into a locally hosted LLM (or at least an LLM you have attention-weight access to). This checker LLM is then introspected at the level of average attention activations to identify which tokens within the context contribute most to the output. The method for actually doing this is a bit of fancy sampling combined with a fancy sliding window to get an average attention weight across chunks/spans of the context. A rough sketch of the core idea is below.
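Here is a minimal sketch of my reading of that core idea (average attention from output tokens back to context chunks), not the authors' AttnTrace implementation; the model name and the chunking of the context are placeholders, and the real method layers the sampling and sliding-window tricks on top of this.

```python
# Minimal sketch of attention-based context traceback: my reading of the
# general idea, NOT the authors' AttnTrace implementation. Assumes a
# HuggingFace causal LM you can pull attention weights from; the model name
# and the pre-chunked context are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checker model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def trace_context(instruction: str, context_chunks: list[str], output: str, top_k: int = 3):
    """Score each context chunk by the average attention the output tokens pay to it."""
    # Build one flat sequence: instruction + context chunks + output,
    # remembering where each part starts and ends.
    parts = [instruction] + context_chunks + [output]
    ids_per_part = [tok(p, add_special_tokens=False).input_ids for p in parts]
    input_ids = torch.tensor([sum(ids_per_part, [])])

    with torch.no_grad():
        attentions = model(input_ids, output_attentions=True).attentions
    # Average over layers and heads -> one [seq, seq] attention map.
    attn = torch.stack(attentions).mean(dim=(0, 2))[0]

    bounds, pos = [], 0
    for ids in ids_per_part:
        bounds.append((pos, pos + len(ids)))
        pos += len(ids)
    out_start, out_end = bounds[-1]

    # For each context chunk, average the attention flowing from output tokens to it.
    scores = [
        (context_chunks[i], attn[out_start:out_end, s:e].mean().item())
        for i, (s, e) in enumerate(bounds[1:-1])
    ]
    return sorted(scores, key=lambda x: -x[1])[:top_k]
```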
The approach looks like it performs well across the models they used but definitely isn't a silver bullet. I am not familiar with the datasets used within the paper, but for some of them the approach is near perfect (0.99), while for others it scores much lower. I am interested to see where this area as a whole goes. It would be very cool if you could encode the results from these approaches and then use them somehow during inference!
Read #2 - Evaluating Software Supply Chain Security in Research Software
Code: N/A | Paper: arXiv | Status: Pre-print
This should be a great read for folks who are interested in software supply chain security.
Commentary: This is a pretty short paper, but it looks at something that is pretty important for academia and researchers in general. It examines the security practices of open-source research software repositories and finds they are pretty bad when scored using the OpenSSF Scorecard. What the authors didn't cover much, and I wish they had, was a qualitative assessment of how much of the research software (which was taken from the Research Software Directory of the Netherlands eScience Centre) is actually used in production. I think the proportion would be much higher than folks expect! The authors don't just throw stones, though: there is a great section towards the end that covers several recommendations that are low effort but raise the bar. I should probably take the recommendations forward for my own stuff, and so should you!
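On that note, it is easy to see where your own repositories stand. Here is a rough sketch that scores a list of repos with the OpenSSF Scorecard CLI and prints the headline and per-check scores; it assumes the `scorecard` binary is installed and a GitHub token is exported, the repository name is a placeholder, and the JSON field names may vary slightly across Scorecard versions.

```python
# Rough sketch: score a list of repositories with the OpenSSF Scorecard CLI.
# Assumes the `scorecard` binary is installed and GITHUB_AUTH_TOKEN is set;
# the repository name below is a placeholder.
import json
import subprocess

REPOS = ["github.com/example-org/research-tool"]  # placeholder repo list

for repo in REPOS:
    result = subprocess.run(
        ["scorecard", f"--repo={repo}", "--format=json"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    print(f"{repo}: aggregate score {report.get('score')}")
    for check in report.get("checks", []):
        print(f"  {check.get('name')}: {check.get('score')}")
```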
Read #3 - ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants
Code: N/A | Paper: arXiv | Status: Pre-print
This is a grand read for folks interested in autonomous red teaming/evaluation of AI as well as folks interested in interesting use cases of knowledge graphs and Monte Carlo methods. It feels somewhat incomplete, though, so it will probably get an update (for example, there is no clear section for Stage 3 of the image below).
Commentary: This paper took a couple of reads to get my head around, as the authors introduce a few concepts (spatial and temporal) for which I had a very different intuition in my head. The description of the end-to-end approach is very detailed and a bit challenging to decipher, but I think I have worked it out.
At a high level, the approach seeks to map the vulnerability space of a model and then use these findings to generate datasets. The dataset and fine-tuning bits don't seem to be covered in the paper, though! The vulnerability in this case is the model's susceptibility to generating buggy code. The approach maps this space across two axes, which the authors define as spatial and temporal: spatial looks at safety-violation-inducing regions in the input space, and temporal looks at failures in the reasoning logic, i.e., the outputs!
The methods used to actually implement this approach are wide-ranging, with references to LLM judges, Gibbs sampling and static analysis. I really hope the authors drop the source code, especially for generating the knowledge graphs, as I think it would make it much easier to get a feel for what's going on and how it could be extended. For example, there is a suggestion that they amend prompts to include MITRE ATT&CK TTPs; this could be very useful for a wide range of security-focused LLM research if there is a repeatable method. A toy illustration of the overall search loop is sketched below.
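Purely as an illustration of the shape of such a search (and emphatically not ASTRA's actual pipeline), here is a toy Monte-Carlo-style loop that mutates a seed coding prompt and keeps candidates a stand-in judge scores as riskier; the mutation operators and the judge are crude placeholders for the knowledge-graph-guided mutations, LLM judges and static analysis the paper describes.

```python
# Toy illustration only: a Monte-Carlo-style search over prompt mutations,
# scored by a stand-in "judge". Everything here is a placeholder for the far
# more sophisticated machinery described in the paper.
import random

# Placeholder mutation operators, loosely standing in for knowledge-graph edges
# (e.g. swapping the task's language or nudging towards risky patterns).
MUTATIONS = [
    lambda p: p + " using raw SQL string concatenation",
    lambda p: p + " and skip input validation for speed",
    lambda p: p.replace("Python", "C"),
]

def judge(prompt: str) -> float:
    """Stand-in for an LLM judge / static analyser scoring how likely the
    prompt is to elicit insecure code. Here: a crude keyword heuristic."""
    risky = ["concatenation", "skip input validation", "C"]
    return sum(word in prompt for word in risky) / len(risky)

def red_team_search(seed: str, iterations: int = 50, rng_seed: int = 0) -> tuple[str, float]:
    rng = random.Random(rng_seed)
    best, best_score = seed, judge(seed)
    current, current_score = best, best_score
    for _ in range(iterations):
        candidate = rng.choice(MUTATIONS)(current)
        score = judge(candidate)
        # Accept uphill moves, and occasionally downhill ones to keep exploring.
        if score >= current_score or rng.random() < 0.1:
            current, current_score = candidate, score
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

prompt, score = red_team_search("Write a Python function that stores user sign-up data")
print(f"score={score:.2f}  prompt={prompt}")
```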
Read #4 - PRvL: Quantifying the Capabilities and Risks of Large Language Models for PII Redaction
Code: N/A | Paper: arXiv | Status: Pre-print
This should be a good read for folks interested in identifying and removing PII from free-form text.
Commentary: I enjoyed this paper a lot. There are two main contributions. The first is a comprehensive evaluation of a range of models' ability to redact PII, with a focus on redaction performance, semantic preservation and the level of PII leakage. The second is a suite of open-source models that have been fine-tuned for PII redaction, along with evaluation tooling to support future research. Overall, the models they have dropped look pretty good, with instruction-tuned models performing a tad better than fine-tuned ones.
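For a flavour of what prompt-based redaction looks like in practice, here is a minimal sketch using an instruction-tuned model through the HuggingFace transformers text-generation pipeline; this is not the PRvL models or evaluation harness, the model name is a placeholder, and a recent transformers version with chat-style pipeline inputs is assumed.

```python
# Minimal sketch of prompt-based PII redaction with an instruction-tuned model.
# NOT the PRvL models or evaluation harness; the model name is a placeholder
# and a recent transformers version with chat-style inputs is assumed.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

TEXT = "Hi, I'm Jane Doe. You can reach me at jane.doe@example.com or +1 555 0123."

messages = [
    {
        "role": "system",
        "content": (
            "Redact all personally identifiable information in the user's text. "
            "Replace each PII span with a tag such as [NAME], [EMAIL] or [PHONE], "
            "and change nothing else so that the meaning is preserved."
        ),
    },
    {"role": "user", "content": TEXT},
]

result = generator(messages, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"][-1]["content"])  # the redacted text
```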
Read #5 - Identity Theft in AI Conference Peer Review
Code: N/A | Paper: arXiv | Status: Pre-print
This is a short paper, but it is an interesting thing to have been found (several of the paper's authors are from OpenReview).
Commentary: I am not very surprised by this at all and think this paper is probably just the tip of the iceberg when it comes to these practices. Whilst I don't enjoy getting my papers kicked to touch by reviewer 2, I had never thought of trying to become reviewer 2. xD
That's a wrap! Over and out.