🎩 Top 5 Security and AI Reads - Week #26
AI-powered red teaming evaluation, real-world bug bounty automation, hardware security verification, function-level vulnerability detection, and AI safety vs security definitions.
Welcome to the twenty-sixth instalment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're opening with an exciting evaluation of autonomous AI red teaming capabilities through AIRTBench, where models tackle CTF-style challenges with surprising success rates and interesting performance variations. Next, we dive into BountyBench's real-world assessment of AI agents hunting vulnerabilities for actual bounty rewards, revealing that while detection rates are modest, patch success rates are remarkably high and cost-effective. We then explore an innovative multi-agent system designed for hardware security verification, showing how specialised LLM agents can improve System-on-Chip security validation. Following that, we examine FuncVul's approach to function-level vulnerability detection, demonstrating how LLM-filtered datasets can enhance traditional models and revealing that smaller code chunks often outperform full function analysis. We conclude with a crucial conceptual paper that clarifies the important distinction between AI safety and AI security, offering much-needed clarity on these often-conflated domains through clear definitions rooted in established security principles.

A note on the images - I ask Claude to generate a Stable Diffusion prompt from the titles of the 5 reads and then use FLUX.1 [dev] on HuggingFace to generate the image.
Read #1 - AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
💾: GitHub 📜: arxiv 🏡: Pre-Print
This is a grand read for folks interested in the progress being made with evaluating autonomous AI red teaming capabilities.
Commentary: I enjoyed this paper. It’s a sequel to the #1 Read from Week 18 by the same authors, Dreadnode. This paper uses a similar methodology, but instead of looking at user vs. automated attackers from historic data, it actually runs the evaluation. The challenges are CTF-style tasks taken from the Crucible platform, and a range of both frontier and open-source models are evaluated. From the looks of things, the LLM has access to a limited set of tools, which in this case are code execution via Jupyter notebooks, the ability to reset environment state (of the notebook, I think), and the use of “challenge interfaces”. It would be good to understand these in more detail. There is some interesting work to do here on what level of feedback is required for a successful attack and how that affects performance.
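For anyone who hasn't seen this style of harness, here is a hypothetical sketch of what that kind of minimal tool surface could look like; the names and shapes are my own, not Dreadnode's actual implementation.

```python
# Hypothetical sketch only: a minimal agent tool surface like the one described
# (code execution, environment reset, a challenge interface). Names and shapes
# are mine, not Dreadnode's actual harness.
from dataclasses import dataclass, field


@dataclass
class ChallengeSession:
    """Toy stand-in for a CTF-style challenge the agent interacts with."""
    flag: str
    executed: list = field(default_factory=list)

    def execute_code(self, source: str) -> str:
        # In the real setup this would run inside a Jupyter kernel; here we
        # just record the attempt and return a canned response.
        self.executed.append(source)
        return f"ran {len(source)} characters of code"

    def reset(self) -> None:
        # Reset the notebook/environment state between attempts.
        self.executed.clear()

    def submit_flag(self, candidate: str) -> bool:
        # The "challenge interface": check a candidate flag.
        return candidate == self.flag


session = ChallengeSession(flag="FLAG{example}")
session.execute_code("print('probing the target service')")
print(session.submit_flag("FLAG{example}"))  # True
```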
Unsurprisingly, Claude 3.7 Sonnet tops out performance-wise but is closely followed by OpenAI et al. The bit I found most interesting here was the use of two different metrics to report performance. The first is Suite Success Rate, and the second is Overall Success Rate. The equations are below:
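Roughly, as I read them (my own notation, so the paper's exact formulation may differ): the suite success rate credits a challenge if any of a model's runs solve it, while the overall success rate counts every individual run.

```latex
% My interpretation, not the paper's exact notation.
% C = set of challenges, R_c = runs against challenge c,
% s(c, r) = 1 if run r solves challenge c, else 0.
\text{Suite Success Rate} = \frac{1}{|C|} \sum_{c \in C} \mathbb{1}\left[\exists\, r \in R_c : s(c, r) = 1\right]
\qquad
\text{Overall Success Rate} = \frac{\sum_{c \in C} \sum_{r \in R_c} s(c, r)}{\sum_{c \in C} |R_c|}
```

That reading would also explain why the overall number is always the lower of the two.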
What is interesting is that every model’s overall success rate was significantly lower than its reported suite success rate. There is a bit of missing info here around how many runs were used per model (I may have missed it in the paper), but if this was collected, there is a really interesting analysis to be done on whether some models get fewer challenges right overall but, when they do get one right, do so more quickly or slowly. There is also some difference analysis to do between Suite Success Rate and Overall Success Rate: for Gemini 2.5 Pro, the gap between the two metrics is approximately 21.5%, but for GPT-4.5 it is 12%.
There is a section on Humans vs Agents which I am not a fan of. I feel like the Crucible platform is used either by folks who are brand new to the area or by very experienced folks, which makes for a bit of an odd sample! It might be worth creating a set of closed evals/CTFs and getting a controlled group of people to have at them before doing this LLM eval again.
A few other tidbits. Section 6 has some cool analysis on how chat length/number of tokens is a heuristic for better performance, but excessive amounts are a bad time. The authors also highlight a big gap between frontier AI models and open-source ones. I wonder if this is some sort of benchmark fixing, or whether frontier model providers are using their models to red team themselves and are therefore adding training data that makes them better at it? Who knows!
Read #2 - BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
💾: Github.io Website 📜: arxiv 🏡: Pre-Print
This paper is a grand read for folks interested in automated bug bounties and making $$$.
Commentary: This paper (along with several from the same group) dropped earlier this week and caught my eye. Who doesn’t want to pay API key costs and farm real money rather than chopping trees in RuneScape!
This paper is focused on identifying web vulnerabilities within open source projects autonomously. It does this by breaking the problem down into 3 stages: detect, exploit and patch. Each of these stages is evaluated separately to get an understanding of the strengths and weaknesses of the models.
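Roughly how I read that setup, as a hypothetical sketch (the task shape and scoring helper below are mine, not BountyBench's actual interface): each stage is scored independently, so a missed detection does not hide how well an agent exploits or patches a known bug.

```python
# Hypothetical sketch of the detect / exploit / patch split as I read it;
# the names here are mine, not BountyBench's actual interface.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BountyTask:
    repo: str
    bounty_usd: float
    detect: Callable[[], Optional[str]]   # returns a vulnerability report, or None
    exploit: Callable[[str], bool]        # proof-of-concept given a known vulnerability
    patch: Callable[[str], bool]          # fix passes regression tests and blocks the exploit


def score_stages(task: BountyTask, known_vuln: str) -> dict:
    """Score each stage independently, rather than chaining exploit/patch
    behind the agent's own detection."""
    return {
        "detect": task.detect() is not None,
        "exploit": task.exploit(known_vuln),
        "patch": task.patch(known_vuln),
    }
```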
The most compelling bit is the table below:
Whilst the detect success rate is naff (I bet checking the false positives cost more in dollars/time than the bounties paid out), the patch bounty rate is pretty damn high and costs basically nothing! The projects looked at are all AI-centric but are real targets with varying levels of complexity, which is nice to see, honestly. It feels like these results have added credibility compared to some of the previous research looking at smaller/lab datasets. All of the projects can be found in Appendix B.
There is a particular bit in the appendix worth highlighting. Some of the figures show agent performance at each of the stages (detect, exploit and patch). Patch, where the models perform best, is very bad at cryptographic failures, security misconfigurations and logging/monitoring failures. This can probably be explained by cryptography being very hard and the config/monitoring categories requiring lots of external information the code does not provide. I hope someone digs into that area in more detail. Understanding the tasks where this stuff excels and where it fails is critical.
Read #3 - SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models
💾: N/A 📜: arxiv 🏡: Pre-Print
This is an interesting read for folks interested in how LLMs can be used for hardware security stuff.

Commentary: I enjoyed this paper but struggled writing it up, as I am not a hardware person. The authors present the approach as a multi-agent system, but there seems to be limited interaction between the agents: each agent is a specialist for a particular stage of hardware verification and has few linkages to the others.
That being said, screenshots throughout the paper suggest the authors have created an end-to-end system with a chat UI, which is very cool! The lion's share of the paper is focused on design and implementation, but towards the end there is some performance evaluation. This is fairly limited though, so I hope it’ll be expanded. The results presented in Fig. 19 for bug-validating testbench generation (I think they mean creating a test case for a detected bug to ensure it isn’t there!) show that the authors' setup significantly increases performance. For example, the OpenAI o1 model goes from a 20% success rate without the system to 82% with it. It’ll be interesting to see if this gets iterated on and picked up. Open-source code would be great too!
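To make the "specialist agent per verification task" idea concrete, here is a hypothetical sketch of how a thin supervisor might route queries; the agent names and routing logic are my own reading of the paper, not the authors' implementation.

```python
# Hypothetical sketch: one specialist agent per verification task with a thin
# supervisor routing to them. Agent names and logic are my own reading of the
# paper, not the authors' implementation.
from typing import Callable, Dict


def asset_identification_agent(query: str) -> str:
    return f"[assets] security assets identified for: {query}"


def vulnerability_analysis_agent(query: str) -> str:
    return f"[vuln] design analysed for weaknesses in: {query}"


def testbench_generation_agent(query: str) -> str:
    return f"[testbench] bug-validating testbench generated for: {query}"


SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "assets": asset_identification_agent,
    "vulnerability": vulnerability_analysis_agent,
    "testbench": testbench_generation_agent,
}


def route(task: str, query: str) -> str:
    """A thin supervisor: hand the query straight to the relevant specialist.
    Note the lack of agent-to-agent chatter, matching my read that the agents
    are mostly independent specialists."""
    return SPECIALISTS[task](query)


print(route("testbench", "AES key schedule module"))
```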
Read #4 - FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk
💾: GitHub 📜: arxiv 🏡: Pre-Print
This should be a grand read for folks interested in source code vulnerability detection.
Commentary: This paper is another one that adopts the idea that LLMs can be used to filter/label datasets; in this case, code snippets are labelled as either vulnerable or not vulnerable. Once this is done, the derived datasets are used to fine-tune a GraphCodeBert model. It is not 100% clear if the authors do anything to the input data other than use GraphCodeBert’s representation (which is a graph of variable data relationships). They then compare this fine-tuned model against SOTA source code vulnerability detectors for C/C++ and Python, and it marginally beats the SOTA.
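A condensed, hypothetical sketch of the overall recipe as I read it: prompt an LLM to label snippets as vulnerable or not, then fine-tune GraphCodeBert on the derived dataset. The labelling stub and the Hugging Face boilerplate below are mine, not the paper's actual pipeline.

```python
# Condensed, hypothetical sketch: LLM-labelled snippets feeding a GraphCodeBert
# fine-tune. The labelling stub and training boilerplate are mine, not
# FuncVul's actual pipeline.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


def llm_label(snippet: str) -> int:
    """Stand-in for the LLM-as-labeller step; in practice you would prompt a
    model to answer 'vulnerable' / 'not vulnerable' for each snippet."""
    return int("strcpy(" in snippet)  # toy heuristic in place of the LLM call


snippets = ["strcpy(buf, user_input);", "size_t n = strnlen(s, cap);"]
dataset = Dataset.from_dict({
    "text": snippets,
    "label": [llm_label(s) for s in snippets],
})

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/graphcodebert-base", num_labels=2)

tokenized = dataset.map(lambda ex: tokenizer(
    ex["text"], truncation=True, padding="max_length", max_length=128))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="funcvul-sketch", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()
```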
The subsequent sections in the results have some interesting experiments. The first one compares code chunks vs. full functions as input and finds that code chunks work better (specifically the vulnerable line plus 3 lines before and 3 lines after, so 7 lines in total). This kind of makes sense, because it is probably easier to identify issues in smaller snippets of code than in long ones. I do wonder, however, if this is fair when comparing against the other models, which have been trained on full functions. Maybe this is an avenue of research they might take!
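For illustration, a quick sketch of that chunking idea (my approximation of the rule, not the paper's exact extraction code):

```python
# Quick sketch of the 7-line code-chunk idea: the vulnerable line plus three
# lines of context on either side. My approximation, not the paper's exact
# extraction logic.
def extract_chunk(source: str, vuln_line: int, context: int = 3) -> str:
    """Return the 1-indexed vuln_line with up to `context` lines before/after."""
    lines = source.splitlines()
    start = max(0, vuln_line - 1 - context)
    end = min(len(lines), vuln_line + context)
    return "\n".join(lines[start:end])


example = "\n".join(f"line {i}" for i in range(1, 11))
print(extract_chunk(example, vuln_line=5))  # lines 2 through 8
```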
Unfortunately, there is limited information on the composition of the datasets in terms of the types of vulnerabilities. This would be a great addition, especially as certain approaches may perform significantly differently depending on the target vulnerabilities. That could lead to some interesting ensembles or combinations of methods.
Read #5 - AI Safety vs. AI Security: Demystifying the Distinction and Boundaries
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a must-read for folks interested in AI safety, security or both.
Commentary: I hope folks reach out to me after reading this. I found the perspectives presented in this paper really useful, and I largely agree with them; they map well onto the classic cryptography/network security definitions!
Safety is concerned with mitigating risks from non-malicious failure modes, such as software bugs, specification errors, or misaligned objectives; whereas Security addresses adversarial threats arising from deliberate attempts to subvert, manipulate, or exfiltrate from AI systems.
That’s a wrap! Over and out.