🎩 Top 5 Security and AI Reads - Week #16
Agentic automated cyber ranges, LLMs classifying CVEs, loss functions in deep learning, trajectory-based data poison detection, and RAG compiler fuzzing.
Welcome to the sixteenth instalment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a grand approach to cyber range automation through agentic RAG systems, showing how LLMs can transform cybersecurity training environment generation. Next, we look at interesting research on whether LLMs can correctly identify CVEs and create CVSS vectors, showing their strengths with clear criteria while pointing out difficulties with subjective judgements. We then dive into a comprehensive 172-page survey of loss functions and metrics in deep learning—an invaluable reference for anyone training neural networks. Following that, we examine a clever technique for detecting poisoned training samples by analysing loss trajectories through spectral analysis, providing a novel defence against third-party data manipulation. We wrap up with an exploration of how RAG-based systems can be used for compiler fuzzing, successfully generating test cases that identified actual bugs in cross-architecture compilation workflows.

A note on the images - I ask Claude to generate a Stable Diffusion prompt using the titles of the 5 reads and then use the Stable Diffusion Large or FLUX.1 [dev] Spaces on Hugging Face to generate it.
Read #1 - ARCeR: an Agentic RAG for the Automated Definition of Cyber Ranges
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a grand read for anyone interested in the creation of cyber ranges and how this could potentially be offloaded to an automated LLM-based system.
Commentary: This paper seeks to use LLM agents to generate cyber range definition files - in this case, CYRIS environment definitions. It does this by combining the ReAct prompting strategy with a RAG system that holds all of the relevant documentation for the CYRIS environment definitions and associated APIs, plus a CYRIS environment definition verifier. A user provides an input like “Give me a network of 2 Linux machines connected with a single switch” (or something like that), and the system generates the corresponding CYRIS environment definition. Very cool, tbh!
The authors find that the secret sauce boils down to two things. Firstly, good inputs (in this case, documentation of the CYRIS software and environment schema) make RAG systems better. Who knew docs could help LLM agents, and maybe even humans, use software? 😂 Secondly, the schema verifier: having a verifiable output schema seems to improve results significantly because useful errors can be fed back to the agent to fix its output.
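To make that loop concrete, here is a minimal sketch of the generate-verify-retry pattern. This is my own illustration, not the ARCeR code: `retrieve_docs`, `call_llm` and `validate_definition` are hypothetical callables standing in for the RAG retriever, the LLM call and the CYRIS schema verifier.

```python
from typing import Callable, Tuple

def generate_cyber_range(
    user_request: str,
    retrieve_docs: Callable[[str], str],                     # hypothetical RAG retriever over CYRIS docs
    call_llm: Callable[[str], str],                          # hypothetical ReAct-style LLM call
    validate_definition: Callable[[str], Tuple[bool, str]],  # hypothetical CYRIS schema verifier
    max_attempts: int = 5,
) -> str:
    """Generate a range definition, retrying with verifier feedback until it validates."""
    context = retrieve_docs(user_request)
    feedback = ""
    for _ in range(max_attempts):
        prompt = (
            f"Documentation:\n{context}\n\n"
            f"Request: {user_request}\n"
            f"Previous verifier errors (if any): {feedback}\n"
            "Produce a valid CYRIS environment definition (YAML)."
        )
        definition = call_llm(prompt)
        ok, errors = validate_definition(definition)
        if ok:
            return definition
        feedback = errors  # the useful errors get fed back to the agent for the next attempt
    raise RuntimeError("Verifier never accepted a definition")
```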
The evaluation is pretty strong too, albeit a bit opaque, as it’s not immediately clear what environments are actually being created. I look forward to seeing where this agentic cyber range stuff goes. I feel like it’ll come back to the usual questions – how do I create a Windows VM quickly, and how do I get a decent breadth of boxes in my cyber range network?
Read #2 - Can LLMs Classify CVEs? Investigating LLMs Capabilities in Computing CVSS Vectors
💾: GitHub 📜: arxiv 🏡: Pre-Print
This paper will be a grand read for folks who are responsible for assigning CVSS scores as part of the CVE process, or folks who are interested in automated vulnerability management.

Commentary: This is a short paper that likely represents the start of a longer research project, but it is still a grand read. The key question this paper seeks to answer is, “Given a CVE description, can an LLM generate the CVSS scoring vector?” The short answer is yes, but only for the objective bits. The paper compares LLM-based generation against an embedding-based baseline. The embedding baseline (I think) embeds a given CVE description into a vector, which is then fed into a LightGBM model. The output of this model is a bit unclear in the paper, but let’s assume the authors have nailed it!
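For what it’s worth, here is my guess at what that baseline could look like: embed the description, then train one classifier per CVSS metric. The model choices (all-MiniLM-L6-v2, a default-ish LightGBM) and the toy data are my assumptions, not the paper’s setup.

```python
from sentence_transformers import SentenceTransformer
from lightgbm import LGBMClassifier

# Toy CVE descriptions and labels for a single CVSS metric (Attack Vector).
descriptions = [
    "Buffer overflow in a web server allows remote attackers to execute arbitrary code.",
    "Improper access control allows local users to read sensitive configuration files.",
    "SQL injection in the login form allows remote attackers to bypass authentication.",
    "Race condition allows a local user to escalate privileges via a crafted symlink.",
]
attack_vector = ["NETWORK", "LOCAL", "NETWORK", "LOCAL"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(descriptions)                 # one embedding vector per CVE description

# One such model would be trained per CVSS metric (AV, AC, PR, UI, ...).
clf = LGBMClassifier(min_child_samples=1)        # relaxed setting purely for this tiny toy dataset
clf.fit(X, attack_vector)

new_cve = encoder.encode(["Heap overflow allows remote attackers to crash the service."])
print(clf.predict(new_cve))                      # -> predicted Attack Vector value
```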
The authors experiment with various prompting strategies and the inclusion of different supporting information, such as example CVSS vectors and CWE information. These help a little bit, but the key finding is:
Our experiments demonstrated that while LLMs can achieve high accuracy on certain CVSS elements, particularly those with more objective criteria, they struggle with more subjective dimensions compared to traditional embedding-based methods.
The authors propose exploring a hybrid method that combines the LLM-based approach with the embedding one. Almost sounds like it’s going to be another RAG approach! I was thinking about how else this could be improved, and I think the biggest win would be to include the vulnerable code snippet/file itself (if available). This would help the LLM contextualise the CVE and likely produce better results on the subjective criteria.
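As a rough illustration of that suggestion (purely hypothetical, not something from the paper), the prompt construction could look like the sketch below; `build_cvss_prompt` and its inputs are made up for the example.

```python
def build_cvss_prompt(cve_description: str, cwe_info: str, vulnerable_code: str | None) -> str:
    """Build a CVSS-scoring prompt, optionally including the vulnerable code for extra context."""
    prompt = (
        "You are scoring a vulnerability using CVSS v3.1.\n\n"
        f"CVE description:\n{cve_description}\n\n"
        f"CWE information:\n{cwe_info}\n"
    )
    if vulnerable_code:
        # The extra context that should help most with the subjective metrics.
        prompt += f"\nVulnerable code snippet:\n{vulnerable_code}\n"
    prompt += (
        "\nReturn only the CVSS v3.1 vector string, "
        "e.g. CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    )
    return prompt
```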
Read #3 - A comprehensive survey of loss functions and metrics in deep learning
💾: N/A 📜: Springer 🏡: Artificial Intelligence Review
This is a grand read for folks who are interested in the loss functions used to train a huge range of models, as well as the metrics used to evaluate them.
Commentary: I have been using it as a reference during my part-time PhD. The paper’s biggest strengths are a) how comprehensive it is (in terms of the range of losses/metrics included) and b) how each loss function is presented: a short introduction, the key equation(s), a short bullet-point list of key properties/benefits/drawbacks, and sometimes a comparison with other losses. It groups losses and metrics by task/type of ML – regression losses/metrics, then classification losses/metrics – and then drills into specific task types like image classification.
If you are in the game of training networks, I’d definitely check this out. At 172 pages, it is an absolute monster but packed full of useful, actionable stuff.
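To give a flavour of the kind of loss the survey catalogues, here is plain cross-entropy alongside focal loss (a common variant that down-weights easy examples), sketched in PyTorch. The framework choice is mine; the survey itself is framework-agnostic.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss: (1 - p_t)^gamma * cross-entropy, averaged over the batch."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample cross-entropy
    p_t = torch.exp(-ce)                                     # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 5)                  # batch of 8, 5 classes
targets = torch.randint(0, 5, (8,))
print(F.cross_entropy(logits, targets))     # standard classification loss
print(focal_loss(logits, targets))          # focal variant, down-weighting easy examples
```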
Read #4 - Try to Poison My Deep Learning Data? Nowhere to Hide Your Trajectory Spectrum!
💾: GitHub 📜: arxiv 🏡: Network and Distributed System Security (NDSS) Symposium 2025
This is a grand read for folks looking for approaches to protect against third-party data poisoning (e.g. from external labellers).
Commentary: This paper seeks to provide a method for identifying poisoned samples within a dataset that you may not have labelled yourself. The threat model assumes that whoever did the labelling has slipped some poisoned samples into the dataset. The way they go about this is pretty clever but has some drawbacks.
The method itself relies on collecting what the authors call the loss trajectory: the loss value for each training sample at each epoch of a training run of N epochs (basically a big list of floating-point numbers per sample). Rather than take the entire loss trajectory, it is truncated/snipped once the validation loss hits a defined threshold. The truncated trajectory is then put through an autoencoder to create an embedding before being transformed using a fast Fourier transform. This transformed representation is then clustered using DBSCAN to identify poisoned samples. If you have just read the above and thought, “Wow, that is complicated,” I did too. I am not 100% sure I fully understand how the various transforms take a loss trajectory from a big list to clustered points, but I’ll take the authors’ word for it and dig into the code at some point.
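Here is a much-simplified sketch of how I picture the pipeline, using toy trajectories and skipping the autoencoder embedding and validation-loss truncation steps entirely – treat it as an illustration of the idea rather than the authors’ method.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Toy loss trajectories of shape (n_samples, n_epochs). Clean samples converge
# smoothly; "poisoned" samples converge more slowly with noisier losses.
epochs = np.arange(50)
clean = np.exp(-epochs / 10) + 0.02 * rng.standard_normal((950, 50))
poisoned = np.exp(-epochs / 40) + 0.15 * rng.standard_normal((50, 50))
trajectories = np.vstack([clean, poisoned])

# Frequency-domain representation of each trajectory via a real FFT.
spectra = np.abs(np.fft.rfft(trajectories, axis=1))

# Density-based clustering; low-density points (label -1) are flagged as suspect.
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(spectra)
suspect = np.where(labels == -1)[0]
print(f"{len(suspect)} samples flagged as potential poison")
```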
At a high level, I like the idea of using loss as a way of spotting poisoned samples, but this approach leaves me with one main question – could this just be done post-training with a pre-trained model instead? This could still provide you with the outlier losses to cluster but save you the cost of having to train.
Read #5 - RAG-Based Fuzzing of Cross-Architecture Compilers
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a great read for folks who are interested in seeing how LLMs and RAG-based systems can help with fuzzing/automated test case generation.
Commentary: I enjoyed this paper because its focus is on finding bugs within compilers using source code as input. The approach combines LLMs with RAG to generate automated test cases for Intel’s SYCL compiler. The RAG setup itself is fairly workaday and not too novel, but applying it to compiler bug finding is, I think, novel. It generates 100+ SYCL code samples, and through differential fuzzing (same input, different targets, which you’d expect to produce the same output; if the outputs differ, a bug has been found!), the authors find 1 patchable bug, which is pretty cool! A section that hides a bit in plain sight is Section V, which outlines limitations and future work. This is a wee goldmine of interesting extensions to the proposed framework, and hopefully they will be published too (with the source code next, please).
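For flavour, a bare-bones differential fuzzing harness in that spirit might look like the sketch below. The compiler commands are placeholders, not the actual toolchain invocations from the paper.

```python
import subprocess

def compile_and_run(compiler_cmd: list[str], source: str, binary: str) -> str:
    """Compile `source` with the given compiler command and return the program's stdout."""
    subprocess.run([*compiler_cmd, source, "-o", binary], check=True)
    return subprocess.run([f"./{binary}"], capture_output=True, text=True, check=True).stdout

def differential_test(source: str, compiler_a: list[str], compiler_b: list[str]) -> bool:
    """Return True if the two compilation targets disagree on output, i.e. a candidate bug."""
    out_a = compile_and_run(compiler_a, source, "test_a")
    out_b = compile_and_run(compiler_b, source, "test_b")
    return out_a != out_b

# Each LLM/RAG-generated sample would be pushed through differential_test(), e.g.
# differential_test("generated_case_001.cpp", ["compiler-a"], ["compiler-b"]);
# any mismatch is a candidate compiler bug to triage.
```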
That’s a wrap! Over and out.