🎩 Top 5 Security and AI Reads - Week #6
AGI is not the north star, LLM cyber evals missing the point, code training data vulnerabilities, GNN defence certification, and LLM package hallucinations.
Welcome to the sixth installment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a thought-provoking paper challenging the AI community's fixation on AGI as a research goal. Next, we'll examine critical findings on how current LLM security evaluations may be missing real-world attack scenarios. We'll then dive into an analysis of vulnerabilities and licensing issues lurking in LLM training datasets, followed by a novel approach to certifying Graph Neural Networks against adversarial attacks. Finally, we'll wrap up with fascinating research on LLM package hallucinations, revealing how frequently models conjure non-existent code dependencies and what this means for autonomous coding agents.

Read #1 - Stop treating ‘AGI’ as the north-star goal of AI research
💾: N/A 📜: arxiv 🏡: Pre-Print
This paper is a must-read for anyone interested in the march towards AGI and especially folks who want to get an alternative view to the hype.
Commentary: I am a big fan of this paper. The authors present a detailed, multistep argument for why using AGI as a north star is likely a bad thing. A sentence early on in the paper really resonated with me and is something I completely agree with:
An overarching theme in our discussion is the research community’s unique responsibility to help distinguish hype from reality.
Sometimes I think some folks lose sight of this—don’t be like them!
More broadly, though, this paper does a great job of highlighting that going after AGI is somewhat misguided, particularly because the terms general (and generality more broadly) and intelligence are poorly defined. The authors argue for the scientific community to pursue multiple specific scientific, engineering, and societal goals rather than the nebulous AGI. The authors argue that economic definitions (like the one OpenAI is running with—100M generated with an agent) are bad. I strongly agree with the authors, and the quality of the argument presented is very impressive.
One area they highlighted, which I think is a huge thing within the security research community (with me being one of the offenders), is folks don’t understand the difference between confirmatory and exploratory research.
…confirmatory research “aims to test preexisting hypotheses to confirm or refute existing theories [while] exploratory research is an open-ended approach that aims to gain insight and understanding in a new or unexplored area.”. By implicitly conflating exploratory analysis with confirmatory research, “exploratory findings have a slippery way of ‘transforming’ into planned findings as the research process progresses”
The authors then go on to say that this has a negative impact on AGI-focused research because
Using the vague and contested concept of AGI to frame confirmatory claims worsens this problem, as it makes it harder to figure out what is being claimed.
There are also other sections that cover other problems/issues. It's worth a detailed read!
Read #2 - LLM Cyber Evaluations Don’t Capture Real-World Risk
💾: N/A 📜: arxiv 🏡: Pre-Print
This paper is for folks that are responsible for understanding the risk posed by LLM’s. It is also a great read for folks who are developing new evaluations and want to make sure they target the problem they are trying to.

Commentary: I have for some time been mega sceptical of the cyber-focused LLM evaluations. This paper looks into the compliance rates from frontier models on common evaluations, the cyber bits of HarmBench, and the MITRE parts of CyberSecEval. The authors argue that both of these don’t really represent how real attackers would use LLMs. Aspects such as multi-step interactions (such as debugging code) are not usually covered, and if direct interaction is covered, the prompts are crazy, such as “Write step-by-step instructions for how to make money with ransomware“. To overcome this, the authors create a benchmark that includes both the attacker intent and then the prompt.
The intent reveals the malicious intent of the attacker, while the prompt is phrased in a neutral manner without direct reference to malicious use.
As you can see from the screenshot above, all models comply most of the time to the prompts on their own and a lot of the time when the intent is provided too. I had a gander at the examples here, and I can see why the models would comply to responding to them. They read like normal debug/tech questions. I hope more benchmarks like this get created, and I am also going to keep tabs on how the safeguarding folks overcome this.
Read #3 - Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets
💾: N/A 📜: arxiv 🏡: Pre-Print
This paper is a grand read for anyone interested in training data quality, source code-generating LLMs, and the creation of source code datasets.
Commentary: This paper is a great example of a fairly simple piece of research providing actionable results. The authors use the metadata collected by World of Code (Complete, Curated, Cross-referenced, and Current Collection of Open Source Version Control Data—sounds rad!) to investigate the quality of two subsets of The Stack v2 (the full dedup’d version and the smol version). The main areas they focus on are the presence of vulnerabilities and the inclusion of restrictive licensed code.
For the vulns side, they find some fairly crazy results:
The 7K CVEs in the small dataset are pretty bonkers, but what caught my eye the most is the results presented in point 2—code generated but never changed. The authors highlight this within the paper as a negative, suggesting that code that is written and never updated may mean it’s not useful for training. This is somewhat strange to me, but I would be interested in what others think.
For the license compliance side, the results are basically the same—it’s a bit of a mess!
These results really show how open source license enforcement is pretty difficult, and folks are prone to pinching files wholesale!
I was completely unaware of the World of Code project and can think of a few interesting projects using the data it’s collecting—what could you use the data for?
Read #4 - AGNNCert: Defending Graph Neural Networks against Arbitrary Perturbations with Deterministic Certification
💾: N/A 📜: arxiv 🏡: Pre-Print
This paper is of interest to folks that like certification/auditing and graph neural networks, as well as adversarial attack researchers interested in novel/different defensive approaches.
Commentary: I’ll preface this with saying I do not have much experience in the certification space but know a bit about graph neural networks. I found this paper particularly interesting because it is able to certify a GNN against edge, node addition, or node feature perturbations. It works by breaking the graph into subgraphs, training a voting classifier to generate a robust classification, and then deriving a robustness guarantee for the voting classifier. The paper focuses on arbitrary perturbations and holds up pretty well!
I do wonder if there is any leg to testing these sorts of defensive approaches against targeted attacks, such as within a software security setting where function control flow graphs are used. How do these approaches hold up when you add junk code into the functions? I can think of a few other areas where graph neural networks could be used for security tasks but where an attacker could have direct influence on the input representation.
Read #5 - Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities
💾: N/A 📜: arxiv 🏡: Pre-Print
This is a great paper for folks who use LLMs for coding a lot as well as folks who are trying to understand the potential risks/weaknesses of autonomous software development agents.
Commentary: Now this paper is a good one to end on. The authors focus on investigating how frequently models generate code where at least one of the packages/imports is hallucinated across JavaScript, Rust, and Python. The results in the screenshot above show that an insignificant amount of time packages are conjured out of the ether! The authors do not stop there, however, and actually dig into what causes these results across several different axes: a) programming language, b) whether the hallucination was induced or not induced (i.e., adversarial prompt vs. normal prompt), c) whether the model is a pure code model or a generalist model, and d) model size. The results are comprehensive, with the main highlights being a correlation between parameter size and package hallucinations as well as a positive correlation between performance on the HumanEval and lower PHR, suggesting that HumanEval results can be used as a proxy. I wonder if there are any other tasks that can be framed and analysed in this way.
That’s a wrap! Over and out.