Top 5 Security and AI Reads - Week #15
Model stealing optimization, hardware-locked ML models, LLM robot jailbreaks, black-box attack attribution, and diffusion-based steganography.
Welcome to the fifteenth instalment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. This week's papers are all drawn from the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), which I recently attended. We're kicking off with a fascinating study on model stealing attacks that reveals counterintuitive factors affecting their success, showing that attackers benefit from targeting high-performing models and matching the target's training strategy. Next, we explore a novel approach to "locking" machine learning models into specific hardware through cryptographic transformations, providing both hard and soft protection mechanisms against unauthorised use. We then examine the crazy reality of jailbreaking LLM-controlled robots, highlighting practical attack vectors that extend beyond traditional prompt injection worries. Following that, we dive into a framework for shareable and explainable attribution for black-box image-based attacks, which brings traditional cyber threat intelligence principles to AI security through the use of Hidden Markov Models. We wrap up with a Black Magic-esque cryptographic technique that leverages diffusion models to enable secure covert messaging within generated images.

A note on the images - I ask Claude to generate a Stable Diffusion prompt from the titles of the 5 reads, and then I use the Stable Diffusion Large space on Hugging Face to generate the image.
Read #1 - Attackers Can Do Better: Over- and Understated Factors of Model Stealing Attacks
Code: N/A | Paper: arXiv | Venue: IEEE Conference on Secure and Trustworthy Machine Learning 2025
This paper is a grand read for folks who are interested in the threats associated with multi-class classification models that can be accessed by attackers (via an API or similar).
Commentary: I really enjoyed this paper and talk. The paper looks at the effects that target model performance, architecture, dataset and associated hyperparameters have on the performance an attacker can achieve when training a surrogate (or stolen) model. The findings/conclusions of this paper were not obvious to me, and it took a little bit of time for me to reason about them. An excerpt from the conclusion is below:
In particular, we demonstrated that attacks benefit from (i) targeting better-performing models, (ii) adopting an architecture that fits the quality and quantity of the attacker's data, (iii) using the target model's training strategy, (iv) having more complex data, and (v) optimising queries, in particular, for data-free attacks.
In other words, the findings of this paper say that if you want to steal a high-quality multi-class image classifier, you need to:
Target the highest-performing model you can.
Choose an architecture that makes sense for the data you are using to steal the model - focus on what makes sense for the quality and scale of the data you have rather than going straight for the same architecture as the target model.
If the target model is pre-trained, also use a pre-trained model; otherwise, train from scratch.
Make sure the data you are using to collect labels is complex/challenging.
Don't do too many queries, but also don't do too few.
The thing that is interesting here, and probably a future research direction, is how this sort of knowledge/approach could be applied to other model types and setups. The paper uses 3 models + 3 datasets as well as lots of different configurations, which results in a huge number of results/experiments. To do this at a bigger scale (in terms of the number of models but also probably model size), there is a key question of "What experiments are the most useful to run?"
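To make that concrete, below is a minimal sketch (my own, not from the paper) of the kind of query-and-train surrogate loop being studied, written in PyTorch. The query_target function, the 10-class assumption and the FakeData query set are all placeholders; the comments map the choices back to the paper's findings.

```python
# Hypothetical sketch of a surrogate-training ("model stealing") loop.
# query_target stands in for whatever black-box access the attacker has;
# here it just returns random logits so the sketch runs end to end.
import torch
import torch.nn as nn
import torchvision

def query_target(images: torch.Tensor) -> torch.Tensor:
    return torch.randn(images.shape[0], 10)  # placeholder for the victim API

# Findings (ii)/(iii): pick a surrogate architecture that suits the attacker's
# data, and mirror the victim's training strategy (e.g. start pre-trained if
# the victim did).
surrogate = torchvision.models.resnet18(weights="IMAGENET1K_V1")
surrogate.fc = nn.Linear(surrogate.fc.in_features, 10)  # assume a 10-class victim

# Findings (iv)/(v): the query data should be complex/challenging, and the query
# budget is something to tune rather than maximise. FakeData is a placeholder.
attacker_loader = torch.utils.data.DataLoader(
    torchvision.datasets.FakeData(
        size=64, image_size=(3, 224, 224),
        transform=torchvision.transforms.ToTensor()),
    batch_size=16)

optimiser = torch.optim.SGD(surrogate.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

surrogate.train()
for images, _ in attacker_loader:  # the dataset's own labels are ignored
    with torch.no_grad():
        stolen_labels = query_target(images).argmax(dim=1)  # labels come from the victim
    optimiser.zero_grad()
    loss = loss_fn(surrogate(images), stolen_labels)
    loss.backward()
    optimiser.step()
```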
Read #2 - Locking Machine Learning Models into Hardware
Code: N/A | Paper: arXiv | Venue: IEEE Conference on Secure and Trustworthy Machine Learning 2025
This is a grand read for folks that are interested in the intersection of model security, hardware and cryptography. The paper was released as a preprint in 2024, so you may have seen it, but it got accepted to SaTML recently.

Commentary: I really like this paper, and the talk was grand. This paper proposes two ways of "locking" a model: hard and soft locking. In the hard-locking setting, the model is bound to the target hardware (whether that is a collection of hardware models or a specific model). This is done by calculating what is essentially a cryptographic key or seed value that can be reliably derived from the target hardware. This key/seed is then used to seed a transformation that is applied to the weights of the model. In order to get the original model weights back, you need to be able to calculate this seed value and know the transformation used. In the soft-locking case, the idea is basically the same, but the seed/transformation is quantisation or pruning. The key difference is that with hard locking, if you do not know the seed or transformation, you'll end up with a useless model, whereas with soft locking, you can guess the seed/transformation and still end up with an OK model rather than a completely useless one.
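To give a flavour of the hard-locking idea, here is a toy sketch (my own simplification, not the paper's construction): derive a seed from a hardware fingerprint, use it to drive a reversible transformation of the weights, and only hardware that can reproduce that fingerprint can undo it.

```python
# Toy illustration of hard locking: a seed derived from the hardware drives a
# reversible permutation of the weights. The fingerprint string and the
# permutation transform are my own placeholders, not the paper's scheme.
import hashlib
import numpy as np

def hardware_seed(fingerprint: str) -> int:
    # Derive a deterministic seed from something only the target hardware can produce.
    return int.from_bytes(hashlib.sha256(fingerprint.encode()).digest()[:8], "big")

def lock(weights: np.ndarray, seed: int) -> np.ndarray:
    # Permute the flattened weights with a seed-dependent permutation.
    perm = np.random.default_rng(seed).permutation(weights.size)
    return weights.ravel()[perm].reshape(weights.shape)

def unlock(locked: np.ndarray, seed: int) -> np.ndarray:
    # Invert the permutation; only possible if you can re-derive the seed.
    perm = np.random.default_rng(seed).permutation(locked.size)
    out = np.empty(locked.size, dtype=locked.dtype)
    out[perm] = locked.ravel()
    return out.reshape(locked.shape)

weights = np.random.randn(4, 4).astype(np.float32)
seed = hardware_seed("gpu-uuid-1234|sm_90|board-serial-XYZ")  # hypothetical identifiers
assert np.allclose(unlock(lock(weights, seed), seed), weights)
```

With the wrong seed, you just get back a scrambled (useless) set of weights, which is the hard-locking behaviour described above.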
Read #3 - Jailbreaking LLM-Controlled Robots
Code: N/A | Paper: arXiv | Venue: Pre-Print
This is a grand read for folks that are interested in the intersection of agentic tool use and robots. This paper was not presented at SaTML but is a more recent paper from one of the presenters.

Commentary: This paper is the best paper I have read so far that highlights the risks and potential attack vectors of LLM-powered robotic things. The novelty of this paper is not necessarily the attacks themselves but the scenarios they are executed in and the 3 robot targets themselves. I recommend most folks read this one, just to get an interesting take on the whole jailbreak/prompt injection area.
Read #4 - SEA: Shareable and Explainable Attribution for Query-based Black-box Attacks
Code: N/A | Paper: arXiv | Venue: Pre-Print
This is a grand read for folks that are interested in what forensics, cyber threat intelligence and Indicators of Compromise (IoC) sharing could look like for ML/AI attacks.
Commentary: I enjoyed the talk for this paper as well as the paper itself. The paper draws on the traditional cyber attack incident response and reporting processes but seeks to identify what it could look like for black-box image-based attacks. The authors argue that in traditional cyber, we go much further than just detecting and mitigating, but in AI attack detection, we stop there. The details of how the proposed approach works are explained in depth within the paper, but at a very high level, adversarial images are detected, deconstructed and then used to train Hidden Markov Models. Each Hidden Markov Model is a shareable artefact that encodes a particular attack - think of these as similar to malware hashes or YARA rules.
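As a rough illustration of the "HMM as a shareable signature" idea, here is a toy sketch using the hmmlearn library (my choice of tooling; the feature sequences and attack traces are entirely made up): fit one HMM per known attack on sequences extracted from its query traces, then attribute a new trace to whichever shared HMM explains it best.

```python
# Toy sketch: one HMM per known attack acts as a shareable signature, and a new
# query trace is attributed to whichever HMM gives it the highest log-likelihood.
# Feature extraction and data are placeholders, not the SEA pipeline itself.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def fit_signature(traces: list[np.ndarray]) -> hmm.GaussianHMM:
    # Train one HMM on a list of (timesteps, features) query traces for one attack.
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50, random_state=0)
    model.fit(np.vstack(traces), lengths=[len(t) for t in traces])
    return model

# Pretend per-query perturbation statistics for two different attack tools.
attack_a_traces = [rng.normal(0.0, 1.0, size=(40, 4)) for _ in range(5)]
attack_b_traces = [rng.normal(2.0, 0.5, size=(40, 4)) for _ in range(5)]
signatures = {
    "attack_a": fit_signature(attack_a_traces),
    "attack_b": fit_signature(attack_b_traces),
}

# "Attribution" of an unseen trace = best-scoring shared signature.
new_trace = rng.normal(2.0, 0.5, size=(40, 4))
scores = {name: model.score(new_trace) for name, model in signatures.items()}
print(max(scores, key=scores.get))  # expected: attack_b
```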
Now, the use of "attribution" in the title is a bit odd. The authors have examples within the paper and presentation whereby these models can go as far as sometimes detecting the specific attack library used and sometimes even the version. This is very cool but also a very weak signal for attribution - an example of this in the traditional cyber setting would be trying to attribute a threat actor using the version of Cobalt Strike Beacon you ripped off a host. This approach attributes attacks rather than attackers/actors and is therefore more similar to something like YARA rules, but with the added bonus that HMMs encode TTPs.
I have added this to the "If cited, notify me" list and look forward to seeing how the ideas within this paper are built upon and extended.
Read #5 - Provably Secure Covert Messaging Using Image-based Diffusion Processes
Code: N/A | Paper: arXiv | Venue: Pre-Print
This paper is a grand read for folks interested in cryptography and code cracking.
Commentary: I will start by saying I am not 100% sure how this actually works, and even after speaking to the author during the poster session, I still think this is akin to black magic. The general idea is that you can have a (short) message which is then encoded deterministically into a generated image by using the latent space of the diffusion model. There are a few constraints, such as the same diffusion model needing to be used by both sender and receiver, as well as some others I have forgotten. Regardless, I still think it's awesome!
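For intuition only, here is a numpy toy (emphatically not the paper's construction) that captures the basic trick: the message hides inside an otherwise normal-looking latent. In the real scheme, that latent would seed a deterministic diffusion sampler, and the receiver would invert the generated image back to the latent with the same model before reading the bits out.

```python
# Toy illustration: message bits choose the signs of an otherwise Gaussian-looking
# latent. The shared key and the sign-embedding are placeholders for intuition only;
# the actual paper builds a provably secure scheme on a diffusion model's latent space.
import numpy as np

def encode(bits: list[int], key: int, dim: int) -> np.ndarray:
    rng = np.random.default_rng(key)                    # key shared by sender and receiver
    latent = np.abs(rng.standard_normal(dim))           # magnitudes look Gaussian
    latent[: len(bits)] *= np.where(np.array(bits) == 1, 1.0, -1.0)  # bits hide in signs
    latent[len(bits):] *= rng.choice([-1.0, 1.0], size=dim - len(bits))
    return latent  # in the real scheme this would seed deterministic image generation

def decode(latent: np.ndarray, n_bits: int) -> list[int]:
    # The receiver would first invert the image back to this latent using the same model.
    return [int(value > 0) for value in latent[:n_bits]]

message = [1, 0, 1, 1, 0, 0, 1, 0]
latent = encode(message, key=42, dim=64)
assert decode(latent, len(message)) == message
```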
On the off chance this paper does not feel like black magic to you and instead seems almost obvious, would you mind writing a noob's guide to this entire area, please?
That's a wrap! Over and out.