Top 5 Security and AI Reads - Week #15
Model stealing optimization, hardware-locked ML models, LLM robot jailbreaks, black-box attack attribution, and diffusion-based steganography.
Welcome to the fifteenth instalment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. This week's papers are all drawn from the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), which I recently attended. We're kicking off with a fascinating study on model stealing attacks that reveals counterintuitive factors affecting their success, showing that attackers benefit from targeting high-performing models and matching the target's training strategy. Next, we explore a novel approach to "locking" machine learning models into specific hardware through cryptographic transformations, providing both hard and soft protection mechanisms against unauthorised use. We then examine the crazy reality of jailbreaking LLM-controlled robots, highlighting practical attack vectors that extend beyond traditional prompt injection worries. Following that, we dive into a framework for shareable and explainable attribution for black-box image-based attacks, which brings traditional cyber threat intelligence principles to AI security through the use of Hidden Markov Models. We wrap up with a Black Magic-esque cryptographic technique that leverages diffusion models to enable secure covert messaging within generated images.

A note on the images - I ask Claude to generate a Stable Diffusion prompt from the titles of the 5 reads, and then I use the Stable Diffusion Large space on Hugging Face to generate the image.
Read #1 - Attackers Can Do Better: Over- and Understated Factors of Model Stealing Attacks
Code: N/A | Paper: arXiv | Venue: IEEE Conference on Secure and Trustworthy Machine Learning 2025
This paper is a grand read for folks who are interested in the threats associated with multi-class classification models that can be accessed by attackers (via an API or similar).
Commentary: I really enjoyed this paper and talk. The paper looks at the effects that target model performance, architecture, dataset and associated hyperparameters have on the performance an attacker can achieve when training a surrogate (or stolen) model. The findings/conclusions of this paper were not obvious to me, and it took a little bit of time for me to reason about them. An excerpt from the conclusion is below:
In particular, we demonstrated that attacks benefit from (i) targeting better-performing models, (ii) adopting an architecture that fits the quality and quantity of the attacker's data, (iii) using the target model's training strategy, (iv) having more complex data, and (v) optimising queries, in particular, for data-free attacks.
In other words, the findings of this paper say that if you want to steal a high-quality multi-class image classifier, you need to:
Target the highest-performing model you can.
Choose an architecture that makes sense for the data you are using to steal the model - focus on what makes sense for the quality and scale of the data you have rather than going straight for the same architecture as the target model.
If the target model is pre-trained, also use a pre-trained model; otherwise, train from scratch.
Make sure the data you are using to collect labels is complex/challenging.
Don't do too many queries, but also don't do too few.
The thing that is interesting here, and probably a future research direction, is how this sort of knowledge/approach could be applied to other model types and setups. The paper uses 3 models + 3 datasets as well as lots of different configurations, which results in a huge number of results/experiments. To do this at a bigger scale (in terms of the number of models but also probably model size), there is a key question of "What experiments are the most useful to run?"
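To make that concrete, below is a minimal sketch (my own, not from the paper) of the kind of query-and-train surrogate loop being studied, written in PyTorch. The query_target function, the 10-class assumption and the FakeData query set are all placeholders; the comments map the choices back to the paper's findings.

```python
# Hypothetical sketch of a surrogate-training ("model stealing") loop.
# query_target stands in for whatever black-box access the attacker has;
# here it just returns random logits so the sketch runs end to end.
import torch
import torch.nn as nn
import torchvision

def query_target(images: torch.Tensor) -> torch.Tensor:
    return torch.randn(images.shape[0], 10)  # placeholder for the victim API

# Findings (ii)/(iii): pick a surrogate architecture that suits the attacker's
# data, and mirror the victim's training strategy (e.g. start pre-trained if
# the victim did).
surrogate = torchvision.models.resnet18(weights="IMAGENET1K_V1")
surrogate.fc = nn.Linear(surrogate.fc.in_features, 10)  # assume a 10-class victim

# Findings (iv)/(v): the query data should be complex/challenging, and the query
# budget is something to tune rather than maximise. FakeData is a placeholder.
attacker_loader = torch.utils.data.DataLoader(
    torchvision.datasets.FakeData(
        size=64, image_size=(3, 224, 224),
        transform=torchvision.transforms.ToTensor()),
    batch_size=16)

optimiser = torch.optim.SGD(surrogate.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

surrogate.train()
for images, _ in attacker_loader:  # the dataset's own labels are ignored
    with torch.no_grad():
        stolen_labels = query_target(images).argmax(dim=1)  # labels come from the victim
    optimiser.zero_grad()
    loss = loss_fn(surrogate(images), stolen_labels)
    loss.backward()
    optimiser.step()
```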
Read #2 - Locking Machine Learning Models into Hardware
Code: N/A | Paper: arXiv | Venue: IEEE Conference on Secure and Trustworthy Machine Learning 2025
This is a grand read for folks that are interested in the intersection of model security, hardware and cryptography. The paper was released as a preprint in 2024, so you may have seen it, but it got accepted to SaTML recently.

Commentary: I really like this paper, and the talk was grand. This paper proposes two ways of "locking" a model: hard and soft locking. In the hard-locking setting, the model is bound to the target hardware (whether that is a collection of hardware models or a specific model). This is done by calculating what is essentially a cryptographic key or seed value that can be reliably derived from the target hardware. This key/seed is then used to seed a transformation that is applied to the weights of the model. In order to get the original model weights back, you need to be able to calculate this seed value and know the transformation used. In the soft-locking case, the idea is basically the same, but the seed/transformation is quantisation or pruning. The key difference is that with hard locking, if you do not know the seed or transformation, you'll end up with a useless model, whereas with soft locking, you can guess the seed/transformation and still end up with an OK model rather than a completely useless one.
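To give a flavour of the hard-locking idea, here is a toy sketch (my own simplification, not the paper's construction): derive a seed from a hardware fingerprint, use it to drive a reversible transformation of the weights, and only hardware that can reproduce that fingerprint can undo it.

```python
# Toy illustration of hard locking: a seed derived from the hardware drives a
# reversible permutation of the weights. The fingerprint string and the
# permutation transform are my own placeholders, not the paper's scheme.
import hashlib
import numpy as np

def hardware_seed(fingerprint: str) -> int:
    # Derive a deterministic seed from something only the target hardware can produce.
    return int.from_bytes(hashlib.sha256(fingerprint.encode()).digest()[:8], "big")

def lock(weights: np.ndarray, seed: int) -> np.ndarray:
    # Permute the flattened weights with a seed-dependent permutation.
    perm = np.random.default_rng(seed).permutation(weights.size)
    return weights.ravel()[perm].reshape(weights.shape)

def unlock(locked: np.ndarray, seed: int) -> np.ndarray:
    # Invert the permutation; only possible if you can re-derive the seed.
    perm = np.random.default_rng(seed).permutation(locked.size)
    out = np.empty(locked.size, dtype=locked.dtype)
    out[perm] = locked.ravel()
    return out.reshape(locked.shape)

weights = np.random.randn(4, 4).astype(np.float32)
seed = hardware_seed("gpu-uuid-1234|sm_90|board-serial-XYZ")  # hypothetical identifiers
assert np.allclose(unlock(lock(weights, seed), seed), weights)
```

With the wrong seed, you just get back a scrambled (useless) set of weights, which is the hard-locking behaviour described above.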
Read #3 - Jailbreaking LLM-Controlled Robots
Code: N/A | Paper: arXiv | Venue: Pre-Print
This is a grand read for folks that are interested in the intersection of agentic tool use and robots. This paper was not presented at SaTML but is a more recent paper from one of the presenters.

Commentary: This paper is the best paper I have read so far that highlights the risks and potential attack vectors of LLM-powered robotic things. The novelty of this paper is not necessarily the attacks themselves but the scenarios they are executed in and the 3 robot targets themselves. I recommend most folks read this one, just to get an interesting take on the whole jailbreak/prompt injection area.
Read #4 - SEA: Shareable and Explainable Attribution for Query-based Black-box Attacks
Code: N/A | Paper: arXiv | Venue: Pre-Print
This is a grand read for folks that are interested in what forensics, cyber threat intelligence and Indicators of Compromise (IoC) sharing could look like for ML/AI attacks.
Commentary: I enjoyed the talk for this paper as well as the paper itself. The paper draws on the traditional cyber attack incident response and reporting processes but seeks to identify what it could look like for black-box image-based attacks. The authors argue that in traditional cyber, we go much further than just detecting and mitigating, but in AI attack detection, we stop there. The details of how the proposed approach works are explained in depth within the paper, but at a very high level, adversarial images are detected, deconstructed and then used to train Hidden Markov Models. Each Hidden Markov Model is a shareable artefact that encodes a particular attack - think of these as similar to malware hashes or YARA rules.
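As a rough illustration of the "HMM as a shareable signature" idea, here is a toy sketch using the hmmlearn library (my choice of tooling; the feature sequences and attack traces are entirely made up): fit one HMM per known attack on sequences extracted from its query traces, then attribute a new trace to whichever shared HMM explains it best.

```python
# Toy sketch: one HMM per known attack acts as a shareable signature, and a new
# query trace is attributed to whichever HMM gives it the highest log-likelihood.
# Feature extraction and data are placeholders, not the SEA pipeline itself.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def fit_signature(traces: list[np.ndarray]) -> hmm.GaussianHMM:
    # Train one HMM on a list of (timesteps, features) query traces for one attack.
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50, random_state=0)
    model.fit(np.vstack(traces), lengths=[len(t) for t in traces])
    return model

# Pretend per-query perturbation statistics for two different attack tools.
attack_a_traces = [rng.normal(0.0, 1.0, size=(40, 4)) for _ in range(5)]
attack_b_traces = [rng.normal(2.0, 0.5, size=(40, 4)) for _ in range(5)]
signatures = {
    "attack_a": fit_signature(attack_a_traces),
    "attack_b": fit_signature(attack_b_traces),
}

# "Attribution" of an unseen trace = best-scoring shared signature.
new_trace = rng.normal(2.0, 0.5, size=(40, 4))
scores = {name: model.score(new_trace) for name, model in signatures.items()}
print(max(scores, key=scores.get))  # expected: attack_b
```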
Now, the use of "attribution" in the title is a bit odd. The authors have examples within the paper and presentation whereby these models can go as far as sometimes detecting the specific attack library used and sometimes even the version. This is very cool but also a very weak signal for attribution - an example of this in the traditional cyber setting would be trying to attribute a threat actor using the version of Cobalt Strike Beacon you ripped off a host. This approach attributes attacks rather than attackers/actors and is therefore more similar to something like YARA rules, but with the added bonus that HMMs encode TTPs.
I have added this to the "If cited, notify me" list and look forward to seeing how the ideas within this paper are built upon and extended.
Read #5 - Provably Secure Covert Messaging Using Image-based Diffusion Processes
Code: N/A | Paper: arXiv | Venue: Pre-Print
This paper is a grand read for folks interested in cryptography and code cracking.
Commentary: I will start by saying I am not 100% sure how this actually works, and even after speaking to the author during the poster session, I still think this is akin to black magic. The general idea is that you can have a (short) message which is then encoded deterministically into a generated image by using the latent space of the diffusion model. There are a few constraints, such as the same diffusion model needing to be used by both sender and receiver, as well as some others I have forgotten. Regardless, I still think it's awesome!
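For intuition only, here is a numpy toy (emphatically not the paper's construction) that captures the basic trick: the message hides inside an otherwise normal-looking latent. In the real scheme, that latent would seed a deterministic diffusion sampler, and the receiver would invert the generated image back to the latent with the same model before reading the bits out.

```python
# Toy illustration: message bits choose the signs of an otherwise Gaussian-looking
# latent. The shared key and the sign-embedding are placeholders for intuition only;
# the actual paper builds a provably secure scheme on a diffusion model's latent space.
import numpy as np

def encode(bits: list[int], key: int, dim: int) -> np.ndarray:
    rng = np.random.default_rng(key)                    # key shared by sender and receiver
    latent = np.abs(rng.standard_normal(dim))           # magnitudes look Gaussian
    latent[: len(bits)] *= np.where(np.array(bits) == 1, 1.0, -1.0)  # bits hide in signs
    latent[len(bits):] *= rng.choice([-1.0, 1.0], size=dim - len(bits))
    return latent  # in the real scheme this would seed deterministic image generation

def decode(latent: np.ndarray, n_bits: int) -> list[int]:
    # The receiver would first invert the image back to this latent using the same model.
    return [int(value > 0) for value in latent[:n_bits]]

message = [1, 0, 1, 1, 0, 0, 1, 0]
latent = encode(message, key=42, dim=64)
assert decode(latent, len(message)) == message
```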
On the off chance this paper does not feel like black magic to you and instead seems almost obvious, would you mind writing a noob's guide to this entire area, please?
That's a wrap! Over and out.