🎩 Top 5 Security and AI Reads - Week #11
Repeated token vulnerabilities, LLM finetuning API attack vectors, effective VLM adversarial techniques, autonomous exploitation of adversarial defences, and robustness token defences for transformers
Welcome to the eleventh installment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with a deep dive into the fascinating "Repeated Token Phenomenon" in large language models, where researchers use mechanistic interpretability to identify and mitigate the underlying vulnerability. Next, we discuss the key challenges in protecting LLM finetuning APIs, including how malicious users could exploit proxy outputs to create backdoored models. We then examine a surprisingly effective attack that achieves over a 90% success rate against even the strongest black-box VLMs by cleverly transferring semantic concepts between images. We'll also look at AutoAdvExBench, a new benchmark for evaluating AI agents' ability to defeat adversarial defences, and conclude with promising research on "Robustness Tokens" that enhance transformer models' resilience against adversarial attacks.