🎩 Top 5 Security and AI Reads - Week #12
Algebraic explainability attacks, benchmark contamination mitigations, LLM evaluation inconsistencies, efficient model inversion, and targeted image protection
Welcome to the twelfth installment of the Stats and Bytes Top 5 Security and AI Reads weekly newsletter. We're kicking off with an exploration of algebraic adversarial attacks on explainability, where researchers reframe the attack from a constrained-optimisation problem into an algebraic one targeting model interpretability. Next, we examine a study questioning current LLM benchmark contamination mitigation strategies, which finds that none performs statistically better than applying no mitigation at all. We then dive into the inconsistencies of LLM evaluation on multiple-choice questions, comparing different answer-extraction methods and the systematic errors they introduce. Following that, we look at an efficient black-box model inversion attack that achieves impressive results with just 5% of the queries required by current SOTA approaches. We conclude with TarPro, an innovative method for targeted protection against malicious image editing that blocks NSFW modifications while allowing normal edits to proceed.