🔗 Link to the paper

This study lays bare an unsettling truth: aligned LLMs are not actually aligned at all, at least not in the face of adversarial attacks. Their tendency to generate harmful content hasn’t been eliminated, only slightly reduced, and it takes just a small amount of focused tampering to draw out exactly the bad behaviors that alignment training is meant to prevent.

The sophistication of these attacks - automated, transferable, prompt-agnostic - represents a quantum leap over previous manual jailbreaking efforts. It’s a shot across the bow for the notion that we can sand down the rough edges on massive models trained on Internet detritus and end up with a safe, reliable, and ethical system. Perhaps it was always thus.

We can hardly fault the researchers for disclosing these findings in full. The curtain has been torn back, the wizard revealed. And the techniques, whilst advanced, build straightforwardly on established work. Like a magician explaining his trick, the authors dispel the illusion of safety.

What now for alignment training and content moderation? We’re reminded how dynamic and adversarial this problem space is: any fragility or overconfidence is quickly exposed. The ease of this jailbreak underscores the need for more fundamental solutions - training methodologies and architectures designed from the ground up to yield only beneficial behaviors aligned with human values. Otherwise we risk papering over the cracks whilst the objectionable capabilities beneath remain intact. Tread carefully indeed.