3 comments

  • wood_spirit11 minutes ago
    Intriguing and very cunning attack! So obvious in hindsight!

    It makes me wonder how DeepSeek avoids commenting politically on China. I have heard anecdotes that it will be writing out a long reply, then presumably generates some forbidden phrase, abandons the output, and replaces it all with an error message. So presumably the safeguard could be a separate, trivial, non-LLM-based post-filter, which would make it immune to the doublespeak attack?
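
A minimal sketch of the kind of non-LLM post-filter speculated about above, assuming the reply arrives as streamed text chunks. Everything here is hypothetical: the blocklist, the error text, and the `filtered_reply` name are illustrative and do not describe any real system's implementation.

```python
from typing import Iterable

# Hypothetical blocklist and error text; neither comes from any real system.
FORBIDDEN_PHRASES = ("forbidden phrase",)
ERROR_MESSAGE = "Sorry, I can't help with that."


def filtered_reply(chunks: Iterable[str]) -> str:
    """Accumulate streamed chunks and discard the whole reply on a match."""
    shown = ""
    for chunk in chunks:
        shown += chunk
        # Check the accumulated text so phrases split across chunks still match.
        lowered = shown.lower()
        if any(phrase in lowered for phrase in FORBIDDEN_PHRASES):
            # Abandon everything written so far and return only the error.
            return ERROR_MESSAGE
    return shown
```

The check is a plain substring match on the output text, so it only fires when a listed phrase literally appears in the reply, regardless of how the prompt was worded.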
  • acjohnson5541 minutes ago
    These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.
  • measurablefunc55 minutes ago
    This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original network, but designed to reject all "unsafe" queries, such as ones that hijack the context by disguising bomb construction as stories about fruit.