6 comments

  • glitchc 3 hours ago
    Very interesting. The state management is the really insightful find here.
    I always wondered how these large AI companies managed access for millions of simultaneous users without having to allocate a dedicated LLM instance for each user. Pushing the complete state down to the user after every call makes perfect sense. The LLM itself stays memoryless and ready to respond to an arbitrary prompt. Very nice.
    • geocar 3 hours ago
      N.B. This is *exactly* how seaside, vba, and even arc[1] do server-side state *generally*: by encrypting the blob-representing-state and sending it to the client to be sent back on future requests (where it will be decrypted and rehydrated).
      It's an old trick that everyone designing protocols should know, since there are *lots* of applications beyond AI companies.
      [1]: As in, pg's lisp: https://arclanguage.github.io/ref/srv.html#:~:text=The%20previous%20section
      • tn1 21 minutes ago
        And don't forget the venerable .NET Forms with its kilobytes of __VIEWSTATE
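The round trip discussed in this sub-thread (server packs state, client holds it, server rehydrates it on the next request) can be sketched in a few lines. This is a minimal illustration, not how any of the systems named above actually do it. Note one swap: geocar describes *encrypting* the blob, while this sketch only *signs* it with HMAC-SHA256, because Python's standard library ships no authenticated cipher. The client can therefore read the state but cannot alter it undetected. `SERVER_KEY`, `pack_state`, `unpack_state`, and `handle_request` are hypothetical names invented for the example.

```python
import base64
import hashlib
import hmac
import json
import secrets
from typing import Optional, Tuple

# Hypothetical server-side secret; it never leaves the server.
SERVER_KEY = secrets.token_bytes(32)
TAG_LEN = 32  # SHA-256 digest size in bytes


def pack_state(state: dict) -> str:
    """Serialize state and prepend an HMAC tag so the client cannot alter it."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    tag = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode("ascii")


def unpack_state(blob: str) -> dict:
    """Verify the tag and rehydrate the state; raise ValueError on tampering."""
    raw = base64.urlsafe_b64decode(blob.encode("ascii"))
    tag, payload = raw[:TAG_LEN], raw[TAG_LEN:]
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("state blob failed verification")
    return json.loads(payload)


def handle_request(blob: Optional[str], message: str) -> Tuple[str, str]:
    """Stateless handler: everything it knows arrives inside the blob."""
    state = unpack_state(blob) if blob else {"history": []}
    state["history"].append(message)
    reply = f"seen {len(state['history'])} message(s)"
    return reply, pack_state(state)


# The client keeps the blob between calls; the server keeps nothing.
reply1, blob1 = handle_request(None, "hello")
reply2, blob2 = handle_request(blob1, "again")
print(reply1)  # seen 1 message(s)
print(reply2)  # seen 2 message(s)
```

If the state must also be hidden from the client, the same shape applies with an authenticated encryption scheme from a vetted third-party library in place of the bare HMAC.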
    • b65e8bee43c2ed0 1 hour ago
      the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:
      > An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.
      262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
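The arithmetic behind that comparison can be checked directly. A rough sketch using only the figures quoted in the comment, under three assumptions that are mine rather than the commenter's: "262K" means 2**18 tokens, "56 GB" means decimal gigabytes, and the plaintext is one byte per character.

```python
CONTEXT_TOKENS = 262_144       # "262K" read as 2**18 tokens (assumption)
CHARS_PER_TOKEN = 5            # the comment's rough estimate
BYTES_PER_CHAR = 1             # assumes ASCII plaintext (assumption)
KV_CACHE_BYTES = 56 * 10**9    # "56 GB" read as decimal gigabytes (assumption)


def plaintext_bytes(tokens: int) -> int:
    """Size of the context if stored as plain text."""
    return tokens * CHARS_PER_TOKEN * BYTES_PER_CHAR


def kv_bytes_per_token(tokens: int) -> float:
    """Average KV-cache memory attributable to one token of context."""
    return KV_CACHE_BYTES / tokens


text_size = plaintext_bytes(CONTEXT_TOKENS)
per_token = kv_bytes_per_token(CONTEXT_TOKENS)
blowup = KV_CACHE_BYTES / text_size

print(f"plaintext:        {text_size / 1e6:.2f} MB")   # about 1.31 MB
print(f"KV cache / token: {per_token / 1e3:.0f} KB")   # about 214 KB
print(f"KV cache / text:  {blowup:,.0f}x")             # about 42,725x
```

Under these assumptions each token costs roughly 214 KB of KV cache against about 5 bytes of text, which is the "brutal exchange rate" the comment points at.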
  • Groxx 3 hours ago
    One possible use for the "replay across accounts": if you can get a reasoning block that jailbreaks the model, you could share that block *without sharing how you did it*, and others can immediately take advantage of it too.
    • denysvitali 1 hour ago
      Not necessarily for the "without sharing" part, but to increase the reliability of the jailbreak. The same prompt isn't guaranteed to return the same result, but combining the internal thinking with the prompt might be a more effective way.
  • hhh 1 hour ago
    Awesome write-up. Seems like a great way to play with model responses now that prefill is gone.
  • Retr0id 5 hours ago
    Very cool idea to use thinking duration (either in tokens or in wall time) as a side-channel!
  • Reubend 4 hours ago
    Super cool side channel attack. I tend to agree that it's pretty impractical, but it's such a fun discovery!