3 comments

  • sueszli 47 days ago
    woah, this got way more attention than i expected. thanks a lot.

    if you are interested in the technical details, the design specs are here: https://github.com/sueszli/autograd.c/blob/main/docs/design.md

    if you are working on similar mlsys or compiler-style projects and think there could be overlap, please reach out: https://sueszli.github.io/
  • PartiallyTyped 47 days ago
    Any reason for creating a new tensor when accumulating grads instead of updating the existing one?

    Edit: I asked this before I read the design decisions. The reasoning, as far as I understand, is that for simplicity there are no in-place operations, hence accumulation is done into a new tensor.
    • sueszli 47 days ago
      yeah, exactly. it's for explicit ownership transfer. you always own what you receive, sum it, release both inputs, done. no mutation tracking, no aliasing concerns.

      https://github.com/sueszli/autograd.c/blob/main/src/autograd.c#L170

      i wonder whether there is a more clever way to do this without sacrificing simplicity.
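
      a minimal sketch of what that accumulation step could look like (illustrative names like tensor_t / tensor_add / tensor_free, not the repo's actual API):

          #include <stdlib.h>

          typedef struct { float *data; size_t len; } tensor_t;

          /* allocate a fresh tensor holding the elementwise sum */
          static tensor_t *tensor_add(const tensor_t *a, const tensor_t *b) {
              tensor_t *out = malloc(sizeof *out);
              out->len = a->len;
              out->data = malloc(out->len * sizeof *out->data);
              for (size_t i = 0; i < out->len; i++)
                  out->data[i] = a->data[i] + b->data[i];
              return out;
          }

          static void tensor_free(tensor_t *t) { free(t->data); free(t); }

          /* caller hands over ownership of both inputs and receives ownership
             of the sum: neither input is mutated, so there is no aliasing to
             reason about, and both inputs are released right here */
          static tensor_t *grad_accumulate(tensor_t *acc, tensor_t *incoming) {
              tensor_t *sum = tensor_add(acc, incoming);
              tensor_free(acc);
              tensor_free(incoming);
              return sum;
          }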
  • spwa4 47 days ago
    Cool. But this makes me wonder: this negates most of the advantages of C. Is there a compiler-autograd "library"? Something that would compile into C specifically, to execute as fast as possible on CPUs with no indirection at all.
    • thechao 47 days ago
      At best you'd be restricted to forward mode, which would still double stack pressure. If you needed reverse mode you'd need 2x stack, and the backward sweep over the stack-based tape would have a nearly perfectly suboptimal "grain". If you allow the higher-order operators (both pushforward and pullback), you're going to end up with Jacobians & Hessians over nontrivial blocks. That's going to need the heap. It's still better than an unbounded loop tape, though.

      We had all these issues back in 2006 when my group was implementing autograd for C++ and, later, a computer algebra system called Axiom. We knew it'd be ideal for NNs; I was trying to build this out for my brother, who was porting AI models to GPUs. (This did not work in 2006 for both HW & math reasons.)
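
      As a concrete illustration of the stack-only forward-mode case: a dual number carries the value and the derivative side by side, which is exactly where the doubled stack pressure comes from. A minimal sketch, not tied to any particular library:

          #include <math.h>
          #include <stdio.h>

          /* forward mode: every intermediate carries (value, derivative) on the stack */
          typedef struct { double val, dot; } dual;

          static dual d_mul(dual a, dual b) {
              return (dual){ a.val * b.val, a.dot * b.val + a.val * b.dot };
          }
          static dual d_sin(dual a) {
              return (dual){ sin(a.val), cos(a.val) * a.dot };
          }

          /* f(x) = x * sin(x) */
          static dual f(dual x) { return d_mul(x, d_sin(x)); }

          int main(void) {
              dual y = f((dual){ 2.0, 1.0 });  /* seed dot = 1 to get df/dx */
              printf("f(2) = %f, f'(2) = %f\n", y.val, y.dot);  /* f'(x) = sin(x) + x*cos(x) */
              return 0;
          }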
      • spwa4 47 days ago
        Why not recompile every iteration? Weights are only updated at the end of a batch at the earliest (for distributed training, at the end of n batches at the earliest), and generally only at the end of an iteration. In either case the cost of recompiling would be negligible, no?
        • thechao 47 days ago
          You'd pay the cost of the core computation O(n) times. Matrix products under the derivative fibration (jet; whatever your algebra calls it) are just more matrix products. A good-sized NN is already in the heap. Also, the hard part is finding the ideal combination of fwd vs rev transforms (it's NP-hard). This is similar to the complexity of finding the ideal sub-block matrix-multiply orchestration.

          So, the killer cost *is* at compile time, not runtime, which is fundamental to the underlying autograd operation.

          On the flip side, it's 2025, not 2006, so modern algorithms & heuristics can change this story quite a bit.

          All of this is spelled out in Griewank's work (the book).
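
          To make "matrix products under the derivative are just more matrix products" concrete: the forward-mode (JVP) rule for C = A*B is dC = dA*B + A*dB, so each primal matmul spawns two more matmuls of the same size. Illustrative sketch:

              /* primal: C = A*B for square n x n row-major matrices */
              static void matmul(int n, const double *a, const double *b, double *c) {
                  for (int i = 0; i < n; i++)
                      for (int j = 0; j < n; j++) {
                          double s = 0.0;
                          for (int k = 0; k < n; k++) s += a[i*n + k] * b[k*n + j];
                          c[i*n + j] = s;
                      }
              }

              /* tangent of C = A*B: dC = dA*B + A*dB (tmp is n*n scratch) */
              static void matmul_jvp(int n, const double *a, const double *da,
                                     const double *b, const double *db,
                                     double *dc, double *tmp) {
                  matmul(n, da, b, dc);   /* dA*B */
                  matmul(n, a, db, tmp);  /* A*dB */
                  for (int i = 0; i < n * n; i++) dc[i] += tmp[i];
              }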
          • spwa4 47 days ago
            This one? https://epubs.siam.org/doi/book/10.1137/1.9780898717761
            • thechao 46 days ago
              Yep. You can find used copies at some online places; Powell's in Portland (online store) sometimes has it for $25 or $30.
    • attractivechaos 47 days ago
      > *Is there a compiler-autograd "library"?*

      Do you mean the method theano is using? Anyway, the performance bottleneck often lies in matrix multiplication or 2D-CNN (which can be reduced to matmul). Compiler autograd wouldn't save much time.
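
      The "reduced to matmul" step is the usual im2col trick: unroll each receptive field into a column so the convolution becomes one matrix product. A minimal single-channel, stride-1, no-padding sketch (illustrative, not from any library):

          /* im2col for a single-channel H x W image and a k x k kernel:
             each k x k patch becomes one column, so the convolution collapses
             into a (1 x k*k) * (k*k x out_h*out_w) matmul with the flattened kernel. */
          static void im2col(const float *img, int H, int W, int k, float *cols) {
              int out_h = H - k + 1, out_w = W - k + 1;
              for (int y = 0; y < out_h; y++)
                  for (int x = 0; x < out_w; x++)
                      for (int ky = 0; ky < k; ky++)
                          for (int kx = 0; kx < k; kx++)
                              cols[(ky * k + kx) * (out_h * out_w) + (y * out_w + x)] =
                                  img[(y + ky) * W + (x + kx)];
          }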
    • marcthe12 47 days ago
      We would need to mirror the JAX architecture more closely, since JAX is essentially JIT-based. Basically you need a good way to convert the computational graph to machine code, while also performing a set of operations on the graph at compile time.
    • justinnk 47 days ago
      I believe Enzyme comes close to what you describe. It works on the LLVM IR level.

      https://enzyme.mit.edu
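
      For reference, the basic C usage pattern looks roughly like this; it needs clang with the Enzyme LLVM plugin loaded, and the snippet is a sketch from memory, so check the Enzyme docs for the exact flags and signature:

          #include <stdio.h>

          double square(double x) { return x * x; }

          /* Enzyme recognizes calls to __enzyme_autodiff and differentiates the
             function passed as the first argument at the LLVM IR level,
             so there is no runtime tape or graph object. */
          extern double __enzyme_autodiff(void *, double);

          int main(void) {
              double x = 3.0;
              double dx = __enzyme_autodiff((void *)square, x);
              printf("d/dx x^2 at %.1f = %.1f\n", x, dx);  /* expect 6.0 */
              return 0;
          }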
    • sueszli 47 days ago
      a heap-free implementation could be a really cool direction to explore. thanks!

      i think you might be interested in MLIR/IREE: https://github.com/openxla/iree