1 comment

  • zozbot234 42 minutes ago
    > Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts

    Whether this is true depends on what you mean by small. In general, AIUI you don't need more than a handful of experts to get a meaningful probability of overlap. DeepSeek V4 Pro is an exceptionally sparse model, and even there you start to get meaningful overlap at a batch size of 5 or more. Moreover, if you assume each token routes to its experts uniformly and independently, the expected number of distinct experts activated by a batch of size b is n(1 - (1 - k/n)^b), where k is the number of active experts per token and n is the total number of experts.
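
To make the formula above concrete, here is a minimal Python sketch that evaluates n(1 - (1 - k/n)^b) and checks it against a small Monte Carlo simulation. It assumes every token picks k distinct experts uniformly at random, independently of the other tokens; real routers are learned and correlated, so treat this as a rough baseline rather than a description of any particular model. The function names and the values n = 256, k = 8 are hypothetical, chosen only for illustration.

```python
import random


def expected_active_experts(n: int, k: int, b: int) -> float:
    """Expected number of distinct experts hit by a batch of b tokens.

    Assumes uniform, independent routing: a given expert is missed by one
    token with probability 1 - k/n, and by all b tokens with (1 - k/n)**b.
    """
    return n * (1.0 - (1.0 - k / n) ** b)


def simulate_active_experts(n: int, k: int, b: int,
                            trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity under the same assumption."""
    rng = random.Random(seed)
    total = 0
    for _trial in range(trials):
        active = set()
        for _token in range(b):
            # Each token selects k distinct experts out of n, uniformly.
            active.update(rng.sample(range(n), k))
        total += len(active)
    return total / trials


if __name__ == "__main__":
    n, k = 256, 8  # hypothetical expert counts, not taken from any real model
    for b in (1, 2, 5, 16, 64):
        formula = expected_active_experts(n, k, b)
        simulated = simulate_active_experts(n, k, b)
        # b * k is what you would load if no two tokens ever shared an expert.
        print(f"b={b:3d}  no-overlap={b * k:4d}  "
              f"formula={formula:7.2f}  simulated={simulated:7.2f}")
```

The gap between the no-overlap column and the formula column is the amount of expert sharing the batch gets under this idealised routing assumption.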