2 Comments

Okay, just to make sure I'm understanding: the gist here is that for small-batch inference you don't mind gathering the expert weights to each token, whereas for training you want to route tokens to experts without gathering the expert weights at all.
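
In code, I picture the distinction something like this toy sketch (all names and shapes are made up for illustration, assuming a top-1 router and per-expert weight matrices):

```python
import numpy as np

# Hypothetical toy shapes: E experts, T tokens, model dims d -> f.
E, T, d, f = 8, 4, 16, 32
rng = np.random.default_rng(0)
W = rng.normal(size=(E, d, f))           # per-expert weight matrices
x = rng.normal(size=(T, d))              # token activations
expert_ids = rng.integers(0, E, size=T)  # router's top-1 expert per token

# (a) Small-batch decode: gather each token's expert weights to the token.
# Fine when T is small, since you only read T weight matrices.
y_gather = np.einsum("td,tdf->tf", x, W[expert_ids])

# (b) Training / large batch: route tokens to experts instead.
# Group tokens by expert and run one dense matmul per expert,
# never materializing a per-token copy of the weights.
y_route = np.empty((T, f))
for e in range(E):
    mask = expert_ids == e
    if mask.any():
        y_route[mask] = x[mask] @ W[e]

assert np.allclose(y_gather, y_route)
```

Same math either way; the question is just whether weights move to tokens (a) or tokens move to weights (b).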

Don't you get a smaller version of the same problem for large prefills at inference time, though?
