2 Comments

Okay, so just to make sure I'm understanding: the gist here is that for small-batch inference you don't mind gathering the expert weights for each token, whereas for training you want to route tokens to experts without gathering the experts at all.

Don't you get a small version of the same problem for large prefills at inference time though?

Author · Feb 26 (edited Feb 27)

> Don't you get a small version of the same problem for large prefills at inference time though?

Yep! This code will not be fast for prefill either. One TODO we have is to use a different implementation for prefill. Luckily, for prefill the overhead is not a massive deal.
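For anyone following along, here's a minimal NumPy sketch of the distinction being discussed (not the post's actual code; the names `decode_style`, `prefill_style`, the top-1 routing, and the shapes are all made up for illustration): per-token gathering of expert weights is fine when there are only a few tokens in flight (decode), while grouping tokens by expert avoids that gather and is what you'd want for prefill or training, where many tokens hit the layer at once.

```python
# Hypothetical sketch contrasting the two MoE execution strategies.
import numpy as np

num_experts, d_model, d_ff = 8, 16, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((num_experts, d_model, d_ff))  # per-expert weights

def decode_style(x, expert_ids):
    # Gather each token's expert weights and multiply, one token at a time.
    # Cheap when x has only a handful of rows; wasteful for large prefills.
    return np.stack([x[i] @ W[e] for i, e in enumerate(expert_ids)])

def prefill_style(x, expert_ids):
    # Group tokens by expert so each expert does one batched matmul,
    # avoiding a per-token gather of its weight matrix.
    out = np.empty((x.shape[0], d_ff))
    for e in range(num_experts):
        idx = np.nonzero(expert_ids == e)[0]
        if idx.size:
            out[idx] = x[idx] @ W[e]
    return out

x = rng.standard_normal((128, d_model))            # a "prefill-sized" batch
expert_ids = rng.integers(0, num_experts, 128)     # top-1 routing for brevity
assert np.allclose(decode_style(x, expert_ids), prefill_style(x, expert_ids))
```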
