Description
Mixture of Experts (MoEs) are a class of models that have gained popularity in the open-source AI community due to their pretraining efficiency: for the same compute budget, they make it possible to scale up the model or the dataset size. The article explains the concept of MoEs and how they are composed of sparse MoE layers, which replace dense feed-forward network layers and consist of multiple expert networks. A gate network (router) selects which experts process each token, and how tokens are routed to experts is one of the key design decisions when working with MoEs. The authors also discuss the challenges and tradeoffs of serving MoEs for inference, notably the high VRAM requirement since all experts must be loaded in memory, as well as the difficulties of fine-tuning MoEs and promising recent work on MoE instruction-tuning.
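To make the routing idea concrete, below is a minimal PyTorch sketch of a sparse MoE layer with top-k routing. It is an illustrative simplification under assumed names (`Expert`, `NaiveMoE`) and a naive loop-over-experts dispatch, not the implementation discussed in the article.

```python
# Minimal sketch of a sparse MoE layer with top-k routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A single feed-forward expert network (the unit that replaces part of a dense FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class NaiveMoE(nn.Module):
    """Sparse MoE layer: a gate network (router) picks the top-k experts for each token."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        # All experts are kept in memory, even though only top_k run per token --
        # this is the VRAM tradeoff mentioned in the description above.
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq, d_model) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                       # (n_tokens, n_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over the chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = indices == e                            # which tokens routed to expert e
            token_idx, slot = mask.nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)
```

The per-expert loop is only for readability; production implementations batch and dispatch tokens per expert for efficiency, but the structure is the same: a learned router scores experts per token, only the top-k experts are executed, and their outputs are combined with the (renormalized) routing weights.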