SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
- Zichong Li,
- Chen Liang,
- Zixuan Zhang,
- Ilgee Hong,
- Young Jin Kim,
- Weizhu Chen,
- Tuo Zhao
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, the substantial memory requirements of MoE models make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we propose *SlimMoE*, a multi-stage compression framework that transforms large MoE models into significantly smaller and more efficient variants without the cost of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation typical of one-shot pruning. Using SlimMoE, we compress Phi-3.5-MoE (41.9B total / 6.6B activated parameters) into two smaller models: Phi-mini-MoE (7.6B total / 2.4B activated) and Phi-tiny-MoE (3.8B total / 1.1B activated), using only 400B tokens, less than 10% of the original training data. These models can be fine-tuned on a single GPU (an A100 for Phi-mini-MoE, an A6000 for Phi-tiny-MoE), making them well suited for academic and resource-limited settings. Our experiments show that the compressed models outperform other models of similar size and remain competitive with larger ones. For example, Phi-mini-MoE matches or exceeds the performance of Phi-3-mini while using only two-thirds of the activated parameters, and achieves MMLU scores comparable to LLaMA 3.1 8B with significantly lower latency. These results highlight that structured pruning combined with multi-stage distillation is an effective strategy for building high-quality, compact MoE models, enabling broader adoption of MoE architectures across diverse computational environments.
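The abstract describes the recipe at a high level: structurally slim each expert's feed-forward network, then recover quality with knowledge distillation, repeated over intermediate stages rather than in one shot. Below is a minimal PyTorch sketch of that idea; the `Expert` module, the magnitude-based channel ranking, and the logit-level KL loss are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: slim an MoE expert's FFN intermediate dimension (structured pruning)
# and define a distillation loss for recovering quality from the original model.
# The importance heuristic and loss below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard FFN expert: up-projection, nonlinearity, down-projection."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


def slim_expert(expert: Expert, keep_ratio: float) -> Expert:
    """Keep the top-k intermediate channels of an expert.

    Channels are ranked by a simple weight-magnitude score (an assumed
    heuristic); the slimmed expert reuses the surviving rows/columns.
    """
    d_ff = expert.up.out_features
    k = max(1, int(d_ff * keep_ratio))
    # Score each intermediate channel by its up- and down-projection norms.
    scores = expert.up.weight.norm(dim=1) * expert.down.weight.norm(dim=0)
    keep = scores.topk(k).indices.sort().values
    slim = Expert(expert.up.in_features, k)
    slim.up.weight.data = expert.up.weight.data[keep].clone()
    slim.down.weight.data = expert.down.weight.data[:, keep].clone()
    return slim


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student output distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


if __name__ == "__main__":
    teacher_expert = Expert(d_model=64, d_ff=256)
    # One slimming stage; a multi-stage schedule would repeat this with
    # progressively smaller keep_ratio values, distilling after each stage.
    student_expert = slim_expert(teacher_expert, keep_ratio=0.5)
    print(teacher_expert.up.out_features, "->", student_expert.up.out_features)
```

In the full pipeline the distillation loss would be computed on the whole model's output logits (teacher vs. slimmed student), with each intermediate stage serving as the starting point for the next round of slimming.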