

Significant Performance Improvements in Llama-4-Scout with Latest vLLM Updates
In our previous benchmarking efforts, we observed surprisingly underwhelming performance from Meta's Llama-4-Scout model. Our initial tests showed that Llama-4-Scout was performing below Llama 3.3-70B, which raised questions about Meta's decision to release the model in its current state. However, recent discussions at various machine learning conferences have revealed an important insight that explains these results and offers a path forward for users of this model.
The Deployment Difference
What we learned through these conference discussions is that the benchmark performance issues weren't inherent to the model architecture itself, but stemmed from how the model was being served. While Meta runs Llama-4 on its internal serving infrastructure, virtually all external users deploy it through third-party inference servers such as vLLM, SGLang, and similar frameworks.
This realization pointed to a potential implementation gap in how these serving platforms were handling Llama-4-Scout's unique architecture, particularly its Mixture of Experts (MoE) design.
Recent vLLM Improvements
The vLLM team has been working diligently to address these issues, pushing several critical fixes in recent days. These updates include:
- Tuned FusedMoE kernel configurations (vLLM PR #16488)
- Fixed the Llama4 implementation so QKNorm is no longer shared across model heads (vLLM PR #16257)
- Updates to underlying libraries (vLLM PR #16257)
- Numerous additional optimizations and fixes
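These fixes ship with recent vLLM releases, so picking them up is mostly a matter of reinstalling vLLM and reloading the model. As a rough sketch (the model ID, tensor-parallel size, and prompt below are illustrative assumptions, not our exact benchmark harness), the vLLM offline API looks like this:

```python
# Minimal sketch of loading Llama-4-Scout with an updated vLLM install.
# Assumes: a vLLM version that includes the FusedMoE and QKNorm fixes is
# installed, and the host has enough GPUs for the chosen tensor_parallel_size
# (the value below is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # example HF model ID
    tensor_parallel_size=8,  # adjust to your GPU count
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# A toy prompt; our actual benchmark runs a much larger evaluation set.
outputs = llm.generate(
    ["Summarize the main idea of mixture-of-experts models."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```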
Benchmark Results: Before and After
We were curious whether these fixes would improve model performance, so we re-ran our benchmarks with the updated vLLM version. The results were impressive:
| Metric | Llama-4-Scout-17B-16E (vLLM v0.8.5.post1) | Previously Measured | Change |
|---|---|---|---|
| Accuracy | 51.9% | 46.4% | ↑ 5.5 pp |
| Latency | 1.98 s | 4.89 s | ↓ 60% |
| Hallucination Rate | 0.58% | 0.46% | ↑ 26% (relative) |
These results show substantial gains in both accuracy and latency, alongside a modest uptick in hallucination rate. The 5.5-percentage-point accuracy increase is particularly significant in the context of large language model benchmarking: with the fixes in place, Llama-4-Scout now ranks above the larger Llama 3.3-70B model.
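To make the arithmetic behind the table explicit: accuracy is compared in percentage points, while latency and hallucination rate are compared as relative changes against the previous measurement. A quick sanity check using the numbers above:

```python
# Sanity check for the deltas reported in the table above.
# Accuracy is compared in percentage points; latency and hallucination
# rate are compared as relative changes against the previous measurement.
prev = {"accuracy": 46.4, "latency_s": 4.89, "hallucination": 0.46}
curr = {"accuracy": 51.9, "latency_s": 1.98, "hallucination": 0.58}

accuracy_pp = curr["accuracy"] - prev["accuracy"]                             # +5.5 points
latency_rel = (prev["latency_s"] - curr["latency_s"]) / prev["latency_s"]     # ~60% lower
halluc_rel = (curr["hallucination"] - prev["hallucination"]) / prev["hallucination"]  # ~26% higher

print(f"Accuracy: +{accuracy_pp:.1f} pp")
print(f"Latency: -{latency_rel:.0%}")
print(f"Hallucination rate: +{halluc_rel:.0%}")
```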
Implications for Production Deployments
Based on our testing, we strongly recommend updating your vLLM deployment to the latest version if you're running Llama-4-Scout in production. The accuracy gains and latency reductions make this update particularly valuable for production systems where both response quality and speed are critical factors.
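If you want a guardrail in your deployment scripts, one option is to assert a minimum vLLM version at startup. A minimal sketch, assuming the `packaging` library is available and using the 0.8.5.post1 floor we benchmarked against:

```python
# Guardrail sketch: refuse to start if the installed vLLM predates the
# version we benchmarked with the Llama-4 fixes (0.8.5.post1).
from packaging.version import Version

import vllm

MIN_VLLM = Version("0.8.5.post1")  # the version used in our benchmarks

installed = Version(vllm.__version__)
if installed < MIN_VLLM:
    raise RuntimeError(
        f"vLLM {installed} is older than {MIN_VLLM}; "
        "upgrade (e.g. `pip install -U vllm`) to pick up the Llama-4 fixes."
    )
print(f"vLLM {installed} OK")
```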
Conclusion
This experience highlights an important consideration when evaluating new models: the serving infrastructure can significantly impact performance metrics. In the case of Llama-4-Scout, the model's capabilities were partially obscured by implementation challenges in the serving layer.
The rapid response from the vLLM team demonstrates the strength of the open-source ML ecosystem, where critical improvements can be quickly developed and deployed when performance issues are identified.
For more detailed information about these improvements or assistance with optimizing your Llama-4 deployment, please reach out to our team at ml@digits.com.
Big thanks to Baseten for supporting our benchmarking work.