Quantifying variance in evaluation benchmarks
Lovish Madaan, Aaditya K Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes
arXiv preprint, Under Review, June 2024
Language Models Evaluation Benchmarks Statistics
Summary: Quantifying and understanding variance in LLM evaluation benchmarks.