Robuta

https://lrec.elra.info/lrec2024-main-0735
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language...
May 1, 2024 - Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on t...

https://www.datalearner.com/en/benchmarks
LLM Benchmark Library | MMLU, GSM8K, HumanEval and More | DataLearnerAI
Explore mainstream LLM evaluation benchmarks including AIME 2025, SWE Bench Verified, MMLU, MMLU Pro, GSM8K, HumanEval, MBPP, HellaSwag, ARC, TruthfulQA,...

https://llm-stats.com/benchmarks/multipl-e-humaneval
Multipl-E HumanEval Benchmark Leaderboard
May 12, 2026 - MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across...