https://lrec.elra.info/lrec2024-main-0735
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language...
May 1, 2024 - Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on t...
code generation
https://www.datalearner.com/en/benchmarks
LLM Benchmark Library | MMLU, GSM8K, HumanEval and More | DataLearnerAI
Explore mainstream LLM evaluation benchmarks including AIME 2025, SWE Bench Verified, MMLU, MMLU Pro, GSM8K, HumanEval, MBPP, HellaSwag, ARC, TruthfulQA,...
llm benchmark, and more, library, mmlu, humaneval
https://llm-stats.com/benchmarks/multipl-e-humaneval
Multipl-E HumanEval Benchmark Leaderboard
May 12, 2026 - MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks across...
multiple