https://www.nature.com/articles/s41467-026-70245-1
Evaluating LLMs' divergent thinking capabilities for scientific idea generation with minimal...
Mar 7, 2026 - Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design. For instance, these...
https://huggingface.co/papers/2406.09170
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
https://arxiv.org/abs/2503.21934
[2503.21934] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
https://proceedings.neurips.cc/paper_files/paper/2024/hash/69d97a6493fbf016fff0a751f253ad18-Abstract-Datasets_and_Benchmarks_Track.html
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
https://arxiv.org/abs/2310.18130
[2310.18130] DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues
https://newsroom-deezer.com/2025/04/naacl-gmichel-html/
Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3 - Deezer...
Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional...
https://alignment.anthropic.com/2025/activation-oracles/
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
https://aimojo.io/evaluating-toxicity-llms/
Evaluating Toxicity in LLMs: Can AI Really Be Safe in 2026?
Jan 12, 2026 - LLM toxicity isn’t just a tech issue; it’s about trust, safety, and brand reputation. Learn how experts measure and fight harmful AI outputs in 2026.
https://www.amazon.science/publications/do-llms-recognize-your-preferences-evaluating-personalized-preference-following-in-llms
Do LLMs recognize your preferences? Evaluating personalized preference following in LLMs - Amazon...
Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce...
https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/
Evaluating alignment of behavioral dispositions in LLMs
https://ctrl-gaurav.github.io/LLMThinkBench/
LLMThinkBench - Evaluating Math Reasoning & Overthinking in LLMs
LLMThinkBench: An open-source benchmarking framework for evaluating math reasoning and overthinking in Large Language Models. Leaderboard, analytics, and...
https://simonwillison.net/2025/Nov/24/claude-opus/
Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult
Anthropic released Claude Opus 4.5 this morning, which they call “best model in the world for coding, agents, and computer use”. This is their attempt to...