Robuta

https://www.nature.com/articles/s41467-026-70245-1
Evaluating LLMs' divergent thinking capabilities for scientific idea generation with minimal...
Mar 7, 2026 - Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design. For instance, these...

https://huggingface.co/papers/2406.09170
Paper page - Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

https://arxiv.org/abs/2503.21934
[2503.21934] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

https://proceedings.neurips.cc/paper_files/paper/2024/hash/69d97a6493fbf016fff0a751f253ad18-Abstract-Datasets_and_Benchmarks_Track.html
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

https://arxiv.org/abs/2310.18130
[2310.18130] DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues

https://newsroom-deezer.com/2025/04/naacl-gmichel-html/
Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3 - Deezer...
Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional...

https://alignment.anthropic.com/2025/activation-oracles/
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

https://aimojo.io/evaluating-toxicity-llms/
Evaluating Toxicity in LLMs: Can AI Really Be Safe in 2026?
Jan 12, 2026 - LLM toxicity isn't just a tech issue; it's about trust, safety, and brand reputation. Learn how experts measure and fight harmful AI outputs in 2026.

https://www.amazon.science/publications/do-llms-recognize-your-preferences-evaluating-personalized-preference-following-in-llms
Do LLMs recognize your preferences? Evaluating personalized preference following in LLMs - Amazon...
Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce...

https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/
Evaluating alignment of behavioral dispositions in LLMs

https://ctrl-gaurav.github.io/LLMThinkBench/
LLMThinkBench - Evaluating Math Reasoning & Overthinking in LLMs
LLMThinkBench: An open-source benchmarking framework for evaluating math reasoning and overthinking in Large Language Models. Leaderboard, analytics, and...

https://simonwillison.net/2025/Nov/24/claude-opus/
Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult
Anthropic released Claude Opus 4.5 this morning, which they call “best model in the world for coding, agents, and computer use”. This is their attempt to...