https://www.nature.com/articles/s41467-026-70245-1
Evaluating LLMs' divergent thinking capabilities for scientific idea generation with minimal...
Mar 7, 2026 - Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design. For instance, these...
https://huggingface.co/papers/2406.09170
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
https://arxiv.org/abs/2503.21934
[2503.21934] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
https://proceedings.neurips.cc/paper_files/paper/2024/hash/69d97a6493fbf016fff0a751f253ad18-Abstract-Datasets_and_Benchmarks_Track.html
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
https://arxiv.org/abs/2310.18130
[2310.18130] DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial Issues
https://newsroom-deezer.com/2025/04/naacl-gmichel-html/
Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3 - Deezer...
Large Language Models (LLMs) have shown promising results in a variety of literary tasks, often using complex memorized details of narration and fictional...
https://alignment.anthropic.com/2025/activation-oracles/
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
https://aimojo.io/evaluating-toxicity-llms/
Evaluating Toxicity in LLMs: Can AI Really Be Safe in 2026?
Jan 12, 2026 - LLM toxicity isn’t just a tech issue; it’s about trust, safety, and brand reputation. Learn how experts measure and fight harmful AI outputs in 2026.
https://www.amazon.science/publications/do-llms-recognize-your-preferences-evaluating-personalized-preference-following-in-llms
Do LLMs recognize your preferences? Evaluating personalized preference following in LLMs - Amazon...
Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce...
https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/
Evaluating alignment of behavioral dispositions in LLMs
https://ctrl-gaurav.github.io/LLMThinkBench/
LLMThinkBench - Evaluating Math Reasoning & Overthinking in LLMs
LLMThinkBench: An open-source benchmarking framework for evaluating math reasoning and overthinking in Large Language Models. Leaderboard, analytics, and...
https://simonwillison.net/2025/Nov/24/claude-opus/
Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult
Anthropic released Claude Opus 4.5 this morning, which they call “best model in the world for coding, agents, and computer use”. This is their attempt to...