https://www.sandia.gov/research/publications/details/binsimdb-benchmark-dataset-construction-for-fine-grained-binary-code-simila-2026-01-01/
BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis –...
benchmark dataset, fine grained, binary code, construction, similarity
https://ieeexplore.ieee.org/document/8917818/
An Underwater Image Enhancement Benchmark Dataset and Beyond | IEEE Journals & Magazine | IEEE...
Underwater image enhancement has been attracting much attention due to its significance in marine engineering and aquatic robotics. Numerous underwater image en
ieee journals magazine, image enhancement, benchmark dataset, underwater, beyond
https://proceedings.neurips.cc/paper_files/paper/2024/hash/69d97a6493fbf016fff0a751f253ad18-Abstract-Datasets_and_Benchmarks_Track.html
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
open source benchmark, evaluating llms, offensive security, nyu, ctf
https://aclanthology.org/2026.eacl-long.244/
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational...
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier. Proceedings of the 19th Conference...
benchmark dataset, validation, framework, user, simulators
https://www.semanticscholar.org/search?q=ALHD%3A+A+Large-Scale+and+Multigenre+Benchmark+Dataset+for+Arabic+LLM-Generated+Text+Detection.
ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection. |...
large scale, benchmark dataset, llm generated, text detection, multigenre
https://labs.scale.com/leaderboard/swe_bench_pro_public
SWE-Bench Pro Leaderboard AI Coding Benchmark (Public Dataset) | Scale
Apr 25, 2026 - Compare the resolve rates of GPT-5.4, Muse Spark, Claude Opus 4.6, and Gemini 3.1 Pro on SWE-Bench Pro. A rigorous AI software engineering benchmark for...
swe bench pro, ai coding, leaderboard, benchmark, public
https://research.feedzai.com/publication/benchmark-it-yourself-biy-preparing-a-dataset-and-benchmarking-ai-models-for-scatterplot-related-tasks/
Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related...
Nov 11, 2025 - AI models are increasingly used for data analysis and visualization, yet benchmarks rarely address scatterplot-specific tasks, limiting insight into...
benchmarking ai, biy, preparing, dataset, models
https://writer.com/engineering/omniact-dataset-benchmark-multimodal-autonomous-agents/
OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop...
Dec 11, 2024 - Discover OmniACT, a novel dataset and benchmark for evaluating multimodal generalist autonomous agents on desktop and web applications.
autonomous agents, dataset, benchmark, enabling, multimodal