swe bench - Robuta Search

https://github.com/SWE-bench/experiments/pull/397
Add results for SWE-Bench Lite for Potpie AI by dhirenmathur · Pull Request #397 ·...
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. - Add results for SWE-Bench...

https://huggingface.co/SWE-bench/SWE-agent-LM-32B
SWE-bench/SWE-agent-LM-32B · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

https://epoch.ai/blog/swebench-docker
How to run SWE-bench Verified in one hour on one machine | Epoch AI
We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub actions...

https://arxiv.org/abs/2310.06770
[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Abstract page for arXiv paper 2310.06770: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

https://www.swebench.com/multimodal.html
SWE-bench Multimodal

https://refact.ai/blog/2025/updates-for-may-2025/
May Updates: Top open-source AI Agent on SWE-bench; What sparked CTOs' interest at Dublin Tech Summit?

https://huggingface.co/collections/SWE-bench/swe-bench
SWE-bench - a SWE-bench Collection
SWE-bench (Lite, Verified, Multimodal, Multilingual) all in one place!

https://www.swebench.com/
SWE-bench Leaderboards

https://www.tudingai.com/sites/3310.html
SWE-Bench Pro - A next-generation software engineering AI benchmark. The "Turing test" of AI programming | 图钉AI导航

https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/
Refact.ai Agent achieved leading results on SWE-bench Multimodal and Verified - Refact.ai

https://www.swebench.com/contact.html
Contact SWE-bench Team

https://automatio.ai/models/kimi-k2-6
Kimi k2.6: 1T MoE Model with 80.2% SWE-Bench Score
Kimi k2.6 is Moonshot AI's 1T-parameter MoE model featuring a 256K context window, native video input, and elite performance in autonomous agentic coding.

https://nexu.io/blog/qwen-3-6-35b-a3b-open-source-moe
Qwen 3.6-35B-A3B Is Open: 3B Active Params, 73.4% SWE-bench, Drops Into nexu Tonight — nexu
Apr 20, 2026 - Alibaba open-sourced Qwen3.6-35B-A3B on April 16: a sparse MoE with only 3B active parameters that scores 73.4% on SWE-bench Verified and 1M-token context with...

https://www.swebench.com/verified.html
SWE-bench Verified

https://bito.ai/benchmarks/swe-bench-pro-evaluation/
AI Architect tops SWE-Bench Pro | 35% higher task success | Bito
Apr 24, 2026 - A benchmark-based evaluation of how deep system context boosts coding agent success by 35% on long-horizon tasks in large, real-world codebases.

https://www.marc0.dev/en/leaderboard
SWE-Bench Leaderboard May 2026 | GPT-5.5 Leads at 88.7%

https://www.ai21.com/blog/scaling-agentic-evaluation-swe-bench/
Agentic Evaluation: Lessons from 200,000 SWE-bench Runs
Mar 25, 2026 - How we scaled agentic evaluation to 200,000 SWE-bench runs. Infrastructure design for isolation, throughput, and resumable execution.

https://conf.researchr.org/details/icse-2026/icse-2026-software-engineering-in-practice/29/The-SWE-Bench-Illusion-When-State-of-the-Art-LLMs-Remember-Instead-of-Reason
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason (ICSE 2026 - Software...
Call for papers: The Software Engineering in Practice (SEIP) track of ICSE is the premier venue for practitioners and researchers to discuss insights,...

https://evolink.ai/claude-opus-4-1
Claude Opus 4.1 API: 74.5% SWE-bench, Agentic Coding | EvoLink
Claude Opus 4.1 API — Anthropic's recommended upgrade from Opus 4. 74.5% SWE-bench Verified, multi-file refactoring, and deep reasoning. Access via EvoLink...

https://www.morphllm.com/swe-bench-pro
SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%
Live SWE-Bench Pro rankings with SEAL scores, agent systems, and Verified. The best model scores 46% on Pro but 81% on Verified, because Verified is...

https://www.swebench.com/blog.html
SWE-bench Blog

https://llm-stats.com/benchmarks/swe-bench-verified
SWE-Bench Verified Leaderboard
Jun 3, 2026 - SWE-Bench Verified leaderboard — Claude Mythos Preview leads 92 AI models at 0.939. A verified subset of 500 software engineering problems from real GitHub iss…

https://www.swebench.com/lite.html
SWE-bench Lite

https://automatio.ai/models/qwen3-6-max-preview
Qwen3.6-Max-Preview: 1M Context & Top SWE-Bench Scores
Qwen3.6-Max-Preview is Alibaba's flagship MoE model featuring 1M context, a native thinking mode, and SOTA scores in agentic coding and reasoning.

https://www.augmentcode.com/blog/auggie-tops-swe-bench-pro
Auggie tops SWE-Bench Pro | Augment Code
Feb 4, 2026 - The most powerful AI software development platform with the industry-leading context engine.

https://www.vals.ai/benchmarks/swebench
SWE-bench Verified
Private, domain-specific benchmarks in legal, tax, and finance.

https://www.aitags.cn/sites/1380.html
KAT Coder - An advanced AI coding assistant from Kuaishou's Kwaipilot | 73.4% SWE-Bench | AI标签页
Oct 24, 2025 - KAT Coder is the flagship AI coding model developed by Kuaishou's Kwaipilot team. Built on advanced agentic reinforcement learning and an MoE architecture, it excels at autonomously completing complex software engineering tasks.

https://www.openaitoolshub.org/en/blog/qwen-code-review
Qwen Code Review — Qwen CLI Features, Free Pricing, 69.6% SWE-bench | OpenAIToolsHub
Mar 20, 2026 - Hands-on Qwen Code review — Alibaba's open-source terminal coding agent, Gemini CLI fork, Qwen3-Coder 69.6% SWE-bench, model-agnostic, and completely free....

https://openlm.ai/swe-bench/
SWE-bench + | OpenLM.ai
SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a codebase and an issue, a language...

https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
What skills does SWE-bench Verified evaluate? | Epoch AI
We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its...

https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/
Open-Source Refact.ai Agent is SOTA on SWE-bench Lite With a 60.0% Score - Refact.ai
Refact.ai Agent has achieved the #1 score on SWE-bench Lite — solving 179 out of 300 tasks, for a 60.0% success rate.

https://automatio.ai/models/claude-opus-4-7
Claude Opus 4.7: 1M Context & 87.6% SWE-bench Result
Claude Opus 4.7 is Anthropic's flagship model with a 1-million-token context, adaptive reasoning, and 3.3x vision resolution for enterprise-scale agents.

https://nomosinsights.com/blog/swe-bench-reasoning-annotation-learnings
SWE-Bench Reasoning Annotation: What We Learned from 500+ Trajectories - Nomos Insights Blog |...
Pass or fail only tells you if an AI agent solved a problem. It tells you nothing about how it reasoned, where it went wrong, or what made one agent...

https://winbuzzer.com/2025/11/24/anthropic-launches-claude-opus-4-5-with-80-9-swe-bench-score-and-66-price-drop-xcxwbn/
Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop
Anthropic has released Claude Opus 4.5, claiming an industry-leading 80.9% coding score and introducing...

https://www.swebench.com/index.html
SWE-bench Leaderboards

https://www.tudingai.com/sitetag/swe-bench-pro
SWE-Bench Pro | 图钉AI导航
图钉AI导航 is a directory site focused on collecting hundreds of high-quality free AI tools, including AI writing tools, AI drawing and image-editing tools, AI video and audio tools, AI coding tools, and a number of communities and open platforms, all carefully selected by the author and ready to use. Besides sharing AI products, the site also carries AI news and AI usage tutorials.

https://iclr.cc/virtual/2025/poster/28177
ICLR Poster SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

https://www.openaitoolshub.org/en/blog/gpt-5-4-developer-review
GPT-5.4 for Developers: API Pricing, Computer Use, and SWE-bench 80% | OpenAIToolsHub
Mar 20, 2026 - GPT-5.4 developer review covering the $2.50/$15 API pricing (half Claude Opus cost), 1M context window, 80% SWE-bench, 75% OSWorld computer use, and...

https://www.swebench.com/multilingual-leaderboard.html
SWE-bench Multilingual

https://www.swebench.com/press.html
SWE-bench Press

https://huggingface.co/SWE-bench
SWE-bench (SWE-bench)
Org profile for SWE-bench on Hugging Face, the AI community building the future.

https://arxiv.org/abs/2602.08316
[2602.08316] SWE Context Bench: A Benchmark for Context Learning in Coding
Abstract page for arXiv paper 2602.08316: SWE Context Bench: A Benchmark for Context Learning in Coding

https://www.swebench.com/SWE-bench/guides/docker_setup/
Docker Setup - SWE-bench

https://www.morphllm.com/comparisons/cursor-alternatives
Cursor Alternatives (2026): We Tested 7 Tools and the $0 One Scored 80.8% on SWE-bench
We tested 7 Cursor alternatives on real codebases. The free option scored 80.8% SWE-bench Verified. The $10/mo option runs 3 agents simultaneously. Full...

https://www.swebench.com/SWE-bench/
Overview - SWE-bench