Robuta

https://github.com/SWE-bench/experiments/pull/397
Add results for SWE-Bench Lite for Potpie AI by dhirenmathur · Pull Request #397 ·...
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. - Add results for SWE-Bench...

https://huggingface.co/SWE-bench/SWE-agent-LM-32B
SWE-bench/SWE-agent-LM-32B · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

https://epoch.ai/blog/swebench-docker
How to run SWE-bench Verified in one hour on one machine | Epoch AI
We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub actions...

https://arxiv.org/abs/2310.06770
[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Abstract page for arXiv paper 2310.06770: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

https://www.swebench.com/multimodal.html
SWE-bench Multimodal

https://refact.ai/blog/2025/updates-for-may-2025/
May Updates: Top open-source AI Agent on SWE-bench; What sparked CTOs' interest at Dublin Tech...
May Updates: Top open-source AI Agent on SWE-bench; What sparked CTOs' interest at Dublin Tech Summit?

https://huggingface.co/collections/SWE-bench/swe-bench
SWE-bench - a SWE-bench Collection
SWE-bench (Lite, Verified, Multimodal, Multilingual) all in one place!

https://www.swebench.com/
SWE-bench Leaderboards

https://www.tudingai.com/sites/3310.html
SWE-Bench Pro - A next-generation software engineering AI benchmark suite. The "Turing test" of AI programming | 图钉AI导航

https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/
Refact.ai Agent achieved leading results on SWE-bench Multimodal and Verified - Refact.ai

https://www.swebench.com/contact.html
Contact SWE-bench Team

https://automatio.ai/models/kimi-k2-6
Kimi k2.6: 1T MoE Model with 80.2% SWE-Bench Score
Kimi k2.6 is Moonshot AI's 1T-parameter MoE model featuring a 256K context window, native video input, and elite performance in autonomous agentic coding.

https://nexu.io/blog/qwen-3-6-35b-a3b-open-source-moe
Qwen 3.6-35B-A3B Is Open: 3B Active Params, 73.4% SWE-bench, Drops Into nexu Tonight - nexu
Apr 20, 2026 - Alibaba open-sourced Qwen3.6-35B-A3B on April 16: a sparse MoE with only 3B active parameters that scores 73.4% on SWE-bench Verified and 1M-token context with...

https://www.swebench.com/verified.html
SWE-bench Verified

https://bito.ai/benchmarks/swe-bench-pro-evaluation/
AI Architect tops SWE-Bench Pro | 35% higher task success | Bito
Apr 24, 2026 - A benchmark-based evaluation of how deep system context boosts coding agent success by 35% on long-horizon tasks in large, real-world codebases.

https://www.marc0.dev/en/leaderboard
SWE-Bench Leaderboard May 2026 | GPT-5.5 Leads at 88.7%

https://www.ai21.com/blog/scaling-agentic-evaluation-swe-bench/
Agentic Evaluation: Lessons from 200,000 SWE-bench Runs
Mar 25, 2026 - How we scaled agentic evaluation to 200,000 SWE-bench runs. Infrastructure design for isolation, throughput, and resumable execution.

https://conf.researchr.org/details/icse-2026/icse-2026-software-engineering-in-practice/29/The-SWE-Bench-Illusion-When-State-of-the-Art-LLMs-Remember-Instead-of-Reason
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason (ICSE 2026 - Software...
Call for papers: The Software Engineering in Practice (SEIP) track of ICSE is the premier venue for practitioners and researchers to discuss insights,...

https://evolink.ai/claude-opus-4-1
Claude Opus 4.1 API: 74.5% SWE-bench, Agentic Coding | EvoLink
Claude Opus 4.1 API: Anthropic's recommended upgrade from Opus 4. 74.5% SWE-bench Verified, multi-file refactoring, and deep reasoning. Access via EvoLink...

https://www.morphllm.com/swe-bench-pro
SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%
Live SWE-Bench Pro rankings with SEAL scores, agent systems, and Verified. The best model scores 46% on Pro but 81% on Verified, because Verified is...

https://www.swebench.com/blog.html
SWE-bench Blog

https://llm-stats.com/benchmarks/swe-bench-verified
SWE-Bench Verified Leaderboard
Jun 3, 2026 - SWE-Bench Verified leaderboard: Claude Mythos Preview leads 92 AI models at 0.939. A verified subset of 500 software engineering problems from real GitHub iss…

https://www.swebench.com/lite.html
SWE-bench Lite

https://automatio.ai/models/qwen3-6-max-preview
Qwen3.6-Max-Preview: 1M Context & Top SWE-Bench Scores
Qwen3.6-Max-Preview is Alibaba's flagship MoE model featuring 1M context, a native thinking mode, and SOTA scores in agentic coding and reasoning.

https://www.augmentcode.com/blog/auggie-tops-swe-bench-pro
Auggie tops SWE-Bench Pro | Augment Code
Feb 4, 2026 - The most powerful AI software development platform with the industry-leading context engine.

https://www.vals.ai/benchmarks/swebench
SWE-bench Verified
Private, domain-specific benchmarks in legal, tax, and finance.

https://www.aitags.cn/sites/1380.html
KAT Coder - An advanced AI coding assistant from Kuaishou's Kwaipilot | 73.4% SWE-Bench | AI标签页
Oct 24, 2025 - KAT Coder is a flagship AI coding model developed by Kuaishou's Kwaipilot team. Built on advanced agentic reinforcement learning and an MoE architecture, it excels at autonomously completing complex software engineering tasks.

https://www.openaitoolshub.org/en/blog/qwen-code-review
Qwen Code Review: Qwen CLI Features, Free Pricing, 69.6% SWE-bench | OpenAIToolsHub
Mar 20, 2026 - Hands-on Qwen Code review: Alibaba's open-source terminal coding agent, Gemini CLI fork, Qwen3-Coder 69.6% SWE-bench, model-agnostic, and completely free....

https://openlm.ai/swe-bench/
SWE-bench + | OpenLM.ai
SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a codebase and an issue, a language...

https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
What skills does SWE-bench Verified evaluate? | Epoch AI
We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its...

https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/
Open-Source Refact.ai Agent is SOTA on SWE-bench Lite With a 60.0% Score - Refact.ai
Refact.ai Agent has achieved the #1 score on SWE-bench Lite, solving 179 out of 300 tasks, for a 60.0% success rate.

https://automatio.ai/models/claude-opus-4-7
Claude Opus 4.7: 1M Context & 87.6% SWE-bench Result
Claude Opus 4.7 is Anthropic's flagship model with a 1-million-token context, adaptive reasoning, and 3.3x vision resolution for enterprise-scale agents.

https://nomosinsights.com/blog/swe-bench-reasoning-annotation-learnings
SWE-Bench Reasoning Annotation: What We Learned from 500+ Trajectories - Nomos Insights Blog |...
Pass or fail only tells you if an AI agent solved a problem. It tells you nothing about how it reasoned, where it went wrong, or what made one agent...

https://winbuzzer.com/2025/11/24/anthropic-launches-claude-opus-4-5-with-80-9-swe-bench-score-and-66-price-drop-xcxwbn/
Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop
Anthropic has released Claude Opus 4.5, claiming an industry-leading 80.9% coding score and introducing

https://www.swebench.com/index.html
SWE-bench Leaderboards

https://www.tudingai.com/sitetag/swe-bench-pro
SWE-Bench Pro | 图钉AI导航
图钉AI导航 is a directory site focused on collecting over a hundred high-quality free AI tools, including AI writing tools, AI drawing and image-editing tools, AI video and audio tools, AI coding tools, and a number of community forums and open platforms, all carefully selected by the author and ready to use. Besides sharing AI products, the site also carries AI news and tutorials on using AI.

https://iclr.cc/virtual/2025/poster/28177
ICLR Poster SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

https://www.openaitoolshub.org/en/blog/gpt-5-4-developer-review
GPT-5.4 for Developers: API Pricing, Computer Use, and SWE-bench 80% | OpenAIToolsHub
Mar 20, 2026 - GPT-5.4 developer review covering the $2.50/$15 API pricing (half Claude Opus cost), 1M context window, 80% SWE-bench, 75% OSWorld computer use, and...

https://www.swebench.com/multilingual-leaderboard.html
SWE-bench Multilingual

https://www.swebench.com/press.html
SWE-bench Press

https://huggingface.co/SWE-bench
SWE-bench (SWE-bench)
Org profile for SWE-bench on Hugging Face, the AI community building the future.

https://arxiv.org/abs/2602.08316
[2602.08316] SWE Context Bench: A Benchmark for Context Learning in Coding
Abstract page for arXiv paper 2602.08316: SWE Context Bench: A Benchmark for Context Learning in Coding

https://www.swebench.com/SWE-bench/guides/docker_setup/
Docker Setup - SWE-bench

https://www.morphllm.com/comparisons/cursor-alternatives
Cursor Alternatives (2026): We Tested 7 Tools and the $0 One Scored 80.8% on SWE-bench
We tested 7 Cursor alternatives on real codebases. The free option scored 80.8% SWE-bench Verified. The $10/mo option runs 3 agents simultaneously. Full...

https://www.swebench.com/SWE-bench/
Overview - SWE-bench