https://github.com/SWE-bench/experiments/pull/397
Add results for SWE-Bench Lite for Potpie AI by dhirenmathur · Pull Request #397 ·...
Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. - Add results for SWE-Bench...
https://huggingface.co/SWE-bench/SWE-agent-LM-32B
SWE-bench/SWE-agent-LM-32B · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://epoch.ai/blog/swebench-docker
How to run SWE-bench Verified in one hour on one machine | Epoch AI
We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub actions...
https://arxiv.org/abs/2310.06770
[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Abstract page for arXiv paper 2310.06770: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com/multimodal.html
SWE-bench Multimodal
https://refact.ai/blog/2025/updates-for-may-2025/
May Updates: Top open-source AI Agent on SWE-bench; What sparked CTOs' interest at Dublin Tech Summit?
https://huggingface.co/collections/SWE-bench/swe-bench
SWE-bench - a SWE-bench Collection
SWE-bench (Lite, Verified, Multimodal, Multilingual) all in one place!
https://www.swebench.com/
SWE-bench Leaderboards
https://www.tudingai.com/sites/3310.html
SWE-Bench Pro - A next-generation software engineering AI benchmark suite. The "Turing test" of AI programming | 图钉AI导航
https://refact.ai/blog/2025/1-agent-on-swe-bench-verified-using-claude-4-sonnet/
Refact.ai Agent achieved leading results on SWE-bench Multimodal and Verified - Refact.ai
https://www.swebench.com/contact.html
Contact SWE-bench Team
https://automatio.ai/models/kimi-k2-6
Kimi k2.6: 1T MoE Model with 80.2% SWE-Bench Score
Kimi k2.6 is Moonshot AI's 1T-parameter MoE model featuring a 256K context window, native video input, and elite performance in autonomous agentic coding.
https://nexu.io/blog/qwen-3-6-35b-a3b-open-source-moe
Qwen 3.6-35B-A3B Is Open: 3B Active Params, 73.4% SWE-bench, Drops Into nexu Tonight — nexu
Apr 20, 2026 - Alibaba open-sourced Qwen3.6-35B-A3B on April 16: a sparse MoE with only 3B active parameters that scores 73.4% on SWE-bench Verified and 1M-token context with...
https://www.swebench.com/verified.html
SWE-bench Verified
https://bito.ai/benchmarks/swe-bench-pro-evaluation/
AI Architect tops SWE-Bench Pro | 35% higher task success | Bito
Apr 24, 2026 - A benchmark-based evaluation of how deep system context boosts coding agent success by 35% on long-horizon tasks in large, real-world codebases.
https://www.marc0.dev/en/leaderboard
SWE-Bench Leaderboard May 2026 | GPT-5.5 Leads at 88.7%
https://www.ai21.com/blog/scaling-agentic-evaluation-swe-bench/
Agentic Evaluation: Lessons from 200,000 SWE-bench Runs
Mar 25, 2026 - How we scaled agentic evaluation to 200,000 SWE-bench runs. Infrastructure design for isolation, throughput, and resumable execution.
https://conf.researchr.org/details/icse-2026/icse-2026-software-engineering-in-practice/29/The-SWE-Bench-Illusion-When-State-of-the-Art-LLMs-Remember-Instead-of-Reason
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason (ICSE 2026 - Software...
Call for papers The Software Engineering in Practice (SEIP) track of ICSE is the premier venue for practitioners and researchers to discuss insights,...
https://evolink.ai/claude-opus-4-1
Claude Opus 4.1 API: 74.5% SWE-bench, Agentic Coding | EvoLink
Claude Opus 4.1 API — Anthropic's recommended upgrade from Opus 4. 74.5% SWE-bench Verified, multi-file refactoring, and deep reasoning. Access via EvoLink...
https://www.morphllm.com/swe-bench-pro
SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%
Live SWE-Bench Pro rankings with SEAL scores, agent systems, and Verified. The best model scores 46% on Pro but 81% on Verified, because Verified is...
https://www.swebench.com/blog.html
SWE-bench Blog
https://llm-stats.com/benchmarks/swe-bench-verified
SWE-Bench Verified Leaderboard
Jun 3, 2026 - SWE-Bench Verified leaderboard — Claude Mythos Preview leads 92 AI models at 0.939. A verified subset of 500 software engineering problems from real GitHub iss…
https://www.swebench.com/lite.html
SWE-bench Lite
https://automatio.ai/models/qwen3-6-max-preview
Qwen3.6-Max-Preview: 1M Context & Top SWE-Bench Scores
Qwen3.6-Max-Preview is Alibaba's flagship MoE model featuring 1M context, a native thinking mode, and SOTA scores in agentic coding and reasoning.
https://www.augmentcode.com/blog/auggie-tops-swe-bench-pro
Auggie tops SWE-Bench Pro | Augment Code
Feb 4, 2026 - The most powerful AI software development platform with the industry-leading context engine.
https://www.vals.ai/benchmarks/swebench
SWE-bench Verified
Private, domain-specific benchmarks in legal, tax, and finance.
https://www.aitags.cn/sites/1380.html
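KAT Coder - An advanced AI coding assistant from Kuaishou's Kwaipilot | 73.4% SWE-Bench | AI标签页
Oct 24, 2025 - KAT Coder is the flagship AI coding model developed by Kuaishou's Kwaipilot team. Built on advanced agentic reinforcement learning and an MoE architecture, it excels at autonomously completing complex software engineering tasks.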
https://www.openaitoolshub.org/en/blog/qwen-code-review
Qwen Code Review — Qwen CLI Features, Free Pricing, 69.6% SWE-bench | OpenAIToolsHub
Mar 20, 2026 - Hands-on Qwen Code review — Alibaba's open-source terminal coding agent, Gemini CLI fork, Qwen3-Coder 69.6% SWE-bench, model-agnostic, and completely free....
https://openlm.ai/swe-bench/
SWE-bench + | OpenLM.ai
SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a codebase and an issue, a language...
https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
What skills does SWE-bench Verified evaluate? | Epoch AI
We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its...
https://refact.ai/blog/2025/sota-on-swe-bench-lite-open-source-refact-ai/
Open-Source Refact.ai Agent is SOTA on SWE-bench Lite With a 60.0% Score - Refact.ai
Refact.ai Agent has achieved the #1 score on SWE-bench Lite — solving 179 out of 300 tasks, for a 60.0% success rate.
https://automatio.ai/models/claude-opus-4-7
Claude Opus 4.7: 1M Context & 87.6% SWE-bench Result
Claude Opus 4.7 is Anthropic's flagship model with a 1-million-token context, adaptive reasoning, and 3.3x vision resolution for enterprise-scale agents.
https://nomosinsights.com/blog/swe-bench-reasoning-annotation-learnings
SWE-Bench Reasoning Annotation: What We Learned from 500+ Trajectories - Nomos Insights Blog |...
Pass or fail only tells you if an AI agent solved a problem. It tells you nothing about how it reasoned, where it went wrong, or what made one agent...
https://winbuzzer.com/2025/11/24/anthropic-launches-claude-opus-4-5-with-80-9-swe-bench-score-and-66-price-drop-xcxwbn/
Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop
Anthropic has released Claude Opus 4.5, claiming an industry-leading 80.9% coding score and introducing
https://www.swebench.com/index.html
SWE-bench Leaderboards
https://www.tudingai.com/sitetag/swe-bench-pro
SWE-Bench Pro | 图钉AI导航
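图钉AI导航 is a navigation site dedicated to cataloging more than a hundred high-quality free AI tools, including AI writing tools, AI image generation and editing tools, AI video and audio tools, AI coding tools, and various communities and open platforms, all carefully selected by the author and ready to use. Besides sharing AI products, the site also carries AI news and AI usage tutorials.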
https://iclr.cc/virtual/2025/poster/28177
ICLR Poster SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
https://www.openaitoolshub.org/en/blog/gpt-5-4-developer-review
GPT-5.4 for Developers: API Pricing, Computer Use, and SWE-bench 80% | OpenAIToolsHub
Mar 20, 2026 - GPT-5.4 developer review covering the $2.50/$15 API pricing (half Claude Opus cost), 1M context window, 80% SWE-bench, 75% OSWorld computer use, and...
https://www.swebench.com/multilingual-leaderboard.html
SWE-bench Multilingual
https://www.swebench.com/press.html
SWE-bench Press
https://huggingface.co/SWE-bench
SWE-bench (SWE-bench)
Org profile for SWE-bench on Hugging Face, the AI community building the future.
https://arxiv.org/abs/2602.08316
[2602.08316] SWE Context Bench: A Benchmark for Context Learning in Coding
Abstract page for arXiv paper 2602.08316: SWE Context Bench: A Benchmark for Context Learning in Coding
https://www.swebench.com/SWE-bench/guides/docker_setup/
Docker Setup - SWE-bench
https://www.morphllm.com/comparisons/cursor-alternatives
Cursor Alternatives (2026): We Tested 7 Tools and the $0 One Scored 80.8% on SWE-bench
We tested 7 Cursor alternatives on real codebases. The free option scored 80.8% SWE-bench Verified. The $10/mo option runs 3 agents simultaneously. Full...
https://www.swebench.com/SWE-bench/
Overview - SWE-bench