February 3, 2026

Claude Opus 4.6 vs GPT-5.1 vs Gemini 3.5: The February 2026 Benchmark Battle That Changes Everything

Seven major model releases in a single month. That’s what February 2026 just delivered — and the AI landscape has never looked this competitive. Claude Opus […]
November 5, 2025

GPT-5.1-Codex-Max: 5 Breakthroughs Behind OpenAI’s 80% SWE-bench AI Coding Model

80% accuracy on SWE-bench. 24 hours of continuous coding. Zero degradation. GPT-5.1-Codex-Max is OpenAI’s most ambitious agentic coding model to date, launched on November 20, 2025, […]
October 30, 2025

Claude Haiku 4.5 Release: Fastest Claude Model Gets Quality Boost with 73.3% SWE-bench Score

A $1-per-million-token model just matched the coding quality of a model that costs three times more. Claude Haiku 4.5, released on October 15, 2025, isn’t just […]
September 30, 2025

Claude Sonnet 4.5 Release: 77.2% SWE-bench Score and 30-Hour Autonomous Agents — What Changed

Anthropic just dropped Claude Sonnet 4.5, and the numbers speak for themselves: 77.2% on SWE-bench Verified, 61.4% on OSWorld, and agents that can stay focused for […]
September 2, 2025

Claude Sonnet 4.5 Benchmark Deep Dive: 77.2% SWE-bench Crushes GPT-5 and Gemini

77.2% on SWE-bench Verified. That single number just rewrote the rules of the AI coding model market. Anthropic’s Claude Sonnet 4.5 benchmark results don’t just represent […]
September 1, 2025

Claude Sonnet 4.5 Release: 77.2% SWE-bench Score, 30-Hour Autonomous Coding, and Why Developers Are Switching

Anthropic just mass-deployed its most dangerous weapon in the AI coding wars — and it costs exactly the same as the model it replaces. Claude Sonnet […]
August 7, 2025

GPT-5 SWE-bench Coding Performance Hits 74.9% — But Real-World Tests Tell a Different Story

SWE-bench Verified: 74.9%. Aider Polyglot: 88%. Multi-file refactoring: 91%. Looking at GPT-5’s coding benchmarks alone, you’d think OpenAI just cracked the code on AI-assisted development. But […]
June 6, 2025

Claude 3.5 Sonnet Agentic Coding: How 49% on SWE-bench Rewrote the Rules for AI Developer Tools

A year ago, most developers treated AI coding assistants as glorified autocomplete. Then Anthropic dropped a model that scored 49% on SWE-bench Verified — solving nearly […]