Achieving SOTA on SWE-bench

19 May 2025 / Blog
Author: Yawar Aziz

Our agent improved from 54.2% to 70.2% on SWE-bench Verified through iterative enhancements including regression testing, self-critique, and multi-LLM collaboration. This post shares our approach and learnings from each milestone.

SWE-bench is a benchmark that evaluates AI agents on real-world software engineering tasks. Each task involves resolving a real issue from a popular GitHub repository, and success is judged by human-verified test cases: a solution counts as resolved only if the associated unit tests pass. More details can be found at swebench.com.

We made three key submissions to SWE-bench, each representing a significant improvement in our approach. Each iteration addressed specific limitations we identified and incorporated new techniques to enhance performance.


May 2025 (70.2%) -> Multi-LLM Collaboration

In this iteration, we confronted a key limitation: the performance ceiling of single-model agents. No matter how well a model is tuned or prompted, its perspective is bounded. To break through, we designed a hybrid strategy that leverages multiple LLMs.

Roughly, the process was as follows (a rough sketch of the orchestration follows the diagram below):

  • devlo's base agent handled initial repo exploration and issue reproduction
  • It then sent key inputs to three different LLMs - each producing an independent solution.
  • The base agent assessed each candidate (or combined them) to produce the final patch.
  • Regression testing was again used to validate correctness.
Multi-LLM Flow
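
To make the flow concrete, here is a minimal sketch of the orchestration. It is illustrative only: `query_model`, the prompt wording, and the function names are placeholders for whatever clients and prompts the agent actually uses, none of which are published here.

```python
# Illustrative sketch only: query_model() and the prompts below are placeholders,
# not devlo's actual interfaces or prompt templates.

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to one LLM; returns that model's proposed patch."""
    raise NotImplementedError


def collect_candidates(context: str, issue: str, models: list[str]) -> list[str]:
    """Send the same key inputs to several LLMs and gather independent solutions."""
    prompt = (
        f"Repository context:\n{context}\n\n"
        f"Issue to resolve:\n{issue}\n\n"
        "Propose a patch that fixes the issue."
    )
    return [query_model(m, prompt) for m in models]


def finalize(base_model: str, issue: str, candidates: list[str]) -> str:
    """Have the base agent assess the candidates and select or merge them."""
    suggestions = "\n\n".join(
        f"### Suggestion {i + 1}\n{patch}" for i, patch in enumerate(candidates)
    )
    prompt = (
        f"You are resolving this issue:\n{issue}\n\n"
        f"Independent suggestions from other models:\n{suggestions}\n\n"
        "Analyze the suggestions, pick the strongest one or combine their ideas, "
        "and output the final patch."
    )
    return query_model(base_model, prompt)
```

The selected or merged patch then goes through the same regression-testing step described later in this post.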

Why do multiple LLMs help?

Different LLMs tend to have non-overlapping training data, inductive biases, and reasoning habits. What one model might miss or misinterpret, another may surface correctly. This diversity of thought allows the agent to triangulate a more accurate solution. In some cases, two models may independently propose similar fixes, giving the agent more confidence in those outcomes (akin to a soft majority vote, though not explicitly hardcoded).

In other scenarios, the agent can intelligently merge multiple approaches, resolving weaknesses or blind spots in any one model's output. This ensemble-style strategy broadens the solution space and improves reliability under a wide range of issue types.

Here is a log from one of the runs which shows the multi-LLM collaboration in action:

django__django-15863.log
Let's analyze the suggestions and formulate a plan to fix the issue:

### Understanding the Problem

The issue is that the 'floatformat' template filter in Django loses precision when working with Decimal numbers. This happens because when the initial conversion to Decimal fails, it tries to convert the value to a float first, then to a Decimal - which loses precision due to the limited precision of floating-point numbers.

### Common Theme in All Suggestions

All suggestions propose checking whether the input is already a Decimal object, and if so, using it directly instead of trying to convert it.

Let's analyze each suggestion:

1. **Suggestion 1**: Adds a check for Decimal type and uses it directly if found. Maintains the current behavior for all other types. 

2. **Suggestion 2**: Similar to the first suggestion, but structures the code differently by checking for Decimal type first and keeping the variable 'input_val = repr(text)' for consistent usage.

3. **Suggestion 3**: A more comprehensive solution that handles different types specifically. It treats floats differently from other types (using 'str(text)' for float conversion to Decimal).

All three suggestions are reasonable, but I prefer a combination of aspects from suggestions 2 and 3:

- Check if 'text' is already a Decimal first (from suggestion 2)
- Handle floats specifically, converting them to strings first for more accurate decimal representation (from suggestion 3)
- Simplify the exception handling

Let's implement this approach.
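
To ground the discussion, here is a minimal sketch of the combined approach the agent settles on: use an existing Decimal directly, and route floats through str() so the Decimal does not inherit binary floating-point error. It is written as a standalone helper for illustration and is not the actual patch submitted against Django's floatformat filter.

```python
from decimal import Decimal, InvalidOperation

def to_decimal(value):
    """Illustrative helper mirroring the chosen approach, not the actual Django patch."""
    if isinstance(value, Decimal):
        # Suggestion 2: an existing Decimal is used as-is, preserving full precision.
        return value
    if isinstance(value, float):
        # Suggestion 3: converting via str() avoids dragging binary floating-point
        # error into the Decimal (compare Decimal(0.1) with Decimal("0.1")).
        return Decimal(str(value))
    try:
        # Ints and numeric strings parse cleanly.
        return Decimal(str(value))
    except InvalidOperation:
        try:
            # Last resort: coerce through float, as the original filter did.
            return Decimal(str(float(value)))
        except (ValueError, TypeError):
            return None  # the real template filter returns "" for unusable input

# Precision is preserved instead of being truncated by a float round-trip.
print(to_decimal(Decimal("42.12345678901234567890")))
# -> 42.12345678901234567890
```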

This strategy enabled devlo to synthesize diverse perspectives efficiently, without bloating inference cost. The result: a new state-of-the-art 70.2% on SWE-bench Verified.


Dec 2024 (58.2%) -> Checkpoints, Self-Critique, and Image Support

We realized that the model frequently went off course, particularly during longer tasks, often getting trapped in inefficient solution paths. To address this, we enhanced our approach with three mechanisms:

  • Conversation Checkpoints: Regular checkpoints were embedded within tasks to help the model stay anchored. These acted as milestones to periodically reassess the direction of problem-solving.
  • Self-Criticism: The agent was prompted to critically reflect on its ongoing attempts. Because its understanding of the problem deepened as it worked, it could pivot to alternative solutions when a particular approach became unproductive (a rough sketch of this loop follows the list).
  • Image Interpretation: We enabled basic visual processing capabilities, allowing the agent to interpret screenshots that provided additional context. This was useful in repositories like matplotlib where the issue was sometimes described with a screenshot of the error.
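
Below is a rough sketch of how checkpoint and self-critique prompts might be woven into the agent loop. The interval, the prompt wording, and the `llm_step` interface are assumptions for illustration; the actual mechanism is not published.

```python
# Illustrative sketch: the cadence, prompt text, and llm_step() interface are
# assumptions, not devlo's actual implementation.

CHECKPOINT_EVERY = 5  # assumed interval between checkpoints

CHECKPOINT_PROMPT = (
    "Checkpoint: restate the goal, summarize what you have tried so far, and "
    "critique your current approach. If progress has stalled, propose a "
    "different approach before continuing."
)

def run_agent(llm_step, max_steps: int = 50):
    """Drive the agent loop, injecting a self-critique message at fixed intervals.

    llm_step(messages) stands in for one agent turn (tool calls plus the model's
    reply); it returns the new message and whether the task is finished.
    """
    messages = []
    for step in range(1, max_steps + 1):
        if step % CHECKPOINT_EVERY == 0:
            # Anchor the model: force it to reassess direction before proceeding.
            messages.append({"role": "user", "content": CHECKPOINT_PROMPT})
        reply, done = llm_step(messages)
        messages.append(reply)
        if done:
            break
    return messages
```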

Together, these refinements improved consistency and helped devlo achieve 58.2%, another state-of-the-art score, though a short-lived one.


Nov 2024 (54.2%) -> Baseline + Regression Testing

We began by adapting the structured pipeline popularized by Anthropic's Sonnet v2 framework in this blog post. Their approach emphasized a self-directed, exploratory agent workflow: the agent navigates the codebase to understand its structure and identify the relevant files, reproduces the issue, and iterates on a fix until the issue is resolved, handling edge cases along the way. This represented a paradigm shift from lengthy, highly specific prompt engineering to a more self-driven agent approach. In fact, providing high-level instructions proved far more effective, giving the model the autonomy to reason freely and find optimal solutions independently. Basically, we had to get out of the agent's way.
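
As an illustration of what "high-level instructions" means in practice, a prompt in this spirit might look like the sketch below. It is our paraphrase of the style, not the actual prompt used by devlo or by Anthropic's reference agent.

```python
# Paraphrased illustration of a high-level, self-directed prompt; not the actual
# prompt used by devlo or by Anthropic's reference agent.
SYSTEM_PROMPT = """\
You are a software engineering agent working inside a checked-out repository.

Given the issue report below:
1. Explore the repository to understand its structure and find the relevant code.
2. Write a small script that reproduces the issue and confirm that it fails.
3. Implement a fix, then re-run the reproduction script until it passes.
4. Think about edge cases and make sure related behaviour still works.

Keep the change minimal and avoid modifying unrelated code.
"""
```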

However, we introduced an important enhancement to the approach: regression testing.

Regression testing is an engineering best practice to make sure code changes do not break other parts of the codebase, and it is a core component of CI/CD pipelines. It is even more relevant to the benchmark, since the benchmark measures performance through unit tests (PASS_TO_PASS and FAIL_TO_PASS). Specifically, this approach helps with the PASS_TO_PASS tests, i.e., tests that already pass before the fix is applied and must continue to pass afterwards. As far as we know, this was the first time regression testing was integrated into a SWE-bench submission; it later became standard in subsequent submissions.

Process: after confirming the fix, the agent identified and ran the relevant unit test files, checking for unintended side effects or regressions introduced by its changes. If failures surfaced during these tests, the agent iteratively refined its solution until the tests passed again. This enhancement contributed significantly to our initial strong performance, placing devlo near the top of the leaderboard at 54.2%, a state-of-the-art score at the time.
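
A minimal sketch of this loop is below, assuming a pytest-style test runner and placeholder agent helpers (propose_fix, find_relevant_tests, apply, refine_fix); the real agent picks the test command per repository (Django, for instance, has its own test runner).

```python
# Illustrative sketch: the agent helper methods and the pytest invocation are
# assumptions; real repositories may need repo-specific test commands.
import subprocess

def run_tests(test_paths: list[str]) -> tuple[bool, str]:
    """Run the selected test files and return (all_passed, combined_output)."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_paths],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def fix_with_regression_check(agent, issue, max_attempts: int = 3):
    """Apply a fix, then refine it until previously-passing tests still pass."""
    patch = agent.propose_fix(issue)               # placeholder agent interface
    tests = agent.find_relevant_tests(issue)       # e.g. tests covering the changed files
    for _ in range(max_attempts):
        agent.apply(patch)
        passed, log = run_tests(tests)
        if passed:
            return patch                           # no regressions observed
        patch = agent.refine_fix(issue, log)       # feed failures back and retry
    return patch
```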