
OpenAI’s SWE-Lancer Benchmark


Establishing benchmarks that faithfully replicate real-world tasks is essential in the rapidly developing field of artificial intelligence, particularly in software engineering. Samuel Miserendino and colleagues developed the SWE-Lancer benchmark to evaluate how well large language models (LLMs) perform freelance software engineering tasks. More than 1,400 tasks totaling $1 million USD in payouts were drawn from Upwork to create this benchmark, which is intended to evaluate both managerial and individual contributor (IC) work.

What is the SWE-Lancer Benchmark?

SWE-Lancer encompasses a diverse range of tasks, from simple bug fixes to complex feature implementations. The benchmark is structured to provide a realistic evaluation of LLMs by using end-to-end tests that mirror the actual freelance review process. The tasks are graded with tests developed by experienced software engineers, ensuring a high standard of evaluation.

Features of SWE-Lancer

  • Real-World Payouts: The tasks in SWE-Lancer correspond to actual payouts to freelance engineers, providing a natural difficulty gradient.
  • Management Assessment: The benchmark evaluates models' ability to act as technical leads by having them choose the best implementation proposal from among independent contractors' submissions.
  • Advanced Full-Stack Engineering: Reflecting the complexity of real-world software engineering, tasks require a thorough understanding of both front-end and back-end development.
  • Better Grading via End-to-End Tests: SWE-Lancer employs end-to-end tests developed by qualified engineers, offering a more thorough assessment than earlier benchmarks that relied on unit tests.

Why is SWE-Lancer Important?

The launch of SWE-Lancer fills an important gap in AI evaluation: the ability to assess models on tasks that reflect the intricacies of real software engineering jobs. Earlier benchmarks, which often concentrated on isolated tasks, did not adequately capture the multidimensional character of real-world projects. By using actual freelance jobs, SWE-Lancer offers a more realistic assessment of model performance.

Evaluation Metrics

Model performance is evaluated based on the percentage of tasks resolved and the total payout earned. The monetary value associated with each task reflects the true difficulty and complexity of the work involved.
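To make the scoring concrete, here is a minimal sketch, with assumed data structures rather than OpenAI's actual code, of how these two headline numbers could be computed:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    payout_usd: float  # the real freelance price attached to the task
    resolved: bool     # True if the model's solution passed grading

def score(results: list[TaskResult]) -> dict:
    """Compute the share of tasks resolved and the payout earned."""
    total_value = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.resolved)
    resolved_rate = sum(r.resolved for r in results) / len(results)
    return {
        "resolved_rate": resolved_rate,        # e.g. 0.262 for 26.2%
        "earned_usd": earned,                  # e.g. 208050.0
        "earned_share": earned / total_value,  # fraction of available payout
    }
```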

Example Tasks

  • $250 Reliability Improvement: Fixing a double-triggered API call.
  • $1,000 Bug Fix: Resolving permissions discrepancies.
  • $16,000 Feature Implementation: Adding support for in-app video playback across multiple platforms.

The SWE-Lancer dataset contains 1,488 real-world freelance software engineering tasks, drawn from the Expensify open-source repository and originally posted on Upwork. These tasks, with a combined value of $1 million USD, are categorized into two groups:

Individual Contributor (IC) Software Engineering (SWE) Tasks

This subset consists of 764 software engineering tasks, worth a total of $414,775, designed to represent the work of individual contributor software engineers. These tasks involve typical IC duties such as implementing new features and fixing bugs. For each task, a model is provided with:

  • A detailed description of the issue, including reproduction steps and the desired behavior.
  • A codebase checkpoint representing the state before the issue was fixed.
  • The objective of fixing the issue.

The model's proposed solution (a patch) is evaluated by applying it to the provided codebase and running all associated end-to-end tests using Playwright. Critically, the model does not have access to these end-to-end tests while generating its solution.

Evaluation flow for IC SWE tasks; the model only earns the payout if all applicable tests pass.
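That flow can be approximated in a few lines. The sketch below is only an illustration under assumed interfaces (the command names and function signature are not taken from the paper): apply the model's patch to the checkpointed codebase, run the engineer-written Playwright end-to-end tests, and pay out only if everything passes.

```python
import subprocess

def grade_ic_task(repo_dir: str, patch_file: str, payout_usd: float) -> float:
    """Hypothetical grader for an IC SWE task."""
    # Apply the model-generated patch to the pre-fix codebase checkpoint.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return 0.0  # the patch does not apply cleanly, so no payout

    # Run the hidden end-to-end tests (written by professional engineers
    # with Playwright); the model never saw these while producing its patch.
    tests = subprocess.run(["npx", "playwright", "test"], cwd=repo_dir)

    # The payout is earned only if every applicable test passes.
    return payout_usd if tests.returncode == 0 else 0.0
```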

SWE Management Tasks

This subset, consisting of 724 tasks valued at $585,225, challenges a model to act as a software engineering manager. The model is presented with a software engineering task and must choose the best solution from several options. Specifically, the model receives:

  • Several proposed solutions to the same issue, taken directly from real discussions.
  • A snapshot of the codebase as it existed before the issue was resolved.
  • The overall objective of selecting the best solution.

The model's chosen solution is then compared against the actual, ground-truth best solution to evaluate its performance. Importantly, a separate validation study with professional software engineers found a 99% agreement rate with the original "best" solutions.

Evaluation flow for SWE Manager tasks; during proposal selection, the model is able to browse the codebase.
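Grading a manager task is comparatively simple. A hedged sketch with assumed data shapes (not the benchmark's actual format) looks like this:

```python
def grade_manager_task(model_choice: int,
                       ground_truth_best: int,
                       payout_usd: float) -> float:
    """Hypothetical grader for a SWE Manager task.

    The model browses the codebase, reads the competing freelancer
    proposals, and returns the index of the one it judges best. It earns
    the payout only if that choice matches the ground-truth selection
    (which expert engineers agreed with 99% of the time).
    """
    return payout_usd if model_choice == ground_truth_best else 0.0
```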


Model Performance

The benchmark has been tested on several state-of-the-art models, including OpenAI's GPT-4o and o1, and Anthropic's Claude 3.5 Sonnet. The results indicate that while these models show promise, they still struggle with many tasks, particularly those requiring deep technical understanding and context.

Performance Metrics

  • Claude 3.5 Sonnet: Achieved a score of 26.2% on IC SWE tasks and 44.9% on SWE Management tasks, earning a total of $208,050 out of $500,800 possible on the SWE-Lancer Diamond set.
  • GPT-4o: Showed lower performance, particularly on IC SWE tasks, highlighting the challenges LLMs face in real-world applications.
  • o1: Showed intermediate performance, earning over $380K and performing better than GPT-4o.

Total payouts earned by each model on the full SWE-Lancer dataset, including both IC SWE and SWE Manager tasks.

Results

The table shows the performance of different language models (GPT-4o, o1, Claude 3.5 Sonnet) on the SWE-Lancer dataset, broken down by task type (IC SWE, SWE Manager) and dataset size (Diamond, Full). It compares their pass@1 accuracy (how often the top generated solution is correct) and earnings (based on task value). The "User Tool" column indicates whether the model had access to an external user tool, and "Reasoning Effort" reflects the level of effort allowed during solution generation. Overall, Claude 3.5 Sonnet generally achieves the highest pass@1 accuracy and earnings across task types and dataset sizes, while using the user tool and increasing reasoning effort tends to improve performance. The blue and green highlighting emphasizes overall and baseline metrics, respectively.

The table displays performance metrics, specifically pass@1 accuracy and earnings. Overall metrics for the Diamond and Full SWE-Lancer sets are highlighted in blue, while baseline performance for the IC SWE (Diamond) and SWE Manager (Diamond) subsets is highlighted in green.
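For readers unfamiliar with the metric, pass@1 here can be read as the fraction of tasks whose single generated solution passes grading, one attempt per task with no retries; a minimal sketch under that reading:

```python
def pass_at_1(graded: list[bool]) -> float:
    """Fraction of tasks whose single attempted solution passed grading."""
    return sum(graded) / len(graded) if graded else 0.0

# Example: three of four tasks solved on the first (and only) attempt.
print(pass_at_1([True, False, True, True]))  # 0.75
```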

Limitations of SWE-Lancer

SWE-Lancer, while valuable, has several limitations:

  • Diversity of Repositories and Tasks: Tasks were sourced solely from Upwork and the Expensify repository. This limits the evaluation's scope; infrastructure engineering tasks in particular are underrepresented.
  • Scope: Freelance tasks are often more self-contained than full-time software engineering work. Although the Expensify repository reflects real-world engineering, caution is required when generalizing findings beyond freelance contexts.
  • Modalities: The evaluation is text-only, without considering how visual aids like screenshots or videos might improve model performance.
  • Environments: Models cannot ask clarifying questions, which may hinder their understanding of task requirements.
  • Contamination: The potential for contamination exists due to the public nature of the tasks. To ensure accurate evaluations, browsing should be disabled and post-hoc filtering for cheating is necessary. Analysis indicates limited contamination impact for tasks predating model knowledge cutoffs.

Future Work

SWE-Lancer presents several opportunities for future research:

  • Economic Analysis: Future studies could investigate the societal impacts of autonomous agents on labor markets and productivity, comparing freelancer payouts to the API costs of task completion.
  • Multimodality: Multimodal inputs, such as screenshots and videos, are not supported by the current framework. Future analyses that include these elements may offer a more thorough appraisal of model performance in practical situations.

You can find the full research paper here.

Conclusion

SWE-Lancer represents a significant advancement in the evaluation of LLMs for software engineering tasks. By incorporating real-world freelance tasks and rigorous testing standards, it provides a more accurate assessment of model capabilities. The benchmark not only facilitates research into the economic impact of AI in software engineering but also highlights the challenges that remain in deploying these models in practical applications.

Harsh Mishra is an AI/ML Engineer who spends more time talking to Large Language Models than actual humans. Passionate about GenAI, NLP, and making machines smarter (so they don't replace him just yet). When not optimizing models, he's probably optimizing his coffee consumption. 🚀☕
