<aside> <img src="attachment:9714e458-a495-4278-8a33-ba0732bbfbb1:AvtJ4GuPAyhLxl2-leVt6.jpeg.webp" alt="Skywork AI" width="40px" />

Skywork AI

Core Contributors: Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Xuchen Song, Yang Liu

Contributors: Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Yahui Zhou

</aside>

TL;DR: Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and on the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To this end, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the model's software engineering capability continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SoTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SoTA results for sub-32B parameter models. Finally, we distill a set of practical guidelines aimed at further advancing LLM-driven software engineering in both academic research and industrial practice.
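For context, each curated instance pairs a natural-language problem statement with the metadata needed to rebuild its runtime image and re-run its unit tests. The sketch below shows what one such record might look like; the field names follow the SWE-bench convention and the values are placeholders, so they may differ from the released Skywork-SWE schema.

```python
# Illustrative sketch of one SWE-bench-style task instance; field names and values
# are placeholders and may differ from the released Skywork-SWE dataset schema.
task_instance = {
    "instance_id": "owner__repo-1234",             # repository + pull-request identifier
    "repo": "owner/repo",                          # source GitHub repository
    "base_commit": "<commit sha before the fix>",  # checkout point baked into the runtime image
    "problem_statement": "Calling foo() with an empty list raises IndexError ...",  # NL task
    "patch": "diff --git a/pkg/foo.py b/pkg/foo.py ...",          # gold fix from the merged PR
    "test_patch": "diff --git a/tests/test_foo.py ...",           # tests added or updated by the PR
    "FAIL_TO_PASS": ["tests/test_foo.py::test_empty_list"],       # tests the fix must turn green
    "PASS_TO_PASS": ["tests/test_foo.py::test_basic"],            # regression tests that must stay green
    "image_name": "skywork-swe/owner__repo:1234",                 # dedicated runtime-environment image (illustrative)
}
```

Keeping the runtime-image reference alongside the task is what makes unit-test validation fully automated and reproducible.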

Hugging Face: https://huggingface.co/Skywork/Skywork-SWE-32B

Paper 📚: https://huggingface.co/Skywork/Skywork-SWE-32B/resolve/main/assets/Report.pdf
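The checkpoint is released in the standard Hugging Face format, so it can be loaded directly with `transformers` for a quick local smoke test. The snippet below is a minimal sketch that assumes enough GPU memory for a 32B-parameter model; the benchmark numbers above were obtained through the OpenHands agent framework, not through bare generation like this.

```python
# Minimal sketch: load Skywork-SWE-32B and generate a single reply.
# Assumes sufficient GPU memory for a 32B model and the accelerate package for
# device_map="auto"; serving the model with vLLM behind an OpenAI-compatible
# endpoint is a common alternative when driving an agent framework.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Skywork/Skywork-SWE-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "A unit test fails with IndexError on an empty list. How would you localize the bug?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```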


Introduction

Two core capabilities define the emerging potential of Large Language Model (LLM) agents: the ability to engage in multi-turn interactions and to reason over long-context inputs. Among real-world applications, software engineering (SWE) tasks, which involve localizing bugs, modifying source code, and validating fixes on real-world software issues collected from GitHub, stand out as a critical evaluation domain. Benchmark datasets such as SWE-bench [1] and SWE-bench Verified [11] reflect the growing interest and challenges in LLM-driven SWE. However, existing datasets face several limitations:

<aside> ⚠️

- Costly curation: constructing SWE task instances is notoriously time-consuming, relying on manual annotation for code file filtering and on hand-built runtime environments to execute and validate unit tests.
- Limited scale and diversity: as a result, most existing datasets contain only a few thousand GitHub-sourced instances, limiting both the volume and the repository diversity of available training data.

</aside>

To address these challenges, we propose Skywork-SWE, a novel approach that tackles the limitations of current SWE datasets and agent models. Our contributions are:

<aside> 🛠

- An incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
- The Skywork-SWE dataset: 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each paired with a natural-language task description and a dedicated runtime-environment image for automated unit-test validation.
- The Skywork-SWE-32B model, fine-tuned on over 8,000 runtime-validated trajectories, which achieves 38.0% pass@1 on SWE-bench Verified (47.0% with test-time scaling) and exhibits a data scaling trend with no sign of saturation.
- A set of practical guidelines for advancing LLM-driven software engineering in both academic research and industrial practice.

</aside>

Figure 1. Performance comparison among recent advanced approaches using OpenHands [6] on SWE-bench Verified. With the incorporation of test-time scaling (TTS) techniques, Skywork-SWE-32B achieves 47.0% accuracy, surpassing all the peers.


Automated Data Curation Pipeline

To address the limitations of existing SWE datasets, we developed an efficient and automated data curation pipeline that systematically scales both the volume and diversity of our Skywork-SWE dataset. This pipeline ensures high-quality, rigorously validated training instances by combining broad coverage of GitHub repositories with robust reproducibility. Our three-stage data collection pipeline, along with the data flow across four key hierarchical filtering steps, is illustrated in Figure 2 and Figure 3, respectively.
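To make the data flow concrete, the sketch below mirrors this structure with hypothetical helper callables: the pre-filtering checks correspond to the Data Collection & Pre-filtering stage, and the final execution-based check corresponds to the Environment Setup & Execution-based Validation stage. The helper functions and the exact filtering criteria are illustrative, not the actual Skywork-SWE implementation.

```python
# Hypothetical sketch of the hierarchical filtering flow. The helper callables
# (crawl_merged_prs, build_runtime_image, tests_flip_from_fail_to_pass) are
# placeholders for illustration, not the actual Skywork-SWE implementation.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

@dataclass
class Candidate:
    repo: str         # e.g. "owner/repo"
    pr_number: int    # merged pull request that closes a GitHub issue
    issue_text: str   # natural-language problem statement from the linked issue
    patch: str        # code changes from the PR (the gold fix)
    test_patch: str   # test changes from the PR

def curate(
    repos: Iterable[str],
    crawl_merged_prs: Callable[[str], Iterable[Candidate]],
    build_runtime_image: Callable[[Candidate], Optional[str]],
    tests_flip_from_fail_to_pass: Callable[[str, Candidate], bool],
) -> List[Candidate]:
    kept: List[Candidate] = []
    for repo in repos:
        # Pre-filtering: keep only merged PRs that resolve a linked issue,
        # carry a usable natural-language description, and touch unit tests.
        for cand in crawl_merged_prs(repo):
            if not cand.issue_text.strip():
                continue
            if not cand.test_patch:
                continue
            # Execution-based validation: build the dedicated runtime image and
            # retain the instance only if its targeted tests fail before the
            # gold patch and pass after it.
            image = build_runtime_image(cand)
            if image is None:
                continue
            if tests_flip_from_fail_to_pass(image, cand):
                kept.append(cand)
    return kept
```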

Figure 2. Overview of the three-stage Skywork-SWE data collection pipeline.


Figure 3. Visualization of data flow across four key hierarchical filtering steps in our data collection pipeline. The first three steps belong to the Data Collection & Pre-filtering stage in Figure 2, while the last step corresponds to the Environment Setup & Execution-based Validation stage.

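The last step in Figure 3 is where the dedicated runtime image comes into play: an instance is retained only if its targeted unit tests fail at the base commit and pass once the gold patch is applied. The sketch below illustrates such a check via the Docker CLI; the image tag, test command, patch path, and the assumption that the container's working directory is the repository checkout are all illustrative rather than the exact Skywork-SWE setup.

```python
# Illustrative sketch of execution-based validation inside a dedicated Docker image.
# Image tag, test command, and patch path are placeholders; the container's working
# directory is assumed to be the repository checkout at the instance's base commit.
import subprocess
from typing import Optional

def tests_pass(image: str, test_cmd: str, patch_file: Optional[str] = None) -> bool:
    """Run the test command inside the container, optionally applying a patch first."""
    cmd = ["docker", "run", "--rm"]
    script = test_cmd
    if patch_file is not None:
        cmd += ["-v", f"{patch_file}:/tmp/fix.patch:ro"]      # mount the patch read-only
        script = "git apply /tmp/fix.patch && " + test_cmd
    cmd += [image, "bash", "-lc", script]
    return subprocess.run(cmd, capture_output=True).returncode == 0

# A candidate is kept only if the targeted tests fail without the fix and pass with it.
image = "skywork-swe/owner__repo:1234"                         # illustrative image tag
test_cmd = "python -m pytest tests/test_foo.py::test_empty_list -q"
gold_patch = "/abs/path/to/fix.patch"                          # must be absolute for the -v mount

is_valid = (not tests_pass(image, test_cmd)) and tests_pass(image, test_cmd, gold_patch)
```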

A. Data Collection & Pre-filtering

This initial stage focuses on gathering and refining raw data from GitHub. It involves: