https://arxiv.org/pdf/2401.14196

Data Collection

The training corpus is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus.

GitHub Data Crawling and Filtering

Firstly, we filter out files with an average line length exceeding 100 characters or a maximum line length exceeding 1000 characters. (line-length control)

Additionally, we remove files with fewer than 25% alphabetic characters. (file-level filter)
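The two filters above can be sketched as simple predicates over a file's text; the thresholds come from the paper, while the function names are illustrative:

```python
def passes_line_length(text: str, max_avg: int = 100, max_line: int = 1000) -> bool:
    """Reject files whose average line length exceeds 100 characters
    or whose longest line exceeds 1000 characters."""
    lines = text.splitlines()
    if not lines:
        return False
    if max(len(l) for l in lines) > max_line:
        return False
    return sum(len(l) for l in lines) / len(lines) <= max_avg

def passes_alpha_ratio(text: str, min_ratio: float = 0.25) -> bool:
    """Reject files with fewer than 25% alphabetic characters."""
    if not text:
        return False
    return sum(c.isalpha() for c in text) / len(text) >= min_ratio
```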

Except for the XSLT programming language, we further filter out files where the string "<?xml version=" appears in the first 100 characters.

For HTML files, we consider the ratio of visible text to HTML code. We retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters.

For JSON and YAML files, which typically contain more data, we only keep files with a character count between 50 and 5000. This effectively removes most data-heavy files.
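The three format-specific rules can be sketched as follows. The HTML visible-text extraction here uses Python's stdlib `HTMLParser` as a stand-in for whatever the authors actually used, and the function names are illustrative:

```python
from html.parser import HTMLParser

def passes_xml_filter(text: str, lang: str) -> bool:
    """Drop files with an XML declaration in the first 100 characters,
    except for XSLT sources."""
    if lang == "xslt":
        return True
    return "<?xml version=" not in text[:100]

class _TextExtractor(HTMLParser):
    """Collects the visible (non-tag) text of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def passes_html_filter(text: str) -> bool:
    """Keep HTML files whose visible text is at least 20% of the code
    and no shorter than 100 characters."""
    parser = _TextExtractor()
    parser.feed(text)
    visible = "".join(parser.chunks).strip()
    return len(visible) >= 100 and len(visible) / max(len(text), 1) >= 0.20

def passes_length_filter(text: str) -> bool:
    """Keep JSON/YAML files between 50 and 5000 characters."""
    return 50 <= len(text) <= 5000
```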

Dependency Parsing

The files in each repository are reordered into a format where every file depends only on the files that precede it (a topological order over the dependency graph).


To incorporate file path information, a comment indicating the file’s path is added at the beginning of each file.

The algorithm concludes by returning a list of these sorted sequences, and each sequence’s files are concatenated to form a single training sample.
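The pipeline above can be sketched for Python repositories: extract intra-repo dependencies, topologically sort the files, prepend a path comment to each, and concatenate them into one sample. The regex-based import extraction is a simplification of real dependency parsing:

```python
import re
from graphlib import TopologicalSorter

def build_sample(files: dict[str, str]) -> str:
    """files maps path -> source; returns one concatenated training sample
    in which every file appears after the files it depends on."""
    deps = {path: set() for path in files}
    for path, src in files.items():
        for m in re.finditer(r"^import (\S+)", src, re.M):
            target = m.group(1).replace(".", "/") + ".py"
            if target in files:
                deps[path].add(target)  # keep intra-repo dependencies only
    order = TopologicalSorter(deps).static_order()
    # Prepend a path comment to each file, then join in dependency order.
    return "\n".join(f"# {path}\n{files[path]}" for path in order)
```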

Repo-Level Deduplication

Quality Screening and Decontamination

n-gram filtering is used to avoid test-set contamination.
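A minimal sketch of n-gram decontamination: any training document sharing an n-gram with the benchmark test data is dropped. The n-gram size of 10 and whitespace tokenization are illustrative choices, not necessarily the paper's exact settings:

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All whitespace-token n-grams of a document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc: str, test_ngrams: set[tuple[str, ...]], n: int = 10) -> bool:
    """True if the document shares any n-gram with the test set."""
    return not ngrams(doc, n).isdisjoint(test_ngrams)
```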

Training Policy

Training Strategy

Two training objectives are designed: next-token prediction and fill-in-the-middle (FIM).
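The FIM objective can be sketched as a data transformation: a document is split into prefix/middle/suffix and reordered into prefix-suffix-middle (PSM) form so the model learns to infill. The sentinel token names below are placeholders, not the paper's actual vocabulary:

```python
import random

def to_fim_psm(doc: str, rng: random.Random) -> str:
    """Rearrange doc into prefix-suffix-middle (PSM) order with sentinels."""
    # Pick two distinct cut points; the span between them becomes the "hole".
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<FIM_BEGIN>{prefix}<FIM_HOLE>{suffix}<FIM_END>{middle}"
```

Next-token prediction then applies unchanged to the transformed sequence, so both objectives share one training loop.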