https://arxiv.org/pdf/2401.14196
The training dataset is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus.
Firstly, we filter out files with an average line length exceeding 100 characters or a maximum line length surpassing 1000 characters. (line-length control)
Additionally, we remove files with fewer than 25% alphabetic characters. (file-level)
Except for the XSLT programming language, we further filter out files where the string "<?xml version=" appears in the first 100 characters.
For HTML files, we consider the ratio of visible text to HTML code. We retain files where the visible text constitutes at least 20% of the code and is no less than 100 characters.
For JSON and YAML files, which typically contain more data, we only keep files that have a character count ranging from 50 to 5000 characters. This effectively removes most data-heavy files.
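The filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' code: the function names, the `lang` labels, the choice to count visible text as non-whitespace characters outside `<script>` and `<style>`, and the choice to measure the 20% ratio against the total file length are all assumptions.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script>/<style> as a rough proxy for visible text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text_length(html):
    """Number of non-whitespace visible characters (assumed definition)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len("".join("".join(parser.parts).split()))


def keep_file(text, lang):
    """Return True if the file passes every heuristic rule listed above."""
    if not text:
        return False

    # Line-length control: average <= 100 and maximum <= 1000 characters.
    lengths = [len(line) for line in text.splitlines()]
    if sum(lengths) / len(lengths) > 100 or max(lengths) > 1000:
        return False

    # At least 25% of the characters must be alphabetic.
    if sum(ch.isalpha() for ch in text) / len(text) < 0.25:
        return False

    # XML declaration near the top, except for XSLT files.
    if lang != "xslt" and "<?xml version=" in text[:100]:
        return False

    # HTML: visible text must be >= 100 characters and >= 20% of the file.
    if lang == "html":
        visible = visible_text_length(text)
        if visible < 100 or visible / len(text) < 0.20:
            return False

    # JSON / YAML: keep only files of 50 to 5000 characters.
    if lang in ("json", "yaml") and not 50 <= len(text) <= 5000:
        return False

    return True
```

For example, `keep_file(source, "python")` applies only the generic rules, while `keep_file(source, "html")` also applies the visible-text check.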
Reorder the files so that each file depends only on the files that precede it.

To incorporate file path information, a comment indicating the file’s path is added at the beginning of each file.
The algorithm concludes by returning a list of these sorted sequences, and each sequence’s files are concatenated to form a single training sample.
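A rough sketch of the dependency-first ordering and concatenation follows. It is a simplification under stated assumptions, not the paper's algorithm: it treats one repository as a single sequence instead of returning one sequence per subgraph, it assumes the dependency map has already been extracted, it breaks cycles by picking the file with the fewest unresolved dependencies, and it writes the path as a `#` comment. The names `order_files` and `build_sample` are hypothetical.

```python
def order_files(deps):
    """Order files so that dependencies come before the files that use them.

    deps maps each file path to the set of paths it depends on.
    Dependencies outside the repository and self-references are ignored.
    """
    remaining = {
        f: {d for d in ds if d in deps and d != f} for f, ds in deps.items()
    }
    ordered = []
    while remaining:
        # Pick the file with the fewest unresolved dependencies (0 when the
        # graph is acyclic). Ties are broken by path for determinism.
        nxt = min(remaining, key=lambda f: (len(remaining[f]), f))
        ordered.append(nxt)
        del remaining[nxt]
        for ds in remaining.values():
            ds.discard(nxt)
    return ordered


def build_sample(deps, contents):
    """Concatenate files in dependency order, each prefixed by a path comment."""
    parts = []
    for path in order_files(deps):
        # "#" is an assumed comment marker; a real pipeline would choose the
        # marker per language.
        parts.append(f"# {path}\n{contents[path]}")
    return "\n".join(parts)
```

With `deps = {"main.py": {"utils.py"}, "utils.py": set()}`, the sample places `utils.py` before `main.py`, so the model sees a definition before its use.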
Use n-gram filtering to avoid test-set contamination.
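A minimal sketch of n-gram decontamination: a training file is dropped if it shares any n-gram with the test data. Whitespace tokenization, the default `n=10`, and the function names are assumptions made for illustration.

```python
def ngrams(text, n):
    """Set of all n-grams over whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_test_index(test_samples, n=10):
    """Union of the n-grams of every test sample."""
    index = set()
    for sample in test_samples:
        index |= ngrams(sample, n)
    return index


def is_contaminated(code, test_index, n=10):
    """True if the code shares at least one n-gram with the test data."""
    return not ngrams(code, n).isdisjoint(test_index)
```

Training files for which `is_contaminated` returns `True` would be excluded. The same `n` must be used for building the index and for checking.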
Design two training objectives.