Scaling Open-Source Language Models with Longtermism
https://arxiv.org/pdf/2401.02954#page=16.14
Considering the redundancy in the parameter space, we regarded the parameters used by models whose generalization error exceeded the minimum by no more than 0.25% as near-optimal hyperparameters. We then fitted the batch size 𝐵 and learning rate 𝜂 with respect to the compute budget 𝐶.

First observe that the feasible region of the hyperparameters is large (left figure); then apply the 0.25% threshold to fit the corresponding scaling law, and finally verify it (right figure).
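A minimal sketch of this kind of fit (not the paper's code): collect the (C, η, B) triples from runs inside the 0.25% threshold and fit power laws η_opt = a·C^b and B_opt = a·C^b by least squares in log-log space. The data points and helper names below are hypothetical.

```python
import numpy as np

def fit_power_law(C, y):
    """Fit y = a * C^b by linear regression in log-log space; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(C), np.log(y), 1)
    return np.exp(intercept), slope

# Hypothetical near-optimal (compute, learning rate, batch size) points,
# i.e. runs whose generalization error is within 0.25% of the per-budget minimum.
C   = np.array([1e17, 3e17, 1e18, 3e18, 1e19, 3e19, 1e20, 3e20])
eta = np.array([2.4e-3, 2.1e-3, 1.8e-3, 1.6e-3, 1.4e-3, 1.2e-3, 1.1e-3, 9.5e-4])
B   = np.array([3.0e5, 4.5e5, 7.0e5, 1.0e6, 1.5e6, 2.2e6, 3.3e6, 4.8e6])

a_eta, b_eta = fit_power_law(C, eta)   # expect a small negative exponent
a_B,   b_B   = fit_power_law(C, B)     # expect a positive exponent
print(f"eta_opt ~ {a_eta:.3g} * C^{b_eta:.3f}")
print(f"B_opt   ~ {a_B:.3g} * C^{b_B:.3f}")
```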

Defining M yields a better formula, C = M·D; for the detailed derivation, see https://zhuanlan.zhihu.com/p/667489780
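As I read the paper, M is the non-embedding FLOPs per token, so under standard transformer FLOP accounting (my notation; cross-check with the derivation linked above) the budget decomposes as:

```latex
% n_layer: layers, d_model: hidden width, l_seq: sequence length, D: training tokens
C = M \cdot D, \qquad
M \approx 72\, n_{\mathrm{layer}} d_{\mathrm{model}}^{2}
        + 12\, n_{\mathrm{layer}} d_{\mathrm{model}} l_{\mathrm{seq}}
```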

As can be seen, the error of the original 6N1 and 6N2 approximations is still large
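A rough, hypothetical comparison of the three per-token compute estimates for one illustrative configuration (the config and the parameter-counting approximations are mine, not taken from the paper; how far 6N1 and 6N2 deviate from M depends on model shape, vocabulary size, and sequence length):

```python
# Illustrative config, roughly 7B-scale; not a configuration from the paper.
n_layer, d_model, l_seq, n_vocab = 30, 4096, 4096, 100_000

N1 = 12 * n_layer * d_model**2                                   # approx. non-embedding params
N2 = N1 + n_vocab * d_model                                      # + embedding params (rough)
M  = 72 * n_layer * d_model**2 + 12 * n_layer * d_model * l_seq  # non-embedding FLOPs/token

for name, val in [("6*N1", 6 * N1), ("6*N2", 6 * N2), ("M", M)]:
    print(f"{name:>4} = {val:.3e}  ({val / M - 1:+.1%} vs M)")
```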

8 compute budgets × ~10 model/data designs per budget, with hyperparameters set according to the hyperparameter scaling law
The generalization error was calculated on an independent validation set, distributed similarly to the training set and containing 100M tokens.
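Putting the two notes above together, a reconstruction of the experiment grid could look like the sketch below; the budget range, the allocation sweep, and the power-law coefficients are illustrative placeholders, not the paper's values.

```python
import numpy as np

budgets = np.logspace(17, 20, 8)            # 8 compute budgets in FLOPs (assumed range)
alloc_fracs = np.logspace(-0.5, 0.5, 10)    # ~10 model-vs-data allocations around a center

def hyperparams(C, a_eta=0.3, b_eta=-0.12, a_B=0.3, b_B=0.33):
    """Learning rate / batch size from the fitted power laws (coefficients illustrative)."""
    return a_eta * C**b_eta, a_B * C**b_B

runs = []
for C in budgets:
    for f in alloc_fracs:
        M = f * np.sqrt(C)      # FLOPs/token for this allocation (illustrative center)
        D = C / M               # training tokens so that M * D = C holds exactly
        eta, B = hyperparams(C)
        # train the model defined by M for D tokens with (eta, B), then score its
        # generalization error on the independent 100M-token validation set
        runs.append({"C": C, "M": M, "D": D, "lr": eta, "batch": B})

print(len(runs), "runs, e.g.", runs[0])
```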

