Rylan Schaeffer

Kernel Papers

14 February 2023

Chinchilla Scaling Laws (Paper Notes)

by Rylan Schaeffer

Main Claims

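Use empirical scaling laws to pick the compute-optimal model size \(N\) (parameters) and dataset size \(D\) (training tokens) for a given FLOP budget \(C\):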

\[N_{opt}(C), D_{opt}(C) = \arg\min_{N, D \ \text{s.t.} \ \text{FLOPs}(N, D) = C} L(N, D)\]

\[N_{opt} \propto C^a\]

\[D_{opt} \propto C^b\]

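where \(a = 1 - b\). This is because \(C \approx 6 N D\): the product \(N_{opt} D_{opt}\) has to grow linearly with \(C\), so \(C^{a + b} \propto C\) and \(a + b = 1\).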
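A minimal sketch of what the constraint means in practice, assuming the \(C \approx 6 N D\) approximation above. The function names are hypothetical, the reference point (70B parameters, 1.4T tokens) is Chinchilla's reported size, and the exponent `a = 0.5` is an illustrative choice rather than a fitted value.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def scale_allocation(n_params, n_tokens, compute_multiplier, a):
    """Rescale (N, D) along N ~ C^a and D ~ C^b with b = 1 - a."""
    b = 1.0 - a
    return (n_params * compute_multiplier ** a,
            n_tokens * compute_multiplier ** b)


# Reference point: 70B parameters trained on 1.4T tokens.
n0, d0 = 70e9, 1.4e12
c0 = train_flops(n0, d0)

# Assumed exponent (a = 0.5) purely for illustration.
n1, d1 = scale_allocation(n0, d0, compute_multiplier=10.0, a=0.5)
c1 = train_flops(n1, d1)

print(f"C0 = {c0:.3e} FLOPs")
print(f"10x compute: N x{n1 / n0:.2f}, D x{d1 / d0:.2f}, C x{c1 / c0:.2f}")
```

Whatever value of `a` is plugged in, the rescaled pair still spends exactly the multiplied budget, which is all that \(a + b = 1\) says.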

Q: Why use 3 different approaches?

Q: Why is Kaplan et al.'s result so different?

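Q: What are the 3 different approaches?

  1. For each of several fixed model sizes N, vary the number of training tokens D

  2. IsoFLOP profiles: for each of several fixed FLOP budgets C, vary N and find the model with minimal loss

Note: some sort of suspicious curve bending for larger FLOPs

  3. Fit a parametric loss function \(L(N, D)\) to all the runs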
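A sketch of how approaches (2) and (3) connect, assuming the parametric form \(L(N, D) = E + A / N^{\alpha} + B / D^{\beta}\) and the rounded constants reported in the Chinchilla paper (treat them as illustrative). The function names are hypothetical. Minimizing \(L\) along the constraint \(D = C / (6N)\) gives a closed-form \(N_{opt}\), which a brute-force isoFLOP-style scan over \(N\) should reproduce.

```python
import math

# Rounded constants as reported for the parametric fit (illustrative only).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def parametric_loss(n_params, n_tokens):
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def n_opt_closed_form(compute):
    """Minimizer of L(N, C / (6N)) over N, from setting dL/dN = 0."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    return g * (compute / 6.0) ** (BETA / (ALPHA + BETA))


def n_opt_grid(compute, steps=4001):
    """Brute-force check: scan N on a log grid at a fixed FLOP budget."""
    best_n, best_loss = None, math.inf
    for i in range(steps):
        n = 10 ** (7 + 6 * i / (steps - 1))  # 1e7 .. 1e13 parameters
        loss = parametric_loss(n, compute / (6.0 * n))
        if loss < best_loss:
            best_n, best_loss = n, loss
    return best_n


compute = 5.88e23
print(f"closed form: {n_opt_closed_form(compute):.3e} parameters")
print(f"grid search: {n_opt_grid(compute):.3e} parameters")
print(f"implied exponent a = {BETA / (ALPHA + BETA):.3f}")
```

With these rounded constants the implied exponent is \(a = \beta / (\alpha + \beta) \approx 0.45\), so the parametric fit puts relatively more of the budget into data than an even \(a = b = 0.5\) split would.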

Q: Why is (3) different from (1) and (2)?

The Huber loss is robust to outliers, and the smaller models are the outliers. So approach (3) is effectively paying more attention to the really big runs done at the end.
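A minimal sketch of why Huber down-weights outliers, compared with squared error. The residuals and the threshold `delta = 0.1` are made-up values for illustration, not the paper's settings. The key point is that the Huber gradient is clipped at `delta`, so a run that fits badly cannot pull on the parameters much harder than a run that fits well.

```python
def huber(residual, delta):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def huber_grad(residual, delta):
    """Derivative of huber() w.r.t. the residual: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, residual))


def squared(residual):
    """Squared-error loss with the same 1/2 scaling."""
    return 0.5 * residual * residual


# Hypothetical residuals: three well-fit runs and one outlier.
residuals = [0.01, -0.02, 0.015, 1.0]
delta = 0.1

for r in residuals:
    print(f"r = {r:+.3f}  squared grad = {r:+.3f}  "
          f"huber grad = {huber_grad(r, delta):+.3f}")
```

Under squared error the outlier's gradient is 50 times that of the largest inlier here; under Huber it is capped at `delta`, so the well-fit runs keep a meaningful say in the fit.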

Loss curves start bending: fit using smallest runs (blue), fit using medium runs (teal), fit using largest runs (red). The curve is flattening.

It’s weird that they posit a fit and say the fit is great, then say the fit isn’t great because of outliers and then propose another fit.

Also, these confidence intervals are very narrow.


Can we get all the data points they plotted?

Q: Given a fixed FLOP budget, how should we trade off compute spent fitting the scaling law against compute spent training the big model?

tags: machine-learning - scaling-laws - 2023