Measuring Goodhart’s law

Let’s study best-of-$n$ sampling more formally. Suppose we have some sample space $S$ (such as the set of possible question-answer pairs), some probability distribution $P$ over $S$, a true objective (or “reward”) $R_{\text{true}}:S\to\mathbb R$, and a proxy objective $R_{\text{proxy}}:S\to\mathbb R$. Let’s say that we somehow optimize $R_{\text{proxy}}$ and thereby obtain some new distribution $P^\prime$. Then:

The expectation $\mathbb E_{x^\prime\sim P^\prime}\left[R_{\text{true}}\left(x^\prime\right)\right]$ measures how well we have optimized the true objective.
The KL divergence $D_{\text{KL}}\left(P^\prime\parallel P\right)$ measures how much optimization we have done. For example, if $P^\prime$ is obtained by taking the first sample from $P$ that lies in some subset $S^\prime\subseteq S$, then this KL divergence is just the negative log probability that a sample from $P$ lies in $S^\prime$.
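The subset example follows in one line from the definition of KL divergence. In the sketch below, $P(S^\prime)$ denotes the probability that a sample from $P$ lies in $S^\prime$ (notation introduced here for illustration), and we assume $P(S^\prime)>0$. Taking the first sample that lands in $S^\prime$ is the same as conditioning on $S^\prime$, so $P^\prime(x)=P(x)/P(S^\prime)$ for $x\in S^\prime$ and $P^\prime(x)=0$ otherwise:

```latex
\[
D_{\text{KL}}\left(P^\prime\parallel P\right)
  = \mathbb E_{x^\prime\sim P^\prime}\left[\log\frac{P^\prime(x^\prime)}{P(x^\prime)}\right]
  = \mathbb E_{x^\prime\sim P^\prime}\left[\log\frac{1}{P(S^\prime)}\right]
  = -\log P(S^\prime).
\]
```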

It turns out that in the case of best-of-$n$ sampling, both of these quantities can be estimated efficiently using samples from $P$.

Let’s look at the expectation first. The naive approach is to use a Monte Carlo estimator: run best-of-$n$ sampling many times, measure the true objective on those samples, and average the results. However, there is a better estimator. If we have $N\geq n$ samples from $P$ overall, then we can simultaneously consider every possible subset of these samples of size $n$, weight each sample by the number of subsets for which it is the best according to the proxy objective, and then take the weighted average true objective score. This weight is just the binomial coefficient $\binom{k-1}{n-1}$, where $k$ is the rank of the sample under the proxy objective, from $1$ (worst) up to $N$ (best).

The sum of these weights is $\binom{N}{n}$, giving a proof of the hockey-stick identity. For a formal derivation of the estimator described here, see Appendix I of the WebGPT paper.
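The weighted estimator described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for WebGPT: the function name `best_of_n_estimate` and its argument names are hypothetical, and it assumes the proxy scores contain no ties.

```python
from math import comb


def best_of_n_estimate(proxy_scores, true_scores, n):
    """Estimate the expected true objective under best-of-n sampling.

    proxy_scores and true_scores hold the two objectives evaluated on the
    same N >= n i.i.d. samples from P. Assumes no ties in proxy_scores
    (hypothetical helper, written for illustration only).
    """
    N = len(proxy_scores)
    if len(true_scores) != N:
        raise ValueError("proxy_scores and true_scores must have equal length")
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")

    # Sort sample indices from worst to best proxy score, so that the
    # sample at position k - 1 has rank k (1 = worst, N = best).
    order = sorted(range(N), key=lambda i: proxy_scores[i])

    # The rank-k sample is the proxy-best element of comb(k-1, n-1) of the
    # comb(N, n) size-n subsets, so its normalized weight is their ratio.
    total_subsets = comb(N, n)
    return sum(
        comb(k - 1, n - 1) / total_subsets * true_scores[i]
        for k, i in enumerate(order, start=1)
    )
```

With `n = 1` every sample gets weight $1/N$ and the estimate reduces to the plain sample mean; with `n = N` all the weight falls on the single sample with the highest proxy score.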

As well as using samples more efficiently, this also allows us to reuse samples for different values of $n$.

As for the KL divergence, surprisingly, this turns out to have an exact formula that works for any continuous probability distribution $P$ (i.e., as long as $P$ has no point masses). One might naively guess that the answer is $\log n$, since best-of-$n$ is doing something like taking the top $\frac{1}{n}$ of the distribution, and this is roughly correct: the exact answer is $\log n-\frac{n-1}{n}$ (see the first footnote for a hint).
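The closed form can be checked numerically. The sketch below assumes the proxy score has a continuous distribution under $P$, so that its CDF value $u$ is uniform on $[0,1]$ and the best-of-$n$ distribution has density $n u^{n-1}$ relative to $P$. Both function names are hypothetical, and the midpoint rule is just one convenient way to approximate the integral.

```python
from math import log


def best_of_n_kl(n):
    """Closed-form KL divergence (in nats) between best-of-n and P."""
    return log(n) - (n - 1) / n


def best_of_n_kl_numeric(n, steps=100_000):
    """Midpoint-rule approximation of the same KL divergence.

    Integrates ratio * log(ratio) over u in [0, 1], where
    ratio = n * u**(n - 1) is the density of best-of-n relative to P,
    expressed in terms of the CDF value u of the proxy score.
    """
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) / steps
        ratio = n * u ** (n - 1)
        if ratio > 0.0:  # skip terms that underflow to zero
            total += ratio * log(ratio)
    return total / steps
```

For example, `best_of_n_kl(4)` is about 0.636 nats, noticeably less than the naive guess $\log 4\approx 1.386$.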

Together, these estimators allow us to easily analyze how the true objective varies with the amount of optimization applied to the proxy objective.

Here’s a real-life example from WebGPT:

Footnotes

Hint: express the PDF of the best-of-$n$ distribution as a function of both the PDF and the CDF of the original distribution.

Best-of-$n$ is not necessarily optimal in the information-theoretic sense, however. For example, if $P$ has a heavy right tail, then for any $x>0$ and any $\varepsilon>0$, there is a distribution $Q$ such that $\mathbb E_{y\sim Q}\left[y\right]>x$ and $D_{\text{KL}}\left(Q\parallel P\right)<\varepsilon$ (exercise).

Authors

Jacob Hilton, Leo Gao

Acknowledgments

Thanks to Suchir Balaji, Paul Christiano, William Guss, Vineet Kosaraju, John Schulman, Nisan Stiennon, Jeff Wu, and Daniel Ziegler for discussions related to the ideas in this post. Thanks to Greg Brockman, Jan Leike, Holly Mandel, John Schulman, and Jeff Wu for feedback on drafts. Thanks to Bianca Martin, Steve Dowling, Natalie Summers and Justin Jay Wang for communications and design.