Prior Correction
原始CTR为:
\[CTR = \sigma\left( \beta \right)\]校准后为:
\[CTR' = \sigma\left\{ \beta - ln\left[ \left(\frac{1-\tau}{\tau}\right) \left(\frac{\bar{y}}{1-\bar{y}}\right) \right]\right\} \approx \sigma\left[ \beta - ln\left( \frac{\bar{y}}{\tau}\right) \right]\]其中,\(\tau\)是真实正样本比例,\(\bar{y}\)是训练集中正样本比例
此方法认为CTR由于采样比例以及正负样本比例的原因引起误差,一般用于LR的CTR校准
参考:Logistic Regression in Rare Events Data
facebook用了此方法:Practical Lessons from Predicting Clicks on Ads at Facebook
Weighting
直接调整class weights
Isotonic Regression (保序回归)
流程(生成reliability diagram,运行Isotonic Regression):
- 准备一份验证集(为了防止过拟合)
- 将点击率[0, 1]的区间划分为N个桶,每个桶长度小童,比如\(N = 10^5\)
- 将点击率放入这些桶中
- 对于每一个桶,都计算其真实的点击率,做一个映射,这N个桶构成了reliability diagram
- 使用生成的reliability diagram运行Isotonic Regression(比如PAV算法)
pool adjacent violators algorithm (PAVA, PAV):
待补充
google用了此方法:Ad click prediction: a view from the trenches