Maximum likelihood interpretation

\(\hat{y}\) is the estimated probability of \(y=1\) given \(x\): \(\hat{y} = P(y=1\rvert x)\)

So \(1-\hat{y} = P(y=0\rvert x)\)

Combining the two cases, \(P(y\rvert x)=\hat{y}^y (1-\hat{y})^{1-y}\)

Then \(\log P(y\rvert x)=y\log\hat{y}+(1-y)\log(1-\hat{y})\)

For all samples, maximum likelihood estimation gives:

\[L=\prod_i P(y^{(i)}\rvert x^{(i)})\]

Taking the log-likelihood,

\[l=\log\Big(\prod_i P(y^{(i)}\rvert x^{(i)})\Big)=\sum_i \Big[\, y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)}) \,\Big]\]

Since we want to maximize \(l\), we negate it to get the loss function:

\[J=-\sum_i \Big[\, y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)}) \,\Big]\]
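
As a sanity check, here is a minimal NumPy sketch of this loss; the labels and predicted probabilities below are made-up toy values:

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # J = -sum[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])               # toy labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])   # toy predicted probabilities
print(binary_cross_entropy(y, y_hat))    # smaller when y_hat agrees with y
```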

Derivatives

\[z = Wx + b\]

\[\hat{y} = \sigma(z) = \sigma(Wx + b)\]

\[J = -y\log\hat{y} - (1-y)\log(1-\hat{y})\]

\[\frac{\partial J}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}\]

\[\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})\]

\[\frac{\partial J}{\partial z} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y\]

\[\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial W} = (\hat{y} - y)x\]

\[\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial b} = \hat{y} - y\]
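
These gradients are easy to verify numerically. A minimal single-sample sketch (the variable names and random values here are illustrative, not from the derivation above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, x, y):
    y_hat = sigmoid(W @ x + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

rng = np.random.default_rng(0)
W, x = rng.normal(size=3), rng.normal(size=3)
b, y = 0.1, 1.0

y_hat = sigmoid(W @ x + b)
analytic_dW = (y_hat - y) * x          # dJ/dW = (y_hat - y) x

eps = 1e-6                             # finite-difference check of one component
W_eps = W.copy(); W_eps[0] += eps
numeric_dW0 = (loss(W_eps, b, x, y) - loss(W, b, x, y)) / eps
print(analytic_dW[0], numeric_dW0)     # should agree to roughly 1e-5
```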

Softmax

\[\begin{aligned} &\hat{y}_j = \frac{e^{z_j}}{\sum\limits^K_{k=1} e^{z_k}} \\ &L = -\sum\limits^M_{i=1} \sum\limits^K_{k=1} 1\{y^{(i)}=k\} \ln\hat{y}_k^{(i)} \\ \end{aligned}\]

For a single sample whose true label is class \(j\), differentiating with respect to \(z_j\):

\[\begin{aligned} &\frac{ \partial{L_i} }{ \partial{\hat{y}_j} } = -\frac{1}{ \hat{y}_j } \\ &\frac{ \partial{\hat{y}_j} }{ \partial{z_j} } = \frac{ \partial }{ \partial{z_j} } \frac{ e^{z_j} }{ \sum\limits^K_{k=1} e^{z_k} } \\ &\qquad = \frac{ e^{z_j} \cdot \sum\limits^K_{k=1} e^{z_k} - e^{z_j} \cdot e^{z_j} }{ {(\sum\limits^K_{k=1} e^{z_k})}^2 } \\ &\qquad = \hat{y}_j \cdot (1 - \hat{y}_j) \\ &\frac{ \partial{L_i} }{ \partial{z_j} } = \hat{y}_j - 1 \\ \end{aligned}\]

For \(k \neq j\), \(\frac{\partial \hat{y}_j}{\partial z_k} = -\hat{y}_j \hat{y}_k\), so \(\frac{\partial L_i}{\partial z_k} = \hat{y}_k\); in vector form the gradient is \(\hat{y} - \mathrm{onehot}(y)\).
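
The result \(\hat{y} - \mathrm{onehot}(y)\) can also be checked with finite differences; the logits below are toy values and the helper names are my own:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift by max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])          # toy logits
j = 1                                  # suppose the true class is j = 1

one_hot = np.zeros_like(z); one_hot[j] = 1.0
analytic = softmax(z) - one_hot        # dL_i/dz = y_hat - onehot(y)

def loss(z):                           # L_i = -ln y_hat_j
    return -np.log(softmax(z)[j])

eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(3)[k]) - loss(z)) / eps
                    for k in range(3)])
print(analytic)
print(numeric)                         # should match closely
```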

Softmax and Sigmoid

Sigmoid: \(\hat{y} = \frac{1}{1+e^{-z}}\)

Softmax: \(\hat{y}_j = \frac{e^{z_j}}{\sum_k e^{z_k}}\)

For softmax with two classes,

\[\hat{y}_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{e^{z_0-z_1} + 1}\]

So sigmoid is the special case of softmax with two classes, but two-class softmax uses twice as many parameters as sigmoid (one weight vector per class instead of a single one)
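
A short numeric check of this equivalence (toy logits):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z0, z1 = 0.3, 1.2                      # toy logits for classes 0 and 1
p_softmax = softmax(np.array([z0, z1]))[1]
p_sigmoid = sigmoid(z1 - z0)           # sigmoid of the logit difference
print(p_softmax, p_sigmoid)            # identical
```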

Forms of the logistic loss

When the labels are 0 and 1: \(L = -y\log\hat{y} -(1-y)\log(1-\hat{y})\)

When the labels are -1 and 1, let \(y_r\) be the raw score (logit), so that \(\hat{y} = \sigma(y_r)\); then \(L = -\log\sigma(yy_r)\), or equivalently \(L = \log(1+e^{-yy_r})\)

For comparison, the hinge loss (labels -1 and 1) is \(L = \max(1-yy_r, 0)\), applied to the raw score rather than the sigmoid output
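
The two label conventions give the same loss value, which a quick numeric check confirms (the score is a toy value and the variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_r = 0.8                              # toy raw score (logit)

y01 = 1                                # label in {0, 1}
y_hat = sigmoid(y_r)
loss_01 = -y01 * np.log(y_hat) - (1 - y01) * np.log(1 - y_hat)

ypm = 1                                # same example, label in {-1, 1}
loss_pm = np.log(1 + np.exp(-ypm * y_r))

print(loss_01, loss_pm)                # identical
print(max(1 - ypm * y_r, 0))           # hinge loss on the raw score, for comparison
```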