Classification Problems
The Softmax
Purpose: Transforms a vector of real numbers into a probability distribution, where each value represents the probability of a specific class or outcome.
Usage:
Final activation function in neural networks for multi-class classification tasks.
Paired with cross-entropy loss, the standard objective for multi-class classification, which the optimizer minimizes during training.
Process:
Exponentiation: Applies the exponential function (e^x) to each element of the input vector, emphasizing larger values.
Normalization: Divides each exponentiated value by the sum of all exponentiated values, ensuring the output values add up to 1, forming a valid probability distribution.
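The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `softmax` and the sample input are chosen here, and subtracting the maximum before exponentiating is an added numerical-stability step that the notes do not mention (it leaves the result unchanged because the common factor cancels in the ratio).

```python
import numpy as np

def softmax(o):
    """Map a 1-D vector of real scores to a probability distribution."""
    o = np.asarray(o, dtype=float)
    # Exponentiation. Subtracting max(o) is an added stability trick:
    # it avoids overflow in exp and cancels out after normalization.
    exp_o = np.exp(o - np.max(o))
    # Normalization: divide by the sum so the outputs add up to 1.
    return exp_o / exp_o.sum()

probs = softmax([1.0, 2.0, 3.0])
print(probs)        # roughly [0.09, 0.245, 0.665]
print(probs.sum())  # 1.0 up to floating-point rounding
```

Every output is positive, the outputs sum to 1, and the largest input receives the largest probability.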
Roles:
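Non-negativity: Exponentiation makes every output strictly positive, so each value can be read as a probability.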
Amplification: Exaggerates differences between input values, making larger values significantly more prominent.
Non-linearity: Introduces non-linearity into the neural network.
Optimization: Softmax is differentiable everywhere, so gradient-based optimization algorithms can be used to train the model.
Cross-Entropy Loss
- Entropy
- \[H[P] = - \sum_j P(j) \log P(j)\]
P(j): Probability of j.
-log P(j): Minimum number of bits needed to encode outcome j under an optimal code (taking the logarithm base 2).
- Cross-Entropy
- \[H(P, Q) = - \sum_j P(j) \log Q(j)\]
P(j): Real probability of j.
Q(j): Predicted probability of j.
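Encoding outcome j with a code optimized for P costs -log P(j) bits; with a code optimized for Q it costs -log Q(j) bits.
-log Q(j) >= -log P(j) need not hold for each individual j, but it holds on average under P (Gibbs' inequality): \(\sum_j P(j) (-\log Q(j)) \geq \sum_j P(j) (-\log P(j))\)
H(P, Q) >= H(P) >= 0, with H(P, Q) = H(P) exactly when Q = P.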
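As a small numerical check of the entropy and cross-entropy definitions, the sketch below evaluates both for two hand-picked distributions. The function names and the example distributions `p` and `q` are illustrative assumptions, and the logarithm is taken base 2 so that the results are in bits. It also prints per-outcome code lengths, which shows that the relation between H(P, Q) and H(P) is a statement about the average under P.

```python
import numpy as np

def entropy(p):
    """H[P] = -sum_j P(j) log2 P(j), in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # convention: terms with P(j) = 0 contribute 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(P, Q) = -sum_j P(j) log2 Q(j), in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

p = [0.5, 0.25, 0.25]   # true distribution (example values)
q = [0.25, 0.25, 0.5]   # predicted distribution (example values)

print(entropy(p))           # 1.5 bits
print(cross_entropy(p, q))  # 1.75 bits, larger than H(P)
print(-np.log2(p))          # per-outcome code lengths under P: 1, 2, 2
print(-np.log2(q))          # per-outcome code lengths under Q: 2, 2, 1
```

For the third outcome the code based on Q is shorter than the one based on P, yet the P-weighted average is still larger, which is what H(P, Q) >= H(P) asserts.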
- Cross-Entropy Loss
- \[l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j\]
\(\mathbf{y}\): Real probability distribution.
\(\hat{\mathbf{y}}\): Predicted probability distribution.
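\(l(\mathbf{y}, \hat{\mathbf{y}})\) >= \(H[\mathbf{y}]\) >= 0, where \(H[\mathbf{y}]\) is the entropy of \(\mathbf{y}\); for a one-hot label, \(H[\mathbf{y}] = 0\).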
- Softmax
- \[\hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}\]
- Softmax and Cross-Entropy Loss
- \[\begin{split}\begin{aligned} l(\mathbf{y}, \hat{\mathbf{y}}) &= - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\ &= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\ &= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j. \end{aligned}\end{split}\]
The last step uses \(\sum_{j=1}^q y_j = 1\), since \(\mathbf{y}\) is a probability distribution.
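The final line of the derivation lets the loss be computed directly from the logits, without forming the softmax probabilities first. The sketch below compares that form against the direct definition; the function name `cross_entropy_from_logits` and the sample values are assumptions made for illustration, and the max-shift inside the log-sum-exp is an added stability step not taken from the notes.

```python
import numpy as np

def cross_entropy_from_logits(o, y):
    """l = log(sum_k exp(o_k)) - sum_j y_j o_j, assuming sum_j y_j = 1."""
    o = np.asarray(o, dtype=float)
    y = np.asarray(y, dtype=float)
    # Stable log-sum-exp: log sum exp(o) = m + log sum exp(o - m).
    m = np.max(o)
    log_sum_exp = m + np.log(np.sum(np.exp(o - m)))
    return log_sum_exp - np.dot(y, o)

o = np.array([2.0, -1.0, 0.5])  # example logits
y = np.array([1.0, 0.0, 0.0])   # example one-hot label

# Direct definition: softmax first, then -sum_j y_j log(y_hat_j).
y_hat = np.exp(o) / np.exp(o).sum()
direct = -np.sum(y * np.log(y_hat))

print(cross_entropy_from_logits(o, y), direct)  # the two values agree
```

Working from logits avoids taking the logarithm of a probability that has underflowed to zero, which is one reason this form is often preferred in practice.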
- Derivative with respect to any logit \(o_j\)
- \[\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j\]
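Gradient for linear regression (squared loss \(\frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2\), with respect to the prediction): \(\hat{\mathbf{y}} - \mathbf{y}\)
Gradient for softmax regression (cross-entropy loss, with respect to the logits): \(\mathrm{softmax}(\mathbf{o}) - \mathbf{y}\)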
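The closed-form gradient can be sanity-checked against a central finite-difference approximation. This is a sketch under stated assumptions: the helper names (`softmax`, `loss`, `numerical_grad`), the step size `eps`, and the sample logits and label are all chosen here for illustration.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))  # max-shift added for numerical stability
    return e / e.sum()

def loss(o, y):
    """Cross-entropy between label y and softmax(o)."""
    return -np.sum(y * np.log(softmax(o)))

def numerical_grad(o, y, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grad = np.zeros_like(o)
    for j in range(o.size):
        step = np.zeros_like(o)
        step[j] = eps
        grad[j] = (loss(o + step, y) - loss(o - step, y)) / (2 * eps)
    return grad

o = np.array([0.2, -1.3, 2.1, 0.0])  # example logits
y = np.array([0.0, 0.0, 1.0, 0.0])   # example one-hot label

analytic = softmax(o) - y            # closed-form gradient
numeric = numerical_grad(o, y)
print(np.max(np.abs(analytic - numeric)))  # small, limited by finite-difference error
```

The gradient components sum to zero, because both softmax(o) and y sum to 1.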