Classification Problems

The Softmax

Purpose: Transforms a vector of real numbers into a probability distribution, where each value represents the probability of a specific class or outcome.
Usage:
- Final activation function in neural networks for multi-class classification tasks.
- Paired with cross-entropy loss, the standard objective for multi-class classification, which the optimizer minimizes during training.
Process:
- Exponentiation: Applies the exponential function (e^x) to each element of the input vector, emphasizing larger values.
- Normalization: Divides each exponentiated value by the sum of all exponentiated values, ensuring the output values add up to 1, forming a valid probability distribution.
Roles:
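- Non-negativity: Every output is strictly positive, because the exponential function is positive for any real input.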
- Amplification: Exaggerates differences between input values, making larger values significantly more prominent.
- Non-linearity: Introduces non-linearity into the neural network.
- Optimization: Softmax is differentiable everywhere, so gradient-based optimization algorithms can be applied during model training.
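
The two steps under "Process" can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `softmax` and the sample logits are chosen here for the example, and subtracting the maximum logit is an added numerical-stability step that does not change the result.

```python
import math

def softmax(logits):
    """Map a list of real numbers to a probability distribution."""
    # Assumption for stability: shift by the maximum so exp() cannot overflow.
    # The shift cancels in the ratio, so the output is unchanged.
    m = max(logits)
    # Exponentiation: every value becomes strictly positive.
    exps = [math.exp(o - m) for o in logits]
    # Normalization: divide by the sum so the outputs add up to 1.
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # hypothetical logits
print(probs)       # roughly [0.090, 0.245, 0.665]
print(sum(probs))  # 1.0 up to floating-point rounding
```

The logits differ by 1 at each step, but the resulting probabilities differ by a factor of e, which is the amplification effect listed under "Roles".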

Cross-Entropy Loss

Entropy

\[H[P] = - \sum_j P(j) \log P(j)\]

P(j): Probability of outcome j.
-log P(j): Minimum number of bits needed to encode outcome j under an optimal code for P (when the logarithm is base 2).

Cross-Entropy

\[H(P, Q) = - \sum_j P(j) \log Q(j)\]

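P(j): True probability of j.
Q(j): Predicted probability of j.
An optimal code for P spends -log P(j) bits on outcome j; a code built for Q spends -log Q(j) bits instead.
Averaged over P, the code built for Q is never shorter (Gibbs' inequality), even though -log Q(j) can be smaller than -log P(j) for an individual j.
H(P, Q) >= H[P] >= 0, with H(P, Q) = H[P] exactly when Q = P.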
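
As a numerical illustration of the entropy and cross-entropy formulas, the sketch below evaluates both for two small distributions. The distributions `p` and `q` are made up for the example, and natural logarithms are used, so the values are in nats rather than bits.

```python
import math

def entropy(p):
    """H[P] = -sum_j P(j) log P(j); terms with P(j) = 0 contribute 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_j P(j) log Q(j); assumes Q(j) > 0 wherever P(j) > 0."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.5, 0.25, 0.25]   # hypothetical true distribution
q = [0.25, 0.25, 0.5]   # hypothetical predicted distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
print(h_p, h_pq)            # about 1.040 and 1.213
print(cross_entropy(p, p))  # equals entropy(p): the bound is tight when Q = P
```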

Cross-Entropy Loss

\[l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j\]

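\(\mathbf{y}\): True probability distribution (a one-hot vector when the label is a single class).
\(\hat{\mathbf{y}}\): Predicted probability distribution.
\(l(\mathbf{y}, \hat{\mathbf{y}})\) >= \(H[\mathbf{y}]\) >= 0, where \(H[\mathbf{y}]\) is the entropy of the true distribution (0 for a one-hot \(\mathbf{y}\)).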
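
For a one-hot label, the sum in the loss keeps only the term of the true class, so the loss is just the negative log of the probability assigned to that class. A minimal sketch, with an invented label and prediction:

```python
import math

def cross_entropy_loss(y, y_hat):
    """l(y, y_hat) = -sum_j y_j log(y_hat_j); terms with y_j = 0 are skipped."""
    return -sum(yj * math.log(yhj) for yj, yhj in zip(y, y_hat) if yj > 0)

y = [0.0, 1.0, 0.0]      # one-hot label: the true class is index 1
y_hat = [0.1, 0.7, 0.2]  # hypothetical predicted distribution

loss = cross_entropy_loss(y, y_hat)
print(loss)              # same value as -log(0.7)
print(-math.log(0.7))
```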

Softmax

\[\hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}\]

Softmax and Cross-Entropy Loss

\[\begin{split}\begin{aligned} l(\mathbf{y}, \hat{\mathbf{y}}) &= - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\ &= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\ &= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j. \end{aligned}\end{split}\]
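
The last step of the derivation relies on \(\sum_{j=1}^q y_j = 1\). The sketch below checks numerically that the first and last lines agree; the function names, logits, and soft label are illustrative choices, not part of the original notes.

```python
import math

def loss_direct(o, y):
    """First line: -sum_j y_j log(exp(o_j) / sum_k exp(o_k))."""
    total = sum(math.exp(ok) for ok in o)
    return -sum(yj * math.log(math.exp(oj) / total) for oj, yj in zip(o, y))

def loss_simplified(o, y):
    """Last line: log sum_k exp(o_k) - sum_j y_j o_j (valid when sum_j y_j = 1)."""
    log_sum_exp = math.log(sum(math.exp(ok) for ok in o))
    return log_sum_exp - sum(yj * oj for oj, yj in zip(o, y))

o = [2.0, -1.0, 0.5]  # hypothetical logits
y = [0.2, 0.5, 0.3]   # hypothetical label distribution, sums to 1

direct = loss_direct(o, y)
simplified = loss_simplified(o, y)
print(direct, simplified)  # the two values agree up to rounding
```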

Derivative with respect to any logit \(o_j\)

\[\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j\]
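
One way to sanity-check this derivative is to compare it with a central finite-difference estimate of the loss. The sketch below does so for made-up logits and a one-hot label; the helper names and the step size `eps` are assumptions made for the example.

```python
import math

def softmax(o):
    m = max(o)  # shift for numerical stability; does not change the result
    exps = [math.exp(ok - m) for ok in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """log sum_k exp(o_k) - sum_j y_j o_j, the simplified form of the loss."""
    return math.log(sum(math.exp(ok) for ok in o)) - sum(yj * oj for oj, yj in zip(o, y))

def numerical_gradient(o, y, eps=1e-6):
    """Central-difference estimate of d loss / d o_j for every j."""
    grad = []
    for j in range(len(o)):
        plus = o[:j] + [o[j] + eps] + o[j + 1:]
        minus = o[:j] + [o[j] - eps] + o[j + 1:]
        grad.append((loss(plus, y) - loss(minus, y)) / (2 * eps))
    return grad

o = [2.0, -1.0, 0.5]  # hypothetical logits
y = [0.0, 1.0, 0.0]   # one-hot label

analytic = [p - yj for p, yj in zip(softmax(o), y)]  # softmax(o)_j - y_j
numeric = numerical_gradient(o, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic)
print(max_diff)  # small: the two gradients agree up to finite-difference error
```

The gradient components sum to zero, since both softmax(o) and y sum to 1.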

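The gradient in linear regression (squared loss, with respect to the prediction): \(\hat{\mathbf{y}} - \mathbf{y}\)
The gradient in softmax regression (cross-entropy loss, with respect to the logits): \(\mathrm{softmax}(\mathbf{o}) - \mathbf{y}\)
In both cases the gradient is the difference between the prediction and the target.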