Notes on Logistic Regression
Hypothesis Representation
\(h_\theta(x) = g(\theta^T x)\)
Sigmoid Function
\(g(z) = \dfrac{1}{1 + e^{-z}}\) \(\in (0, 1)\)
Probability
\(P(y = 1 \mid x; \theta) = h_\theta(x)\)
\(P(y = 0 \mid x; \theta) = 1 - h_\theta(x)\)
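A minimal NumPy sketch of the sigmoid and the hypothesis (the function names and the use of NumPy are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x) = P(y = 1 | x; theta)
    return sigmoid(theta @ x)
```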
Decision Boundary
The boundary that separates the region where the model predicts y = 0 from the region where it predicts y = 1
\(\theta^T x \geq 0\) \(\to\) \(h_\theta(x) \geq 0.5\) \(\to\) \(y = 1\)
\(\theta^T x < 0\) \(\to\) \(h_\theta(x) < 0.5\) \(\to\) \(y = 0\)
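A sketch of the resulting prediction rule; thresholding \(\theta^T x\) at 0 is equivalent to thresholding \(h_\theta(x)\) at 0.5 (the function name is illustrative):

```python
def predict(theta, x):
    # y = 1 exactly when theta^T x >= 0, i.e. h_theta(x) >= 0.5
    return 1 if theta @ x >= 0 else 0
```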
Cost Function
\(\displaystyle J(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{cost}(h_\theta(x^{(i)}), y^{(i)})\)
\(\mathrm{cost}(h_\theta(x), y) = - y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x)) = \begin{cases} -\log(h_\theta(x)) & \quad \text{if} \; y = 1 \newline -\log(1 - h_\theta(x)) & \quad \text{if} \; y = 0 \newline \end{cases}\)
so that \(\mathrm{cost}(h_\theta(x), y) = \begin{cases} 0 & \quad \text{if} \; h_\theta(x) = y \newline \to \infty & \quad \text{if} \; y = 1 \; \mathrm{and} \; h_\theta(x) \to 0 \newline \to \infty & \quad \text{if} \; y = 0 \; \mathrm{and} \; h_\theta(x) \to 1 \newline \end{cases}\)
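An unvectorized sketch of the cost that mirrors the summation above; it assumes `X` stores one example per row (with a leading column of ones), `y` holds 0/1 labels, and reuses `sigmoid` from the sketch above:

```python
def cost(theta, X, y):
    # J(theta) = (1/m) * sum_i cost(h_theta(x_i), y_i)
    m = len(y)
    total = 0.0
    for i in range(m):
        h = sigmoid(theta @ X[i])
        total += -y[i] * np.log(h) - (1 - y[i]) * np.log(1 - h)
    return total / m
```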
Vectorization
\(\displaystyle J(\theta) = \frac{1}{m} \left( - y^T \log(g(X \theta)) - (1 - y)^T \log(1 - g(X \theta)) \right)\)
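The same cost written in the vectorized form above (same assumptions as the previous sketch):

```python
def cost_vectorized(theta, X, y):
    # J(theta) = (1/m) * (-y^T log(g(X theta)) - (1-y)^T log(1 - g(X theta)))
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
```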
Gradient Descent
Repeat until convergence
\(\;\;\) \(\displaystyle \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \quad \text{for} \; j \gets 0 \dots n\)
where
\(\;\;\) \(\alpha\): learning rate
\(\;\;\) \(\displaystyle \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \; x_j^{(i)}\)
(update all \(\theta_j\) simultaneously)
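A loop-based gradient descent sketch that mirrors the per-component update above; `alpha` and `num_iters` are assumed hyperparameters, and the single assignment to `theta` gives the simultaneous update:

```python
def gradient_descent(theta, X, y, alpha, num_iters):
    m = len(y)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        # dJ/dtheta_j = (1/m) * sum_i (h_i - y_i) * x_ij
        grad = np.array([np.sum((h - y) * X[:, j]) / m for j in range(len(theta))])
        theta = theta - alpha * grad  # all theta_j updated simultaneously
    return theta
```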
Vectorization
\(\displaystyle \theta := \theta - \alpha \frac{1}{m} X^T (g(X \theta) - y)\)
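The vectorized update as a sketch (same assumptions as before):

```python
def gradient_descent_vectorized(theta, X, y, alpha, num_iters):
    m = len(y)
    for _ in range(num_iters):
        # theta := theta - (alpha/m) * X^T (g(X theta) - y)
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta
```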
Multiclass Classification
one-vs-all
\(y \in \lbrace 1 \dots K \rbrace\)
\(h_\theta(x) \in \mathbb{R} ^ {K}\)
\(h_\theta(x)_k = P(y = k \mid x; \theta)\)
\(\mathrm{prediction} = \arg\max_k \, h_\theta(x)_k\)
where
\(\;\;\) \(K\): number of classes
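A one-vs-all sketch, assuming classes are labeled 1…K, `y` is an integer vector, and the vectorized gradient descent sketch above is available (all names are illustrative):

```python
def one_vs_all(X, y, K, alpha, num_iters):
    # train K binary classifiers; classifier k treats (y == k) as the positive class
    n = X.shape[1]
    Theta = np.zeros((K, n))
    for k in range(1, K + 1):
        y_k = (y == k).astype(float)
        Theta[k - 1] = gradient_descent_vectorized(np.zeros(n), X, y_k, alpha, num_iters)
    return Theta

def predict_one_vs_all(Theta, x):
    # prediction = argmax_k h_theta(x)_k  (classes labeled 1..K)
    return int(np.argmax(sigmoid(Theta @ x))) + 1
```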
Regularization
\(\displaystyle J(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{cost}(h_\theta(x^{(i)}), y^{(i)}) + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2\)
where
\(\;\;\) \(\lambda\): regularization parameter
\(\displaystyle \sum_{j=1}^n \theta_j^2\) excludes the bias term \(\theta_0\)
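A sketch of the regularized cost, with the penalty applied to \(\theta_1 \dots \theta_n\) only (same assumptions as the earlier cost sketches; `lam` stands for \(\lambda\)):

```python
def cost_regularized(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    # penalty sums theta_1..theta_n, excluding the bias term theta_0
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m + penalty
```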