Notes on K-Means
Input
- \(K\) (number of clusters)
- Training set \(\{ x^{(1)}, x^{(2)} \dots x^{(m)} \}\)
Algorithm
Randomly initialize \(K\) cluster centroids \(\mu_1, \mu_2 \dots \mu_K \in \mathbb{R}^n\)
Repeat
\(\;\;\) for \(i\) = \(1\) to \(m\)
\(\;\;\;\;\) \(c^{(i)}\) := index (from \(1\) to \(K\)) of cluster centroid closest to \(x^{(i)}\)
\(\;\;\) for \(k\) = \(1\) to \(K\)
\(\;\;\;\;\) \(\mu_k\) := average (mean) of points assigned to cluster \(k\)
Run the above, say, 100 times with different random initializations; pick the clustering that gave the lowest cost \(J(c^{(1)} \dots c^{(m)}, \mu_1 \dots \mu_K)\)
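The whole procedure (random initialization, the two alternating steps, and keeping the best of many runs by cost) can be sketched in NumPy. This is a minimal illustration, not a reference implementation; the name `run_kmeans` and the defaults `n_runs`/`n_iters` are made up for the example:

```python
import numpy as np

def run_kmeans(X, K, n_runs=100, n_iters=10, rng=None):
    """Run K-means n_runs times from random initializations and keep the
    run with the lowest distortion J. X is (m, n); returns (c, mu, J)."""
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    best = None
    for _ in range(n_runs):
        # Random initialization: K distinct training examples as centroids
        mu = X[rng.choice(m, size=K, replace=False)].copy()
        for _ in range(n_iters):
            # Cluster assignment step: c[i] = index of closest centroid
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            c = dists.argmin(axis=1)
            # Move-centroid step (keep the old centroid if a cluster is empty)
            for k in range(K):
                if np.any(c == k):
                    mu[k] = X[c == k].mean(axis=0)
        # Cost J: mean squared distance to each point's assigned centroid
        J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))
        if best is None or J < best[2]:
            best = (c, mu, J)
    return best
```

Restarting from many random initializations matters because a single run can converge to a poor local optimum of \(J\).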
Optimization Objective
\(\displaystyle J(c^{(1)} \dots c^{(m)}, \mu_1 \dots \mu_K) = \frac{1}{m} \sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}|| ^ 2\)
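Given the assignments \(c^{(i)}\) and centroids \(\mu_k\), the distortion above can be computed in one vectorized expression (a sketch; `kmeans_cost` is an illustrative name, not from the notes):

```python
import numpy as np

def kmeans_cost(X, c, mu):
    """Distortion J: the average squared distance from each example x^(i)
    to the centroid mu_{c^(i)} of the cluster it is assigned to."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))
```

Note that `mu[c]` broadcasts the per-example centroid lookup, so no explicit loop over examples is needed.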
Random Initialization
- Should have \(K < m\)
- Randomly pick \(K\) training examples
- Set \(\mu_1 \dots \mu_K\) equal to these \(K\) examples
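The three bullets above amount to sampling \(K\) distinct training examples as the initial centroids; a minimal NumPy sketch (the name `random_init` is made up for the example):

```python
import numpy as np

def random_init(X, K, rng=None):
    """Set mu_1 .. mu_K to K distinct, randomly chosen training examples.
    Requires K < m (fewer centroids than examples)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=K, replace=False)  # no repeated picks
    return X[idx].copy()
```

Sampling without replacement avoids initializing two centroids to the same example, which would immediately leave one cluster empty.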
Choosing the Number of Clusters
Sometimes you are running K-means to get clusters for some later, downstream purpose. In that case, evaluate K-means based on a metric for how well it performs for that downstream purpose.