K-means clustering is an unsupervised machine learning algorithm used to partition a dataset into distinct groups, or clusters. The goal is to group similar data points together while maximizing the differences between clusters. Here's how it works:
1. Initialization: Choose the number of clusters (K) and randomly initialize K centroids (the center points of the clusters).
2. Assignment Step: Assign each data point to the nearest centroid, forming K clusters.
3. Update Step: Recalculate each centroid as the mean of all data points in its cluster.
4. Repeat: Continue the assignment and update steps until the centroids no longer change significantly or a set number of iterations is reached.
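The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `kmeans`, the parameters `max_iter`, `tol`, and `seed`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain K-means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points.
        #    An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids move by less than tol.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data (assumed for this example): two well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated, the two centroids should land near the blob centers. Real datasets are rarely this tidy, which is why the library implementation used in the rest of this guide is the better choice in practice.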
K-means is popular due to its simplicity and efficiency, but it has some
limitations, such as sensitivity to the initial placement of centroids and
difficulty with clusters of varying shapes and sizes.
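To see the sensitivity to initial centroids, one option is to run K-means several times with a single random initialization each and compare the resulting WCSS values. The sketch below uses synthetic data from `make_blobs` (an assumption for illustration) and scikit-learn's `init` and `n_init` parameters; the exact numbers depend on the data and library version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, assumed for this example only.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=0)

# One purely random initialization per run: each seed may end in a different local optimum.
single_run = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

# k-means++ seeding with 10 restarts keeps the best of several runs.
best_of_ten = KMeans(
    n_clusters=5, init="k-means++", n_init=10, random_state=0
).fit(X).inertia_

print(min(single_run), max(single_run), best_of_ten)
```

A wide gap between the smallest and largest single-run value suggests that some initializations got stuck in a poor solution. Multiple restarts are a common way to reduce that risk, though they do not remove it entirely.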
Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
Step 2: Load Your Data
Load your dataset using pandas. K-means requires numerical features with no missing values; Step 3 covers how to handle any that are present.
data = pd.read_csv('your_data.csv')
# Display the first few rows
print(data.head())
Step 3: Data Preprocessing
- Handle Missing Values: Remove or impute missing values.
- Feature Scaling: Standardize your data so that features measured on larger scales do not dominate the distance calculation.
data = data.dropna() # or use imputation
# Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
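As a quick sanity check on what standardization does, the sketch below applies the same two steps to a small made-up table. The column names and values are hypothetical and chosen only so the scale difference is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: income is in the tens of thousands, age in tens.
toy = pd.DataFrame({
    "income": [30000, 45000, 60000, 120000, np.nan],
    "age": [25, 32, 47, 51, 38],
})

toy = toy.dropna()  # drops the row with the missing income
scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns contribute on a comparable scale to the Euclidean distances that K-means relies on.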
Step 4: Choose the Number of Clusters (K)
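The Elbow Method helps you choose K. Fit K-means for a range of K values, plot the within-cluster sum of squares (WCSS) against K, and look for the "elbow": the point after which adding more clusters yields only small reductions in WCSS.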
wcss = []  # WCSS for each candidate K
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()
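The WCSS plotted above is what scikit-learn exposes as `inertia_`: the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes it by hand on synthetic data (an assumption for illustration) to make the definition concrete.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, assumed for this example only.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = model.cluster_centers_[model.labels_]
wcss_manual = np.sum((X - assigned_centroids) ** 2)

print(wcss_manual, model.inertia_)
```

The two printed values should agree up to floating-point error. Because WCSS tends to keep decreasing as K grows, the elbow is about diminishing returns rather than a true minimum.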
Step 5: Apply K-means Clustering
Choose a value for K based on the Elbow Method and fit the model.
k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(scaled_data)
data['Cluster'] = clusters
Step 6: Visualize the Clusters
For visualization, you might want to reduce dimensions using PCA or simply plot the first two features.
plt.figure(figsize=(10, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=clusters, cmap='viridis')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster')
plt.show()
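If your data has more than two features, a PCA projection usually gives a more faithful 2D picture than the first two columns. The sketch below is one way to do it with scikit-learn's `PCA`; the synthetic five-feature dataset stands in for `scaled_data` and is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for scaled_data: synthetic data with five features.
X, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)
scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)

# Project to two principal components for plotting only;
# the clustering itself used all five features.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept in the 2D view
```

You can then pass `components[:, 0]` and `components[:, 1]` to `plt.scatter` in place of the first two scaled features, and label the axes as the first and second principal components. If the retained variance is low, the 2D plot may hide separation that exists in the full feature space.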
Step 7: Analyze Results
Evaluate the clustering results by analyzing the characteristics of each cluster. You can check the mean values of features per cluster or visualize them further.
cluster_analysis = data.groupby('Cluster').mean()
print(cluster_analysis)
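Another way to inspect the clusters is to look at the centroids themselves. Because the model was fit on standardized data, the centroids are in scaled units; `scaler.inverse_transform` maps them back to the original ones. The sketch below uses a small hypothetical table (column names and values are made up) to show the idea, along with cluster sizes.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two obvious groups.
toy = pd.DataFrame({
    "income": [30000, 32000, 31000, 90000, 95000, 92000],
    "age": [25, 27, 26, 50, 52, 51],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)

# Centroids live in scaled space; map them back to the original units.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=toy.columns,
)
print(centers)

# Cluster sizes help spot very small or dominant clusters.
sizes = pd.Series(kmeans.labels_).value_counts().sort_index()
print(sizes)
```

For K-means, the back-transformed centroids match the per-cluster feature means, so this is mainly a convenient cross-check on the `groupby` result above.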