K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into k distinct groups, or clusters. The goal is to group similar data points together, so that points are close to the center of their own cluster and far from the centers of the others. Here's how it works:

1. Initialization: Choose the number of clusters k and randomly initialize k centroids (the center points of the clusters).

2. Assignment Step: Assign each data point to the nearest centroid, forming k clusters.

3. Update Step: Recalculate the centroids by taking the mean of all data points in each cluster.

4. Repeat: Continue the assignment and update steps until the centroids no longer change significantly or a set number of iterations is reached.
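The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name kmeans_numpy, its parameters, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans_numpy(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: mean of the points in each cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Two well-separated synthetic blobs (made-up data for illustration)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
demo_labels, demo_centroids = kmeans_numpy(X_demo, k=2)
print(demo_centroids)
```

On data this cleanly separated, each blob should end up in its own cluster. Real datasets are rarely this tidy, which is why the library version used in the steps below is the practical choice.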

K-means is popular due to its simplicity and efficiency, but it has some limitations, such as sensitivity to the initial placement of centroids and difficulty with clusters of varying shapes and sizes.
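To see the sensitivity to initial centroids in practice, the sketch below runs scikit-learn's KMeans several times with a single random start each, then once with k-means++ seeding and 10 restarts. The dataset comes from make_blobs and is synthetic; the exact numbers will depend on your scikit-learn version, so treat this as an illustration rather than a benchmark.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four blobs (made up for illustration)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

# One random start per run: the final WCSS can differ from seed to seed
single_run_inertias = []
for seed in range(5):
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    single_run_inertias.append(km.inertia_)
print('single random starts:', [round(v, 1) for v in single_run_inertias])

# k-means++ seeding plus 10 restarts; scikit-learn keeps the lowest-WCSS run
multi = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(X)
print('k-means++ with 10 restarts:', round(multi.inertia_, 1))
```

If the single-start values differ from one another, some runs settled in a worse local optimum. Multiple restarts are a common way to reduce that risk.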

Step 1: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Step 2: Load Your Data

You can load your dataset using Pandas. K-means works on numerical features only and cannot handle missing values, so check for both after loading; Step 3 deals with any missing values you find.

# Example: Load a CSV file
data = pd.read_csv('your_data.csv')
# Display the first few rows
print(data.head())
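A quick way to check both requirements is to list the non-numeric columns and count the missing values per column. The small DataFrame below is a hypothetical stand-in for your CSV file, and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv('your_data.csv')
data = pd.DataFrame({
    'income': [40.0, 55.0, np.nan, 72.0],
    'age': [23, 35, 41, 58],
    'city': ['A', 'B', 'A', 'C'],
})

# Columns that K-means cannot use as they are
non_numeric_cols = data.select_dtypes(exclude='number').columns.tolist()
# Number of missing values in each column
missing_per_col = data.isna().sum()

print('Non-numeric columns:', non_numeric_cols)
print(missing_per_col[missing_per_col > 0])
```

Non-numeric columns need to be dropped or encoded before clustering. Which one is appropriate depends on what the column means.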

Step 3: Data Preprocessing

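1. Handle Missing Values: Remove or impute missing values.
2. Feature Scaling: Standardize your data so that features measured on large scales do not dominate the distance calculation.

# Keep only numeric columns; K-means cannot use text columns directly
data = data.select_dtypes(include='number')
# Handle missing values if any
data = data.dropna()  # or use imputation
# Standardize the features (zero mean, unit variance per column)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)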
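If dropping rows would discard too much data, imputation is the alternative mentioned above. One possible sketch uses scikit-learn's SimpleImputer with the column median, chained with StandardScaler in a pipeline. The median strategy and the example values are assumptions, so pick whatever suits your data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with gaps
data = pd.DataFrame({
    'income': [40.0, 55.0, np.nan, 72.0, 61.0],
    'age': [23.0, np.nan, 41.0, 58.0, 35.0],
})

# Fill each gap with the column median, then standardize
preprocess = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
scaled_data = preprocess.fit_transform(data)

print(scaled_data.shape)
```

Every row is kept, so the cluster labels produced later line up with the original DataFrame without any re-indexing.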

Step 4: Choose the Number of Clusters (K)

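The Elbow Method helps you choose K. Fit K-means for a range of K values, plot the within-cluster sum of squares (WCSS) against K, and look for the "elbow": the point after which adding more clusters no longer reduces WCSS by much.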

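wcss = []  # within-cluster sum of squares for each K
for i in range(1, 11):
    # n_init=10 reruns K-means from 10 different starts and keeps the best
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)  # inertia_ is scikit-learn's name for WCSS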
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()
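WCSS is the sum, over all clusters, of the squared distances between each point and its cluster's centroid. The sketch below recomputes it by hand on synthetic make_blobs data and compares the result with the inertia_ attribute, as a sanity check on what the elbow plot is actually measuring.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (made up for illustration)
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Sum of squared distances from each point to its own cluster's centroid
manual_wcss = sum(
    np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
    for j in range(3)
)

print(manual_wcss, km.inertia_)
```

The two printed values should agree up to floating-point error.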

Step 5: Apply K-means Clustering

Choose a value for K based on the Elbow Method and fit the model.

# Assuming the optimal K is found to be 3
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
# fit_predict fits the model and returns one cluster label per row
clusters = kmeans.fit_predict(scaled_data)

# Add cluster labels to the original data
data['Cluster'] = clusters

Step 6: Visualize the Clusters

For visualization, you might want to reduce dimensions using PCA or simply plot the first two features.

# If you want to visualize the first two dimensions
plt.figure(figsize=(10, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=clusters, cmap='viridis')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster')
plt.show()
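When the data has more than two features, the first two columns may not show the clusters well. The PCA option mentioned above could look like the following sketch, which projects the scaled data onto its first two principal components before plotting. The five-feature dataset is synthetic and only there to make the example self-contained.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data (made up for illustration)
X, _ = make_blobs(n_samples=150, n_features=5, centers=3, random_state=0)
scaled_data = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled_data)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled_data)
print('Variance explained by 2 components:', pca.explained_variance_ratio_.sum())

plt.figure(figsize=(10, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='viridis')
plt.title('K-means Clustering (PCA projection)')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(label='Cluster')
plt.show()
```

The clustering itself still runs on all features; PCA is used here only to draw the picture. If the two components explain little of the variance, the plot may hide structure that exists in the full feature space.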

Step 7: Analyze Results

Evaluate the clustering results by analyzing the characteristics of each cluster. You can check the mean values of features per cluster or visualize them further.

# Mean of every numeric feature within each cluster (original units)
cluster_analysis = data.groupby('Cluster').mean(numeric_only=True)
print(cluster_analysis)
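Two other summaries are often useful alongside the per-cluster means: the size of each cluster, and the centroids converted back from standardized units to the original units with the scaler's inverse_transform. The sketch below shows both on synthetic data; the feature names income and spend are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with three groups (made up for illustration)
rng = np.random.default_rng(0)
features = ['income', 'spend']
data = pd.DataFrame(
    np.vstack([
        rng.normal([30, 20], 2, size=(40, 2)),
        rng.normal([60, 50], 2, size=(40, 2)),
        rng.normal([90, 10], 2, size=(40, 2)),
    ]),
    columns=features,
)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
data['Cluster'] = kmeans.fit_predict(scaled_data)

# Centroids live in standardized space; map them back to original units
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features,
)
centers_original.index.name = 'Cluster'

# Number of points assigned to each cluster
cluster_sizes = data['Cluster'].value_counts().sort_index()

print(centers_original)
print(cluster_sizes)
```

Because standardization is a linear rescaling, the back-transformed centroids should match the per-cluster means from groupby. A very small cluster is worth a closer look, since it may be capturing outliers rather than a real group.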
