
K-Means Clustering

K-means clustering is an unsupervised machine learning algorithm used to partition a dataset into distinct groups, or clusters. The goal is to group similar data points together (minimizing the variation within each cluster) while keeping different clusters well separated. Here's how it works:

1. Initialization: Choose the number of clusters K and randomly initialize K centroids (the center points of the clusters).

2. Assignment Step: Assign each data point to the nearest centroid, forming K clusters.

3. Update Step: Recalculate the centroids by taking the mean of all data points in each cluster.

4. Repeat: Continue the assignment and update steps until the centroids no longer change significantly or a set number of iterations is reached.
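The four steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration on made-up 2-D data (two synthetic blobs), not the scikit-learn implementation used later in this post:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 2-D data: two well-separated blobs (purely synthetic)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(X, k=2)
print(centroids)
```

Note that this sketch does not handle the corner case of a centroid losing all its points; production implementations such as scikit-learn's do.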

K-means is popular due to its simplicity and efficiency, but it has some limitations, such as sensitivity to the initial placement of centroids and difficulty with clusters of varying shapes and sizes.

Step 1: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Step 2: Load Your Data

You can load your dataset using Pandas. Make sure your data is numerical and does not contain missing values.

# Example: Load a CSV file
data = pd.read_csv('your_data.csv')
# Display the first few rows
print(data.head())

Step 3: Data Preprocessing

  1. Handle Missing Values: Remove or impute missing values.
  2. Feature Scaling: Standardize your data for better clustering performance.
# Handle missing values if any
data = data.dropna()  # or use imputation
# Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
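If dropping rows with dropna() would discard too much data, imputation is the usual alternative. Here is a small sketch using scikit-learn's SimpleImputer to fill missing values with the column mean (the DataFrame and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame with one missing value (hypothetical columns)
df = pd.DataFrame({'income': [40.0, np.nan, 60.0],
                   'age':    [25.0, 30.0, 35.0]})

# Replace each NaN with the mean of its column instead of dropping the row
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Other strategies such as 'median' or 'most_frequent' are available via the same parameter.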

Step 4: Choose the Number of Clusters (K)

Using the Elbow Method, you can find a good value for K: plot the within-cluster sum of squares (WCSS, exposed as the inertia_ attribute in scikit-learn) against K and look for the "elbow" where the curve stops dropping sharply.

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()
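The elbow is sometimes ambiguous, so the silhouette score is a common complement: it measures how well each point sits inside its own cluster versus the nearest other cluster (higher is better). A sketch on synthetic blobs standing in for scaled_data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 clearly separated clusters (stand-in for scaled_data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

Scores range from -1 to 1; values near 1 mean tight, well-separated clusters.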

Step 5: Apply K-means Clustering

Choose a value for K based on the Elbow Method and fit the model.

# Assuming the optimal K is found to be 3
k = 3
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled_data)
 
# Add cluster labels to the original data
data['Cluster'] = clusters

Step 6: Visualize the Clusters

For visualization, you might want to reduce dimensions using PCA or simply plot the first two features.

# If you want to visualize the first two dimensions
plt.figure(figsize=(10, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=clusters, cmap='viridis')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster')
plt.show()
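When the data has more than two features, plotting the first two columns can be misleading. A sketch of the PCA route mentioned above, using synthetic 6-dimensional blobs as a stand-in for scaled_data:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Higher-dimensional synthetic data standing in for scaled_data
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=42)
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Project the 6-D points onto their first two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

plt.figure(figsize=(10, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], c=clusters, cmap='viridis')
plt.title('K-means Clusters in PCA Space')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.colorbar(label='Cluster')
plt.show()
```

pca.explained_variance_ratio_ tells you how much of the total variance the two plotted components actually capture.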

Step 7: Analyze Results

Evaluate the clustering results by analyzing the characteristics of each cluster. You can check the mean values of features per cluster or visualize them further.

# Analyze clusters
cluster_analysis = data.groupby('Cluster').mean()
print(cluster_analysis)
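Because clustering was done on standardized data, the fitted centroids are in standardized units. StandardScaler's inverse_transform maps them back to the original scale, which makes them easier to interpret. A sketch on synthetic data with made-up column names:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for the CSV used above
rng = np.random.default_rng(42)
data = pd.DataFrame({'height': rng.normal(170, 10, 100),
                     'weight': rng.normal(70, 8, 100)})

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)

# Map the centroids back from standardized units to the original scale
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=data.columns)
print(centers_original)
```

Comparing these back-transformed centroids with the per-cluster means from groupby is a quick sanity check on the clustering.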
