DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based clustering algorithm that groups data points lying in dense regions of feature space. It handles clusters of arbitrary shape and size, identifies noise points (outliers) automatically, and does not require the number of clusters to be specified up front. Because it uses a single global density threshold, however, it can struggle when clusters have very different densities.

Step 1: Initialize Parameters

Define two important parameters:

  • Epsilon (ε): The maximum distance between two points for them to be considered neighbors. A common heuristic for choosing it is sketched after this list.
  • Minimum Points (minPts): The minimum number of points required in an ε-radius neighborhood for a point to be considered a core point.
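
A practical note on choosing ε: a widely used heuristic (not part of DBSCAN itself) is the k-distance plot, where you sort every point's distance to its k-th nearest neighbor and pick ε near the "elbow" of the curve. A minimal sketch, assuming scikit-learn and the same moon-shaped data used later in this post:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

k = 5  # usually set equal to minPts
# n_neighbors=k + 1 because each query point is its own nearest neighbor (distance 0)
neighbors = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = neighbors.kneighbors(X)

plt.plot(np.sort(distances[:, k]))  # column k holds the k-th true neighbor distance
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.title("k-distance plot: pick epsilon near the elbow")
plt.show()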

Step 2: Label Each Point as Core, Border, or Noise

For each data point P in the dataset:

  1. Find all points within the ε radius of P (the ε-neighborhood of P).
  2. Core Point: If P has at least minPts points within its ε-neighborhood, it’s marked as a core point.
  3. Border Point: If P has fewer than minPts points in its ε-neighborhood but lies within the ε-neighborhood of a core point, it’s labeled a border point.
  4. Noise Point: If P satisfies neither condition (not enough neighbors, and not within any core point's neighborhood), it’s marked as noise. The sketch after this list makes these rules concrete.
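
Here is a minimal from-scratch sketch of this labeling step, using brute-force pairwise distances (the function and variable names are illustrative, not scikit-learn's API):

import numpy as np

def label_points(X, eps, min_pts):
    # Label each row of X as 'core', 'border', or 'noise' using brute-force distances.
    n = len(X)
    # Pairwise Euclidean distances; each point's neighborhood includes itself,
    # matching scikit-learn's convention
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    point_types = []
    for i in range(n):
        if is_core[i]:
            point_types.append("core")
        elif is_core[neighborhoods[i]].any():  # inside some core point's neighborhood
            point_types.append("border")
        else:
            point_types.append("noise")
    return point_types, neighborhoods, is_core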

Step 3: Form Clusters from Core Points

  1. Start at an unvisited core point: Every unvisited core point seeds a new cluster.
  2. Expand the cluster: Add all points within ε of the core point to the cluster, marking each added point as "visited."
  3. For each new point added to the cluster:
    • If it’s also a core point, repeat the expansion by adding all of its neighbors within ε to the cluster.
    • If it’s a border point, add it to the cluster but don’t expand from it (it doesn’t have enough local density to extend the cluster). This expansion is sketched in code after the list.
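
The expansion is essentially a breadth-first search that only continues from core points. Continuing the earlier sketch (it reuses the neighborhoods and is_core values returned by label_points above):

import numpy as np
from collections import deque

def expand_clusters(n, neighborhoods, is_core):
    # Assign cluster ids by breadth-first expansion from core points; -1 means noise.
    cluster_ids = np.full(n, -1)
    next_id = 0
    for i in range(n):
        if not is_core[i] or cluster_ids[i] != -1:
            continue  # only unvisited core points seed new clusters
        cluster_ids[i] = next_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            if not is_core[p]:
                continue  # border points join the cluster but are never expanded
            for q in neighborhoods[p]:
                if cluster_ids[q] == -1:  # unvisited neighbor
                    cluster_ids[q] = next_id
                    queue.append(q)
        next_id += 1
    return cluster_ids

Points left at -1 after all expansions are exactly the noise points of Steps 4 and 5, so the remaining steps fall out of this loop automatically.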

Step 4: Mark Noise Points

Any points not assigned to any cluster during the expansion process are labeled as noise or outliers.

Step 5: Repeat for All Unvisited Points

Continue with the remaining unvisited points:

  • If the next unvisited point is a core point, start a new cluster and expand it.
  • If it's a border point or noise, mark it accordingly without forming a new cluster.

Step 6: Output the Clusters

When all points are visited, DBSCAN completes, resulting in clusters of core and border points, and isolated points marked as noise.
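
To see the whole pipeline end to end, the two hypothetical helpers sketched above (label_points and expand_clusters) can be run on the same moon-shaped data used in the implementation below:

from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

point_types, neighborhoods, is_core = label_points(X, eps=0.2, min_pts=5)
cluster_ids = expand_clusters(len(X), neighborhoods, is_core)

print("Clusters found:", len(set(cluster_ids)) - (1 if -1 in cluster_ids else 0))
print("Noise points:", int((cluster_ids == -1).sum()))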

Implementation

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate synthetic data (e.g., with two moon-shaped clusters)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Define DBSCAN parameters
epsilon = 0.2  # Radius for neighborhood
min_samples = 5  # Minimum number of points for a core point

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
labels = dbscan.fit_predict(X)

# Plotting the clusters and noise
# Identify unique labels (sorted so colors are assigned deterministically)
unique_labels = sorted(set(labels))
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

# Plot each cluster with a different color, mark noise as black
for label, color in zip(unique_labels, colors):
    if label == -1:  # Noise points have a label of -1
        color = "black"  # Mark noise as black
    class_members = (labels == label)
    plt.plot(X[class_members, 0], X[class_members, 1], "o", markerfacecolor=color, markeredgecolor="k", markersize=6)

plt.title("DBSCAN Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
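
Continuing the script above, the labels array is all you need to summarize the result: cluster ids start at 0 and noise points get the label -1. The fitted estimator also exposes core_sample_indices_, the indices of the core points it found:

# Summarize the clustering (label -1 marks noise and is not a cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")
print("Core points found:", len(dbscan.core_sample_indices_))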



