Implementing DBSCAN algorithm using Sklearn

Last Updated : 1 Apr, 2026

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups closely packed points and marks points in low-density regions as outliers. It does not require the number of clusters to be specified in advance and can detect clusters of arbitrary shape. This article uses scikit-learn's DBSCAN class to identify clusters and detect noise in a dataset.

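DBSCAN works in three stages:

It identifies core points: points that have at least min_samples points (counting the point itself) within distance eps.
It expands each cluster from a core point by adding every point within eps of it, and continues the expansion through any added points that are themselves core points.
It marks points that belong to no cluster as noise (outliers).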
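
The behaviour described above can be checked on a tiny example. The following is a minimal sketch, assuming five hand-picked points and illustrative values eps=0.15 and min_samples=3 that are not part of the walkthrough below. It reads the fitted model's labels_ and core_sample_indices_ attributes; points that join a cluster without being core points are commonly called border points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: four points spaced 0.1 apart on a line,
# plus one far-away point that should end up as noise.
X_toy = np.array([[0.0, 0.0], [0.0, 0.1], [0.0, 0.2], [0.0, 0.3], [5.0, 5.0]])

# Illustrative parameters chosen for this toy data only.
toy_model = DBSCAN(eps=0.15, min_samples=3).fit(X_toy)

toy_labels = toy_model.labels_             # cluster id per point, -1 = noise
core_idx = toy_model.core_sample_indices_  # indices of the core points

# Derive boolean masks for the three roles a point can play.
is_core = np.zeros(len(X_toy), dtype=bool)
is_core[core_idx] = True
is_noise = toy_labels == -1
is_border = ~is_core & ~is_noise           # in a cluster, but not core

print("labels:", toy_labels.tolist())                # [0, 0, 0, 0, -1]
print("core:  ", np.flatnonzero(is_core).tolist())   # [1, 2]
print("border:", np.flatnonzero(is_border).tolist()) # [0, 3]
print("noise: ", np.flatnonzero(is_noise).tolist())  # [4]
```

The two middle points each have three points within eps (including themselves), so they are core points. The two end points have only two, so they join the cluster as border points, and the isolated point is labelled -1.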

Step By Step Implementation

Here we implement the DBSCAN clustering algorithm on a moon-shaped dataset using Scikit-learn and visualize the results.

Step 1: Import Required Libraries

Import the necessary libraries: numpy for numerical operations, matplotlib.pyplot for visualization, make_moons to create a sample dataset, DBSCAN for clustering and NearestNeighbors to estimate distances for choosing epsilon.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

Step 2: Generate Moon-shaped Dataset

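Here we generate a 2D moon-shaped dataset with 5000 points and Gaussian noise with a standard deviation of 0.05. The true labels returned by make_moons are discarded because DBSCAN is unsupervised.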

Python

X, _ = make_moons(n_samples=5000, noise=0.05, random_state=42)

Step 3: Visualize the Dataset

Before clustering, we plot the dataset to understand its structure. Smaller markers and semi-transparency help in visualizing large datasets clearly.

Python

plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c='blue', s=10, alpha=0.5, edgecolor='k')
plt.title('Moon-shaped Dataset (5000 points)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Output:

[Figure: Dataset]

Step 4: Plot k-distance Graph for Epsilon Selection

The k-distance graph plots, in ascending order, the distance from each point to its k-th nearest neighbor. The “elbow” of the curve, where the distance starts to rise sharply, is a reasonable starting value for epsilon (eps) in DBSCAN.

Python

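def plot_k_distance_graph(X, k):
    # Fit a nearest-neighbor index and query it with the same points.
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(X)
    distances, _ = neigh.kneighbors(X)
    # Each point is returned as its own first neighbor (distance 0), so
    # column k-1 counts the point itself, as DBSCAN's min_samples does.
    distances = np.sort(distances[:, k-1])

    plt.figure(figsize=(10, 6))
    plt.plot(distances, marker='o', markersize=3)
    plt.xlabel('Points sorted by distance')
    plt.ylabel(f'{k}-th nearest neighbor distance')
    plt.title('K-distance Graph')
    plt.grid(True)
    plt.show()

# k matches the min_samples value used in Step 5.
plot_k_distance_graph(X, k=10)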

Output:

[Figure: K-Distance Graph]
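
Reading the elbow off the plot by eye can be supplemented with a simple numeric heuristic. The sketch below is one possible approach and not something scikit-learn provides: the helper estimate_elbow_index is a hypothetical name, and it picks the point on the sorted curve that lies farthest below the straight line joining the curve's first and last points. The toy_curve values are made up for illustration.

```python
import numpy as np

def estimate_elbow_index(sorted_distances):
    """Hypothetical helper: index of the point farthest below the chord
    joining the first and last points of an ascending curve."""
    y = np.asarray(sorted_distances, dtype=float)
    n = len(y)
    if n < 2 or y[-1] <= y[0]:
        raise ValueError("need an ascending curve with at least two distinct values")
    # Rescale both axes to [0, 1] so neither one dominates the comparison.
    x_norm = np.arange(n, dtype=float) / (n - 1)
    y_norm = (y - y[0]) / (y[-1] - y[0])
    # After rescaling, the chord is the line y = x. A curve that stays flat
    # and then rises sharply sits below it, with the largest gap at the elbow.
    gap = x_norm - y_norm
    return int(np.argmax(gap))

# Made-up sorted k-distances: flat at first, then rising steadily.
toy_curve = np.array([0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.06, 0.08, 0.10])
elbow_idx = estimate_elbow_index(toy_curve)
print("elbow index:", elbow_idx)               # 5
print("suggested eps:", toy_curve[elbow_idx])  # 0.02
```

In the walkthrough, the sorted array computed inside plot_k_distance_graph would be the input. The result is only a starting point and should be checked against the plot and the resulting clusters.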

Step 5: Apply DBSCAN Clustering

Here we apply DBSCAN on the dataset with the chosen eps and min_samples. DBSCAN automatically identifies clusters of varying shapes and sizes and labels noise points as -1.

Python

epsilon = 0.12
min_samples = 10
dbscan_model = DBSCAN(eps=epsilon, min_samples=min_samples)
cluster_labels = dbscan_model.fit_predict(X)
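
Because the result depends strongly on eps, it can help to compare a few candidate values before settling on one. The following is a rough sketch under stated assumptions: summarize_eps_sweep is a hypothetical helper, the candidate values are arbitrary, and a smaller 1000-point dataset is used so the sweep runs quickly.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Smaller stand-in dataset (assumption: 1000 points instead of 5000).
X_demo, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

def summarize_eps_sweep(X, eps_values, min_samples=10):
    """Hypothetical helper: map each eps to (number of clusters, number of noise points)."""
    summary = {}
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels) - {-1})   # ignore the noise label
        n_noise = int(np.sum(labels == -1))
        summary[eps] = (n_clusters, n_noise)
    return summary

# Arbitrary candidate values, from very small to fairly large.
sweep = summarize_eps_sweep(X_demo, eps_values=[0.02, 0.05, 0.12, 0.3])
for eps, (n_clusters, n_noise) in sweep.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

In general, a very small eps tends to leave many points as noise, while a very large eps tends to merge separate groups into one cluster.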

Step 6: Visualize DBSCAN Clustering Results

Plot each cluster with a unique color. Noise points are highlighted in red. Smaller marker size and transparency make the visualization clear for 5000 points.

Python

plt.figure(figsize=(10, 6))

# Sort the labels so colors and legend order are stable; -1 (noise) comes first.
unique_labels = sorted(set(cluster_labels))
colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels)))

for label, col in zip(unique_labels, colors):
    # Boolean mask selecting the points assigned to this label
    class_member_mask = (cluster_labels == label)
    xy = X[class_member_mask]
    if label == -1:
        # Noise points are drawn in red
        plt.scatter(xy[:, 0], xy[:, 1], c='red', s=10, alpha=0.5, edgecolor='k', label='Noise')
    else:
        plt.scatter(xy[:, 0], xy[:, 1], c=[col], s=10, alpha=0.5, edgecolor='k', label=f'Cluster {label}')

plt.title('DBSCAN Clustering Results (5000 points)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

Output:

[Figure: Result]

Step 7: Summary of Clusters

Here we summarize the number of clusters detected by DBSCAN and the points that were classified as noise.

Python

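# Exclude the noise label (-1) when counting clusters
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
# Count noise points directly on the NumPy array
n_noise = int(np.sum(cluster_labels == -1))
print(f'Number of clusters found: {n_clusters}')
print(f'Number of noise points: {n_noise}')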

Output:

Number of clusters found: 2
Number of noise points: 0
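
Beyond the two totals, the size of each cluster is often useful. A minimal sketch, assuming a small made-up label array in place of cluster_labels and a hypothetical helper named cluster_sizes:

```python
import numpy as np

# Made-up labels standing in for the cluster_labels array; -1 marks noise.
example_labels = np.array([0, 0, 1, -1, 1, 1, 0, -1])

def cluster_sizes(labels):
    """Hypothetical helper: map each label (including -1 for noise) to its point count."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

sizes = cluster_sizes(example_labels)
for label, size in sizes.items():
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {size} points")
# noise: 2 points
# cluster 0: 3 points
# cluster 1: 3 points
```

Passing the cluster_labels array from Step 5 to the same function would give the per-cluster counts for the moon-shaped dataset.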
