Building Spotify-Style Playlists with K-Means

Spotify’s catalogue exceeds 100 million tracks, which means no human can possibly go through all that music and curate playlists, so it begs the question how does the app consistently surface a 30 song playlist that feels like it was hand picked by you.

The answer lies in machine learning, specifically, a simple algorithm like K-Means clustering. It does not need labelled “good” or “bad” data. Instead, it looks at the shape of every song, for the case of spotify it would look at the tempo, energy, loudness, acousticness and group tracks that live in the same sonic neighbourhood or rather songs that have similar patterns.

In this tutorial, I try to replicate that process end-to-end. By the final section, you will have working python code that ingests audio features from a 10,000-track Spotify dataset, discovers hidden groupings in the data, and outputs coherent, themed playlists.

The Pipeline

Before we touch any code, let’s map the pipeline. The entire workflow moves through five stages:

Ingest: load audio features for a target set of tracks.
Clean: drop nulls and isolate numeric metadata.
Scale: standardise every feature so no single dimension dominates the distance calculation.
Cluster: run K-Means to discover natural song groupings.
Playlist: extract the cluster closest to a user’s taste profile and serve the top tracks.

Each stage feeds directly into the next. If you skip one, especially scaling and the downstream results would degrade significantly. We will work through each stage in that order.

The Data

As humans we when we listen to music we hear melodies, harmonies, and lyrics. An algorithm on the other hand sees a vector of numbers. Spotify’s Audio Features API returns a numeric fingerprint for every track and these are the dimensions K-Means will use to gauge similarity.

The key features we will work with:

Energy: perceptual intensity and activity. Death metal scores high; a solo piano ballad would score low.
Danceability: how suitable a track is for dancing, based on tempo, rhythm stability, and beat strength.
Valence: musical positiveness. High valence = happy, cheerful. Low valence = sad, angry.
Loudness: overall loudness in decibels, averaged across the entire track.
Tempo: estimated beats per minute. The overall speed of the track.
Speechiness: the presence of spoken words. Talk shows score high; instrumentals score near zero.

Each feature becomes an axis in a multi-dimensional space, by this two songs that are close together in this space sound similar even if they come from completely different genres. K-Means exploits exactly this geometric proximity to gauge similarity and deliver interesting playlists on your phone.

Let’s load the dataset and see what these fingerprints look like in practice.

Code

data_path = Path("/Users/nangosyah/Documents/Apps & Web/nangosyah.github.io/posts/spotify-kmeans/ML_spotify_data.csv")
spotify = pd.read_csv(data_path)
spotify.head()

	name	artists	popularity	danceability	valence	energy	key	liveness	loudness	speechiness	tempo
0	We're For The Dark - Remastered 2010	['Badfinger']	22	0.678	0.559	0.432	3	0.0727	-12.696	0.0334	117.674
1	Sixty Years On - Piano Demo	['Elton John']	25	0.456	0.259	0.368	6	0.1560	-10.692	0.0280	143.783
2	Got to Find Another Way	['The Guess Who']	21	0.433	0.833	0.724	0	0.1700	-9.803	0.0378	84.341
3	Feelin' Alright - Live At The Fillmore East/1970	['Joe Cocker']	22	0.436	0.870	0.914	5	0.8550	-6.955	0.0610	174.005
4	Caravan - Take 7	['Van Morrison']	23	0.669	0.564	0.412	7	0.4010	-13.095	0.0679	78.716

The dataset contains 10,000 songs and 12 columns. Now let’s isolate the numeric audio features, the dimensions K-Means will operate on.

Code

spotify_num = spotify.select_dtypes(include=np.number)
print(f"Shape: {spotify_num.shape[0]} songs × {spotify_num.shape[1]} features")
spotify_num.describe().round(2)

Shape: 10000 songs × 10 features

	popularity	danceability	valence	energy	explicit	key	liveness	loudness	speechiness	tempo
count	10000.00	10000.00	10000.00	10000.00	10000.0	10000.00	10000.00	10000.00	10000.00	10000.00
mean	37.56	0.55	0.52	0.59	0.1	5.21	0.21	-9.82	0.08	120.18
std	12.56	0.18	0.26	0.25	0.3	3.56	0.19	5.32	0.10	30.26
min	0.00	0.00	0.00	0.00	0.0	0.00	0.00	-60.00	0.00	0.00
25%	27.00	0.43	0.31	0.41	0.0	2.00	0.09	-12.35	0.03	96.19
50%	36.00	0.56	0.52	0.62	0.0	5.00	0.13	-8.76	0.05	118.33
75%	46.00	0.68	0.74	0.80	0.0	9.00	0.26	-6.07	0.08	139.61
max	86.00	0.99	1.00	1.00	1.0	11.00	1.00	1.07	0.96	224.44

Notice the scale differences: tempo ranges from roughly 60 to 200, while valence is bounded between 0 and 1. This mismatch will matter enormously when we get to clustering.

Let’s explore how these features relate to each other. We use a correlation for this, a correlation matrix reveals which dimensions move together and which carry independent information.

Code

corr = spotify_num.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    center=0,
    linewidths=0.5,
)
plt.title("Correlation matrix for Spotify audio features")
plt.tight_layout()

A few patterns stand out. energy and loudness are strongly positively correlated i.e., louder tracks tend to feel more intense. Meanwhile, features like danceability and valence are relatively independent, meaning they carry distinct information that K-Means can use to find finer-grained groupings.

Let’s visualise three of the most important audio dimensions in 3D to get a sense of whether natural groupings exist before the algorithm even runs.

Code

fig = px.scatter_3d(
    spotify,
    x="danceability",
    y="energy",
    z="valence",
    color="popularity",
    color_continuous_scale="Viridis",
    opacity=0.7,
    title="Songs in 3D audio feature space",
)
fig

The data does not form obvious, well-separated blobs and that’s normal. Real-world music is a mix of lost of genres, with various songs from different countries etc. When we look at the 3D image and rotate it you can see every song is a bit mixed it’s on this basis that the K-Means algorithm is used to will impose structure by finding regions of higher density and drawing boundaries between them.

The Algorithm

K-Means is an iterative algorithm that partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean (centroid). It operates in three steps:

Step 1: Initialise

Place k centroids at random positions in the feature space. These are the initial “guesses” for where the cluster centres might be.

Step 2: Assign

Calculate the Euclidean distance from every song to every centroid. Assign each song to the centroid it is closest to. After this step, every song has a cluster label.

Step 3: Update

Recalculate each centroid as the mean position of all songs assigned to it. The centroid physically moves to the centre of its cluster.

Repeat Steps 2 and 3 until the centroids stop moving (convergence). That’s it. Despite its simplicity, K-Means discovers structure that often aligns strikingly well with human intuition, grouping tracks together, separating for example high-energy bangers from moody songs, and isolating niche micro-genres of music.

But there is a catch. K-Means uses Euclidean distance, which means features with larger numerical ranges will dominate the calculation. Let’s demonstrate this by running K-Means on our raw, unscaled data.

Code

kmeans_raw = KMeans(random_state=42, n_init=10)
labels_raw = kmeans_raw.fit_predict(spotify_num)
raw_counts = pd.Series(labels_raw).value_counts().sort_index()
print(f"Raw inertia: {kmeans_raw.inertia_:,.0f}")
raw_counts

Raw inertia: 1,695,891

0    1197
1    1117
2     790
3    1551
4    1035
5    1112
6    1780
7    1418
Name: count, dtype: int64

The clusters are heavily imbalanced. Tempo (60–200) is drowning out valence (0–1). The algorithm is essentially clustering on tempo alone, which is not what we want. This is why scaling is non-negotiable.

Choosing K

The one hyper-parameter K-Means asks you to set upfront is k, which refers to the number of clusters that you suppose exist in your data. Choose too few and you lump dissimilar songs together. Choose too many and you shatter coherent groups into small fragmented songs. In order to solve for this problem a two standard techniques are normally used to help find the sweet spot.

The Elbow Method

Run K-Means for a range of k values and plot the total within-cluster sum of squares (inertia), a measure of how tightly packed each cluster is. Inertia drops sharply at first, then the curve bends and flattens. The “elbow” which is the point where adding another cluster gives diminishing returns is your target k.

Silhouette Score

The silhouette score measures how similar a song is to its own cluster versus the nearest neighbouring cluster. It ranges from −1 (wrong cluster) to +1 (perfectly placed). In this method we pick the k that maximises the average silhouette score.

Let’s run both diagnostics on the scaled features.

Code

scaler = StandardScaler()
spotify_scaled = scaler.fit_transform(spotify_num)

inertia = []
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(spotify_scaled)
    inertia.append(km.inertia_)
    sil_scores.append(silhouette_score(spotify_scaled, km.labels_))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(K_range, inertia, marker="o", color="#1DB954")
axes[0].set_xticks(range(2, 11))
axes[0].set_xlabel("Number of clusters (K)")
axes[0].set_ylabel("Inertia")
axes[0].set_title("Elbow Method")
axes[0].grid(alpha=0.3)

axes[1].plot(K_range, sil_scores, marker="o", color="#a855f7")
axes[1].set_xticks(range(2, 11))
axes[1].set_xlabel("Number of clusters (K)")
axes[1].set_ylabel("Silhouette Score")
axes[1].set_title("Silhouette Score")
axes[1].grid(alpha=0.3)

plt.tight_layout()

The elbow suggests that the marginal benefit of adding clusters starts to taper around k = 5 or 6. We choose K = 6, a number that also aligns with Spotify’s own convention of offering six personalised Daily Mixes. This gives us enough granularity to capture distinct groups without fragmenting the data into uninterpretable fragmented groups.

Building the Playlist

Now we assemble the full pipeline: scale, reduce, cluster, and generate playlists.

Scale Features

Standardisation transforms every feature to zero mean and unit variance. After scaling, a one-unit change in danceability carries the same weight as a one-unit change in tempo. We already computed spotify_scaled above now let’s verify the improvement by clustering it.

Code

kmeans_scaled = KMeans(n_clusters=6, random_state=42, n_init=10)
labels_scaled = kmeans_scaled.fit_predict(spotify_scaled)
scaled_counts = pd.Series(labels_scaled).value_counts().sort_index()
print(f"Scaled inertia: {kmeans_scaled.inertia_:,.0f}")
scaled_counts

Scaled inertia: 56,590

0    1750
1    1024
2    2588
3    1873
4    2074
5     691
Name: count, dtype: int64

Compare this to the raw clustering: the inertia is dramatically lower and the cluster sizes are far more balanced. Every feature now contributes proportionally to the distance calculation.

Reduce Dimensions with PCA

We now have 10 numeric features, which means each song lives as a point in a 10-dimensional space. That is difficult to reason about and contains redundant information, energy and loudness, for example, tend to move together, so they are partly telling us the same thing.

Principal Component Analysis (PCA) solves this by finding a smaller set of new axes called principal components that together capture as much of the original variation as possible, but without the redundancy.

Think of it like this. Imagine you have a cloud of points spread out in a room. Instead of describing every point using its original x, y, and z coordinates, PCA asks: what is the single direction in this room along which the points are most spread out? That becomes the first principal component. Then it finds the next most important direction that is perpendicular to the first, and so on. Each new axis is a blend of the original features, chosen to summarise the data as efficiently as possible.

For our songs, the first principal component might capture something like “overall sonic intensity” which is a combination of energy, loudness, and tempo. The second might capture “mood” a blend of valence and danceability. We do not define these manually; PCA discovers them from the data.

The practical benefit here is twofold: fewer, cleaner inputs for K-Means to work with, and the ability to visualise clusters in 2D or 3D rather than across 10 axes simultaneously.

Code

pca = PCA(n_components=3, random_state=42)
spotify_pca = pca.fit_transform(spotify_scaled)
explained_ratio = pca.explained_variance_ratio_
print("Explained variance ratio:", np.round(explained_ratio, 4))
print("Cumulative explained:", np.round(np.cumsum(explained_ratio), 4))

Explained variance ratio: [0.2409 0.1492 0.1364]
Cumulative explained: [0.2409 0.3901 0.5264]

The first three components together account for roughly 53% of the total variance in the dataset. We are compressing 10 dimensions down to 3 while retaining more than half the information, a worthwhile trade-off that will remove majority of the noise and give K-Means a cleaner data to work on.

A biplot is the standard way to see what PCA actually did. It overlays two things in a single chart: the songs as points in the new principal component space, and the original audio features as arrows. The direction an arrow points shows which part of the plot that feature pulls songs towards. The length of the arrow reflects how strongly that feature contributed to these two components. Arrows pointing in the same direction indicate features that tend to move together; arrows pointing in opposite directions indicate features that compete.

Code

# Full PCA on all 10 components to extract loadings
pca_full = PCA(random_state=42)
pca_full.fit(spotify_scaled)

# Project songs onto the first two components
scores = pca_full.transform(spotify_scaled)[:, :2]

# Loadings: how each original feature maps onto PC1 and PC2
loadings = pca_full.components_[:2].T
feature_names = spotify_num.columns.tolist()

fig, ax = plt.subplots(figsize=(10, 8))

# Random sample of songs as points to avoid overplotting
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(scores), size=2000, replace=False)
ax.scatter(
    scores[sample_idx, 0], scores[sample_idx, 1],
    alpha=0.18, s=8, color="#2980b9", zorder=1, rasterized=True,
)

# Scale arrows so the longest arrow tip
max_loading = np.sqrt((loadings ** 2).sum(axis=1)).max()
arrow_scale = 5.5 / max_loading
nudge = 1.05  # extra push beyond arrowhead so labels never sit on the tip

for i, feature in enumerate(feature_names):
    dx = loadings[i, 0] * arrow_scale
    dy = loadings[i, 1] * arrow_scale

    ax.annotate(
        "",
        xy=(dx, dy), xytext=(0, 0),
        arrowprops=dict(
            arrowstyle="-|>", color="#c0392b", lw=2.0, mutation_scale=14,
        ),
        zorder=4,
    )

    # Place label just beyond the arrowhead in the same radial direction
    length = np.sqrt(dx ** 2 + dy ** 2)
    lx = dx + (dx / length) * nudge
    ly = dy + (dy / length) * nudge

    ax.text(
        lx, ly, feature,
        fontsize=9.5, color="#922b21",
        ha="center", va="center", fontweight="bold",
        bbox=dict(
            facecolor="white", edgecolor="none",
            alpha=0.8, pad=2, boxstyle="round,pad=0.25",
        ),
        zorder=6,
    )

ax.axhline(0, color="#cccccc", lw=0.8, ls="--", zorder=0)
ax.axvline(0, color="#cccccc", lw=0.8, ls="--", zorder=0)
ax.set_xlim(-8.0, 9.5)
ax.set_ylim(-5.0, 10.5)
ax.set_xlabel("Principal Component 1", fontsize=11)
ax.set_ylabel("Principal Component 2", fontsize=11)
ax.set_title("PCA Biplot: songs and audio feature directions", fontsize=12, pad=12)
plt.tight_layout()

Reading the biplot confirms the intuition. energy and loudness point in nearly the same direction, which tells us they carry similar information where loud tracks are almost always energetic ones. valence and danceability point in a different direction, capturing a separate dimension of mood and rhythm. speechiness and liveness point roughly opposite to instrumentalness-adjacent features, meaning the algorithm separates vocal-heavy tracks from pure instrumentals along this axis. Each blue dot is a song; each red arrow is an audio feature. The further a song sits along the direction an arrow is pointing, the more of that quality it has , so a dot far to the right along the energy arrow belongs to a high-energy track.

Cluster

With scaled, PCA-transformed features, let’s run the final K-Means model.

Code

kmeans_final = KMeans(n_clusters=6, random_state=42, n_init=10)
labels_final = kmeans_final.fit_predict(spotify_pca)
spotify["cluster"] = labels_final

Profile the Clusters

Each cluster is a potential playlist. Let’s examine the average audio fingerprint of each group to understand what kind of music lives there.

Code

cluster_profile = (
    spotify
    .groupby("cluster")
    .agg(
        size=("cluster", "size"),
        popularity=("popularity", "mean"),
        danceability=("danceability", "mean"),
        energy=("energy", "mean"),
        valence=("valence", "mean"),
        tempo=("tempo", "mean"),
        loudness=("loudness", "mean"),
    )
    .round(2)
    .sort_index()
)
cluster_profile

	size	popularity	danceability	energy	valence	tempo	loudness
cluster
0	866	35.74	0.31	0.14	0.16	100.62	-20.91
1	2152	38.10	0.54	0.79	0.63	134.56	-6.64
2	1227	34.53	0.36	0.79	0.37	140.83	-7.52
3	2401	35.38	0.52	0.41	0.40	112.70	-11.81
4	2260	36.59	0.72	0.63	0.78	112.88	-9.05
5	1094	48.10	0.68	0.67	0.51	115.70	-7.13

The profile tells a story. Look for clusters with high energy and loudness, those are your workout mixes. Clusters with low energy and moderate valence are late-night wind-down playlists. High danceability with high valence? That’s your Friday night pre-game. The algorithm found these groupings purely from the geometry of the data with no help from genre labels, mood tags or any particular human curation.

Visualise the Clusters

Let’s confirm the clusters form coherent regions. First, a 2D view using the first two principal components.

Code

plot_df = pd.DataFrame(spotify_pca[:, :2], columns=["PC1", "PC2"])
plot_df["cluster"] = labels_final.astype(str)

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(
    data=plot_df,
    x="PC1",
    y="PC2",
    hue="cluster",
    palette="tab10",
    alpha=0.6,
    s=40,
    ax=ax,
    edgecolor="none",
)
ax.set_title("KMeans clusters in the first two principal components")
ax.legend(title="Cluster", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()

The clusters occupy distinct regions of the PCA space, with some overlap at the boundaries, exactly what we’d expect from continuous audio features rather than discrete categories.

Now let’s return to the original 3D audio feature space to confirm the groupings hold up in interpretable dimensions.

Code

fig = px.scatter_3d(
    spotify,
    x="danceability",
    y="energy",
    z="valence",
    color=spotify["cluster"].astype(str),
    size="popularity",
    opacity=0.7,
    title="KMeans clusters (K=6) in audio feature space",
)
fig

The clusters are visibly separable across danceability, energy, and valence, three dimensions that map directly to how listeners experience music.

Generate Playlists

Now we label each cluster based on its audio profile and extract a sample playlist from each. This is the payoff, coherent, themed track lists generated entirely by the algorithm.

Code

# Assign readable labels based on each cluster's audio signature
def label_cluster(row):
    if row["energy"] > 0.7 and row["loudness"] > -8:
        return "High-Energy Bangers"
    elif row["energy"] < 0.45 and row["valence"] < 0.35:
        return "Late-Night Chill"
    elif row["danceability"] > 0.7 and row["valence"] > 0.5:
        return "Dance Floor Fillers"
    elif row["valence"] > 0.6:
        return "Feel-Good Sunshine"
    elif row["tempo"] > 130 and row["energy"] > 0.5:
        return "Workout Fuel"
    else:
        return "Melancholy Mood"

cluster_labels = cluster_profile.apply(label_cluster, axis=1).to_dict()
spotify["playlist"] = spotify["cluster"].map(cluster_labels)

Sample Output from the Algorithm

Below is a sample of what the output looks like, two contrasting playlists the algorithm discovered from the data. For each, we show the top 5 tracks by popularity: the songs it would surface first for a listener whose taste matches that cluster’s sonic fingerprint. Notice how every track within a group feels cohesive, despite no genre label ever being used.

Late-Night Chill

Cluster 0 · 866 tracks

energy 0.14 dance 0.31 valence 0.16 101 bpm

1 Fly Me to the Moon The Macarons Project

2 Sea Change Stephan Moccio

3 Nepenthe Clyde Boudreaux

4 We Contain Multitudes (from home) Ólafur Arnalds

5 Med slutna ögon Najma Wallin

High-Energy Bangers

Cluster 1 · 2152 tracks

energy 0.79 dance 0.54 valence 0.63 135 bpm

1 Caramelo Ozuna

2 Se Me Olvidó Christian Nodal

3 Amor Tóxico Christian Nodal

4 Mad Love Mabel

5 это ли счастье? Rauf & Faik

The beauty of this approach is that genre never enters the equation. The algorithm groups tracks purely by how they sound, which means it can surface unexpected connections for example a folk track next to an ambient electronic song, all brought together based on their shared acousticness and low energy.

We started with a question; how does Spotify know what I want to hear? and arrived at a working answer: represent every song as a point in a multi-dimensional feature space, then let K-Means discover the natural groupings.

The algorithm is elegant in its simplicity and is able to deliver good results.