
We have ~30 audio snippets, around 50% of which are from the same speaker (our target speaker); the rest are from various other speakers. We want to extract all snippets from the target speaker, i.e. identify the voice that occurs most frequently and then select all audio samples with that voice.

For this purpose, we tried using the resemblyzer library to generate speaker-level embeddings from our audio samples, then clustered them and applied PCA to see whether we can detect any clusters:

from resemblyzer import VoiceEncoder
encoder = VoiceEncoder()

embeddings = []
for snippet in audio_snippets:  # audio_snippets is a list of numpy arrays containing our recordings
    embeddings.append(encoder.embed_utterance(snippet, return_partials=False))

from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8, linkage='ward')
labels = clustering.fit_predict(embeddings)
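
If this clustering were reliable, selecting the target speaker would then simply mean taking the largest cluster; a minimal sketch of that step (using the labels and audio_snippets from above):

import numpy as np

labels = np.asarray(labels)
majority_label = np.bincount(labels).argmax()  # the most frequently occurring cluster
target_snippets = [s for s, lab in zip(audio_snippets, labels) if lab == majority_label]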

pca = PCA(n_components=2)
X_pca = pca.fit_transform(embeddings)

plt.figure()
for i in range(len(embeddings)):
    x_val, y_val = X_pca[i, 0], X_pca[i, 1]
    plt.scatter(x_val, y_val, color=f"C{labels[i]}")
    plt.annotate(str(i), (x_val, y_val), xytext=(5, 2), textcoords="offset points")

plt.xlabel("PC1") plt.ylabel("PC2") plt.title("PCA of Speaker Embeddings (Top 2 Speakers/Video)") plt.show()

From this, we would expect one clear cluster of about 15 audio snippets, with the remaining ~15 loosely scattered. However, the snippets are not clustered by speaker: most clusters still contain recordings from other speakers, and the result is generally not very accurate:

[PCA scatter plot of the clustered embeddings, with snippets from different speakers mixed within clusters]

Is there a more effective way to accomplish this?


2 Answers


I think it would be useful to first visualise the embedding in order to see how good it is for telling different speakers apart.

My approach would be to use UMAP to project the embeddings down to 2D, and then scatter plot those results coloured by speaker.

Ideally, you would find that different colours (speakers) are generally separated from each other. If the colours overlap a lot (meaning that the embedding doesn't seem to separate different speakers), then you may need to reconsider how the embeddings are derived and/or whether the data's limitations are a problem.
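
As a rough sketch of that visualisation (this assumes the umap-learn package, plus a hypothetical speaker_labels array holding one known or manually assigned speaker id per snippet for the sanity check):

import numpy as np
import umap
import matplotlib.pyplot as plt

X = np.stack(embeddings)                      # shape: (n_snippets, embedding_dim)
X_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

plt.figure()
for spk in np.unique(speaker_labels):         # speaker_labels: hypothetical numpy array of per-snippet speaker ids
    mask = speaker_labels == spk
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f"speaker {spk}")
plt.legend()
plt.show()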

If the embeddings look good, then I would try using HDBSCAN to cluster the embeddings$^\dagger$. It is important to project the embeddings down to fewer dimensions before running clustering ($<50$ for HDBSCAN) - this can be done using UMAP(n_components=5, min_dist=0).


$\dagger$ HDBSCAN will return as many or as few clusters as it finds; you can't force it to a fixed number of clusters, so some post-processing would be required (see the sketch below). You could alternatively use a clustering algorithm that accepts an n_clusters= argument, like KMeans. It's worth trying a few different algorithms.
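
A minimal sketch of that pipeline, assuming the umap-learn and hdbscan packages (the parameter values are just starting points, not tuned for this data):

import numpy as np
import umap
import hdbscan

X = np.stack(embeddings)

# project down to a handful of dimensions before clustering, as suggested above
X_5d = umap.UMAP(n_components=5, min_dist=0.0, random_state=42).fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(X_5d)  # -1 marks noise points

# post-processing: take the largest non-noise cluster as the target speaker
target_cluster = np.bincount(labels[labels >= 0]).argmax()
target_snippets = [s for s, lab in zip(audio_snippets, labels) if lab == target_cluster]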


The solution was to first make sure you have clean audio files. For this purpose, the resemblyzer.preprocess_wav function is very useful. Resemblyzer also provides a clustering demo, which worked well in conjunction with the suggestions by @MuhammedYunus: https://github.com/resemble-ai/Resemblyzer/blob/master/demo04_clustering.py
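
For reference, a minimal sketch of the preprocessing step (the directory and file names are placeholders):

from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
wav_paths = sorted(Path("snippets").glob("*.wav"))      # placeholder directory of audio files
wavs = [preprocess_wav(p) for p in wav_paths]            # resamples, normalises volume, trims long silences
embeddings = [encoder.embed_utterance(w) for w in wavs]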

Combining these approaches generated a stable working result.
