What is the difference between VAD and Speaker Segmentation?

Question

I'm not sure I can distinguish and understand the difference between:

VAD (Voice Activity Detection) and
Speaker Segmentation

I understand that:

VAD - splits audio into segments that are speech or not speech
Speaker Segmentation - splits audio into segments that are not speech or belong to different speakers

for example:

VAD                  = [not speech, speech, not speech, speech,             not speech]
Speaker Segmentation = [not speech, speech, not speech, speech A, speech B, not speech]

Am I right?

Is my example correct?

score 1 · Answer 1 · answered Apr 07 '23 at 07:19

In Voice Activity Detection (VAD) there is no guarantee that the detected activity is actually speech, only that it is voice. For example, it may also trigger on non-speech vocal sounds such as singing or humming. Basic approaches, like energy-based VAD, may also easily trigger on music with harmonic content in the same frequency range as voice, such as violin or guitar. Some VADs are this simple because they are used as compute-efficient pre-processing steps.
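To illustrate why a simple energy-based VAD can fire on non-voice audio, here is a minimal sketch. The function name `energy_vad`, the frame length, and the -35 dBFS threshold are all assumptions picked for illustration, not values from any particular library. It marks a frame as active whenever its RMS level exceeds the threshold, so a pure 440 Hz tone is flagged even though it contains no voice at all.

```python
import math

def energy_vad(samples, frame_len=160, threshold_db=-35.0):
    """Return one True/False decision per frame (True = "active").

    samples: floats in [-1.0, 1.0].
    frame_len and threshold_db are illustrative, hypothetical defaults.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square level of the frame, converted to dB full scale.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        decisions.append(level_db > threshold_db)
    return decisions

# A 440 Hz tone (not voice) between two stretches of silence, sampled at 16 kHz.
sr = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(1600)]
flags = energy_vad(silence + tone + silence)
```

Here `flags` is inactive for the silent frames and active for every frame of the tone, which is the kind of false trigger described above. A stricter detector would need features beyond energy to tell voice from a musical instrument.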

Speech Segmentation is a slightly stricter task formulation, which may aim to avoid these problems with VAD.

Speaker Segmentation does not have an explicit "speech" class, although one could be synthesized as speech = any(speakerA, speakerB, ...).
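As a rough sketch of synthesizing a "speech" track from speaker segmentation output, assume the output is available as one boolean activity list per speaker, all of the same length (this data layout and the helper name `speech_from_speakers` are assumptions for illustration). The speech track is then a frame-wise OR over the speakers:

```python
def speech_from_speakers(speaker_activity):
    """Frame-wise OR across per-speaker activity tracks.

    speaker_activity: dict mapping a speaker label to a list of booleans,
    one per frame (a hypothetical layout, not a specific library's format).
    """
    tracks = list(speaker_activity.values())
    # zip(*tracks) yields, for each frame, the tuple of all speakers' flags.
    return [any(frame) for frame in zip(*tracks)]

speaker_activity = {
    "speakerA": [True, True, False, False, False],
    "speakerB": [False, False, False, True, False],
}
speech = speech_from_speakers(speaker_activity)
```

A frame counts as speech as soon as at least one speaker is active in it, so overlapping speakers still yield a single speech region.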

So I would adjust your example slightly to become:

VAD = [other, voice, other, voice, other]
Speaker Segmentation = [speakerA, no-speaker, speakerB, no-speaker]
