
I'm not sure I can distinguish and understand the difference between:

  • VAD (Voice Activity Detection) and
  • Speaker Segmentation

I understand that:

  • VAD - splits the audio into segments that are either speech or not speech
  • Speaker Segmentation - splits the audio into segments that are either not speech or speech from a particular speaker

For example:

VAD                  = [not speech, speech, not speech, speech, not speech]

Speaker Segmentation = [not speech, speech, not speech, speech A, speech B, not speech]

Am I right?

Is my example correct?

user3668129

1 Answer


In Voice Activity Detection (VAD) there is no guarantee that the detected activity is actually speech, only that it is voice. For example, it may also trigger on non-speech vocal sounds such as singing or humming. Basic approaches, like energy-based VAD, may also easily trigger on music with harmonic content in the same frequency range as voice, such as violin or guitar. Some VADs are this simple because they are used as compute-efficient pre-processing steps.
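To illustrate why an energy-based VAD fires on things that are not voice, here is a minimal sketch. The function name `energy_vad`, the frame length and the threshold are my own illustrative choices, not taken from any particular library, and real VADs usually add smoothing and adaptive thresholds on top of this.

```python
import math

def energy_vad(samples, frame_len=400, threshold_db=-35.0):
    """Flag each non-overlapping frame as active if its RMS level exceeds a threshold.

    samples: floats in [-1.0, 1.0]. frame_len and threshold_db are
    illustrative values (assumptions), not tuned defaults.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Clamp to avoid log10(0) on digital silence.
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        flags.append(level_db > threshold_db)
    return flags

# Synthetic input at 16 kHz: silence, a pure 440 Hz tone, silence.
# The tone is not voice, yet it carries plenty of energy.
sr = 16000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(800)]

flags = energy_vad(silence + tone + silence)
print(flags)  # [False, False, True, True, False, False]
```

The two middle frames are flagged as active even though they contain only a sine tone, which is the kind of false trigger described above.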

Speech Segmentation is a slightly stricter task formulation, which may aim to avoid these problems that VAD has.

Speaker Segmentation does not have an explicit "speech" class, although one could be synthesized as speech = any(speakerA, speakerB, ...).
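The any(...) idea can be sketched as a frame-wise OR over per-speaker activity tracks. The per-frame boolean representation and the name `speech_from_speakers` are assumptions made for illustration; actual speaker segmentation tools may use time intervals or score matrices instead.

```python
def speech_from_speakers(speaker_activity):
    """Derive a speech/no-speech track from per-speaker activity tracks.

    speaker_activity maps a speaker label to a list of per-frame booleans
    (a hypothetical representation chosen for this sketch).
    A frame counts as speech if at least one speaker is active in it.
    """
    tracks = list(speaker_activity.values())
    return [any(frame) for frame in zip(*tracks)]

speaker_activity = {
    "speakerA": [True, True, False, False, False],
    "speakerB": [False, True, False, True, False],
}

speech = speech_from_speakers(speaker_activity)
print(speech)  # [True, True, False, True, False]
```

Note that the second frame has both speakers active at once. Speaker segmentation can represent such overlap, while the derived speech track collapses it into a single "speech" label.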

So I would adjust your example slightly to become:

VAD                  = [other, voice, other, voice, other]
Speaker Segmentation = [speakerA, no-speaker, speakerB, no-speaker]
Jon Nordby