Can I create a good Speech Recognition Engine while having millions of recorded conversations?

Question

I have millions of wav files at my disposal containing recorded conversations between employees and clients, and I'm researching the possibility of building a good speech recognition engine. I've tested Google's Speech-to-Text and it's great. Is it possible to create something similar? (Of course, no one can beat the quality and quantity of the data Google has, but how close can one get?) Also, what are the technical limits (such as the hardware needed for this kind of training), and how long should it take to achieve?

Note: I'm a beginner in ML. So far I've done some binary and multiclass classification, and I have an idea of how neural networks work but haven't built any. The simpler the answer, the easier it is for me to understand. Thanks!

score 3 · Accepted Answer · edited Apr 12 '19 at 10:08

Yes, having lots of recorded conversations is great for building a speech recognition system. You will still have to create labeled training samples (each sample maps a part of a wav file to its text), but you will need fewer of them, since the unlabeled recordings can be used for pretraining.
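To make "part of a wav file --> text" concrete, here is a minimal sketch that cuts one recording into labeled samples using only Python's standard `wave` module. The `make_samples` helper and the `(start_seconds, end_seconds, transcript)` segment layout are my own assumptions for illustration; in practice the timestamps and transcripts would come from whoever (or whatever tool) labels your recordings.

```python
import io
import wave


def make_samples(wav_file, segments):
    """Cut one recording into (audio, text) training samples.

    wav_file: path or binary file object of a PCM wav recording.
    segments: iterable of (start_seconds, end_seconds, transcript);
              this layout is an assumption, not a standard format.
    """
    samples = []
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        for start_s, end_s, text in segments:
            wav.setpos(int(start_s * rate))  # jump to the segment start
            frames = wav.readframes(int((end_s - start_s) * rate))
            samples.append({"audio": frames, "sample_rate": rate, "text": text})
    return samples


# Demo on one second of synthetic silence (8 kHz, 16-bit mono),
# standing in for a real conversation recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 8000)
buf.seek(0)

demo = make_samples(buf, [(0.0, 0.5, "hello"), (0.5, 1.0, "how can I help you")])
print([(len(s["audio"]), s["text"]) for s in demo])
```

Each resulting sample pairs a short audio chunk with its transcript, which is the form the labeled training step expects.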

The high-level steps are:

1. Train a GAN on the raw audio.
2. Train a language model on raw text data (it need not come from these conversations, but it has to be from the same domain). For example, if the conversations are medical, train the language model on medical text.
3. Merge these models and train on the labeled samples.
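Since you asked for simple, here is a toy sketch of what a domain language model buys you. It trains a bigram model with add-one smoothing on a few medical-sounding sentences and uses it to choose between two similar-sounding transcripts. Note that this combines scores by plain weighted addition (rescoring) rather than the joint training described in the last step, and the acoustic scores, the `lm_weight` value, and all function names are made up for illustration.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def lm_log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def rescore(candidates, unigrams, bigrams, lm_weight=0.5):
    """Return the transcript with the best acoustic + weighted LM score."""
    return max(
        candidates,
        key=lambda c: c[1] + lm_weight * lm_log_prob(c[0], unigrams, bigrams),
    )[0]


# Tiny stand-in for a domain text corpus.
domain_text = [
    "the patient needs a new prescription",
    "the doctor wrote a prescription",
    "the patient called the doctor",
]
unigrams, bigrams = train_bigram_lm(domain_text)

# Hypothetical acoustic log-scores for two similar-sounding transcripts.
candidates = [
    ("the patient needs a new prescription", -4.1),
    ("the patient kneads a new prescription", -4.0),
]
best = rescore(candidates, unigrams, bigrams)
print(best)
```

The acoustic scores slightly favor the wrong transcript here, and the language model is what tips the decision toward wording that actually occurs in the domain text. A real system would use a neural language model and far more text, but the role it plays is the same.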

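For step 1, Google's WaveNet is a good example of a generative model trained on raw audio. Note that it is an autoregressive model rather than a GAN; it is best known for text-to-speech, but the WaveNet paper also reports speech recognition experiments.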

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Papers that cover the design and overall approach:

https://arxiv.org/abs/1711.01567
https://arxiv.org/abs/1803.10132
