I have millions of WAV files containing recorded conversations between employees and clients, and I'm researching whether I could build a good speech recognition engine from them. I've tested Google's Speech-to-Text and it's great. Is it possible to create something similar? (Of course, no one can beat the quality and quantity of the data Google has, but how close can one get?) Also, what are the technical limits (such as the hardware needed for this kind of training), and how long should it take to achieve?
Note: I'm a beginner in ML. So far I've done some binary and multiclass classification, and I have a basic idea of neural networks but haven't done any work with them. The simpler the answer, the easier it will be for me to understand. Thanks!