I've been attempting to build JARVIS from Iron Man in real life as a personal project. In the same way a cosplay fan might dress as Tony Stark, my inner engineer decided to go a slightly... different route. It's my belief that the Marvel films have inspired a whole new generation of engineers, computer scientists and 'builders', and it's a great honour to be one of them. Anyway, let's get into today's topic.

The most important part of any voice-activated assistant is voice recognition. This is different from voice detection, the technology I wrote about in the last post, because voice recognition’s job isn’t to detect that speech is present, but to work out which words were spoken. This process is often called speech-to-text (STT).

Training a neural network from scratch to understand human speech is possible, but it requires large amounts of data and training time to get anywhere close to a usable model. For this reason I chose to go with an out-of-the-box model. I wanted something reliable and open source that would run locally on a low-powered device such as a Raspberry Pi or a mobile phone. It also needed to run on the JARVIS server (something we’ll talk about in the next post).

I chose Mozilla DeepSpeech, which is trained on a huge open source dataset and can be fine-tuned to match my voice, as well as configured to increase or decrease confidence in certain words. It’s a truly impressive model that has captured the imagination of both data scientists and programmers.
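
As a rough illustration of the "increase or decrease confidence in certain words" part, here is a minimal sketch. It assumes the DeepSpeech 0.9+ Python API, where a `Model` has an `addHotWord(word, boost)` method and an external scorer is enabled. The helper name `apply_hot_words`, the word list and the boost values are all made up for the example, and the model path in the comments is a placeholder.

```python
# Hypothetical hot-word table: positive boosts make a word more likely to be
# chosen by the decoder, negative boosts make it less likely.
HOT_WORDS = {
    "jarvis": 10.0,   # assumed value, tune by experiment
    "jervis": -5.0,   # assumed value, discourages a common mishearing
}


def apply_hot_words(model, boosts):
    """Register every word/boost pair on a DeepSpeech-style model object."""
    for word, boost in boosts.items():
        model.addHotWord(word, boost)


# Usage with the real library (not executed here, paths are placeholders):
#   from deepspeech import Model
#   model = Model("path/to/model.pbmm")
#   model.enableExternalScorer("path/to/model.scorer")
#   apply_hot_words(model, HOT_WORDS)
```

Keeping the table separate from the model makes it easy to tweak boosts without touching the recognition code.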

Using the examples provided in the DeepSpeech GitHub repository, I was able to implement a VAD+STT pipeline (voice activity detection followed by speech-to-text) that runs locally on a Raspberry Pi or as a Python Flask server accepting WAV file uploads. The performance on my British English voice is impressive. I did, however, notice it made enough mistakes to raise concern. I needed a backup, something widely argued to be the most accurate speech recognition system available. That would be Google Cloud Speech.
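
One detail worth sketching for the WAV upload route is checking the audio format before it reaches the model. The snippet below is not the code from my server, just a minimal standard-library sketch. It assumes the model expects 16 kHz, mono, 16-bit PCM (which is what the released English DeepSpeech models are trained on), and the function names `read_wav_samples` and `make_silent_wav` are hypothetical.

```python
import io
import struct
import wave

# Assumed input format for the released English DeepSpeech models.
EXPECTED_RATE = 16000   # samples per second
EXPECTED_CHANNELS = 1   # mono
EXPECTED_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM


def read_wav_samples(wav_bytes):
    """Return the 16-bit samples of an uploaded WAV file, or raise ValueError."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError("expected %d Hz audio" % EXPECTED_RATE)
        if wav.getnchannels() != EXPECTED_CHANNELS:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != EXPECTED_WIDTH:
            raise ValueError("expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    # Unpack little-endian signed 16-bit integers.
    return list(struct.unpack("<%dh" % (len(frames) // 2), frames))


def make_silent_wav(num_samples=1600, rate=EXPECTED_RATE):
    """Build a small silent WAV in memory, handy for trying the check out."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(EXPECTED_CHANNELS)
        wav.setsampwidth(EXPECTED_WIDTH)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * num_samples)
    return buf.getvalue()
```

In a Flask handler the uploaded file's bytes would be passed to `read_wav_samples`, and a `ValueError` turned into a 400 response, before the samples are handed to the recogniser.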

Arguably, Google have an unfair advantage when it comes to data. They offer a powerful speech recognition API that accepts audio files or streams as input. It has a very high degree of accuracy, but it comes at a cost. Not only does it cost money, it also costs time. With an average latency of 1 second, it’s usable but only just.

For this reason I decided to give JARVIS voice recognition powered by the Google Cloud Speech API for server use only. I can upload WAV files from a lightweight client that perhaps doesn’t have enough processing power for DeepSpeech and still achieve fantastic results. I also plan to use it before processing harder NLP tasks such as named entity recognition. This way I can ensure that the text going into my language models is accurate.

The main front-end client for JARVIS will use DeepSpeech. Its job will be to turn the microphone stream into text and then pass that text to the main server. Running recognition on the client significantly reduces latency and gives a much more realistic conversational flow.
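
The split described above can be sketched in a few lines. This is only an illustration of the routing idea, not my actual client code: `route_audio`, `Transcript` and the two recogniser callables are hypothetical names, and the real DeepSpeech and Google Cloud Speech calls would sit behind `local_stt` and `cloud_stt`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    text: str
    engine: str  # "local" (e.g. DeepSpeech) or "cloud" (e.g. Google Cloud Speech)


def route_audio(
    audio: bytes,
    client_has_local_stt: bool,
    local_stt: Callable[[bytes], str],
    cloud_stt: Callable[[bytes], str],
) -> Transcript:
    """Pick a recogniser based on what the client is capable of.

    Capable clients transcribe on-device and only send text to the server.
    Lightweight clients upload the audio and the server uses the cloud API.
    """
    if client_has_local_stt:
        return Transcript(text=local_stt(audio), engine="local")
    return Transcript(text=cloud_stt(audio), engine="cloud")


# Example with stand-in recognisers:
result = route_audio(b"...", True, lambda a: "lights on", lambda a: "Lights on.")
print(result.engine, result.text)
```

Passing the recognisers in as plain callables keeps the routing logic easy to test without a model or an API key.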

So there you have it. I use Mozilla DeepSpeech and the Google Cloud Speech API to achieve fantastic speech recognition. There’s tonnes of documentation on how to use both. The Google Cloud Speech API is most likely the easiest to get started with, but DeepSpeech isn’t that far behind.

A few of you have reached out and asked for the JARVIS code. My codebase has, up until recently, been messy to say the least. My plan, however, is to create a repository on GitHub that contains the code from each of these articles. By the end, each building block of my JARVIS will be open source and tidy!

If you enjoyed this article, you can join the conversation over on Discord!