Week 3

Third week already?

I cannot believe how fast time has flown in the last three weeks of my REU internship! I am more than grateful for this fantastic experience. The work I have done has been fun and I really did learn a lot, but there’s still a lot to learn, especially with the cloud platforms, nodejs, and the various services that depend on them.


Automatic Speech Recognition (ASR) seems like a simple service if you think about it. Before the REU internship, I kind of knew that cloud services can be simple and complicated at the same time, but I didn’t realize there is a lot more to them than I thought. In our project, we have three primary ASR engines to work with: Microsoft Azure (AZ), Google Cloud Platform (GCP), and IBM Cloud (IBMC). In the ASR area specifically, MS Azure has done a wonderful job of making the user experience and the cloud systems administration easy and convenient to use.


My co-intern, Emelia, was assigned the task of getting AZ up and running. With that said, I don’t have the experience she has building a Speech-to-Text platform from scratch with AZ, but from my observation, AZ is definitely the most convenient ASR to use. The specific reason is that MS has already built a Software Development Kit (SDK) that can be packaged with nodejs and comes with sample code. Their nodejs library has an SDK JavaScript (JS) file with more than 15K lines of code. I know this for a fact because I was assigned to work with the GCP ASR. Don’t get me wrong, all three ASRs have wonderful documentation and plenty of notes on using their ASR Application Programming Interfaces (APIs). But GCP and IBMC can be tricky to work with if you want to code both server-side and client-side video conferencing with captioning using their SDKs. They do have SDKs, but unlike MS AZ’s, they are not packaged and ready to use. AZ has multiple demos we can try to get an idea of how it all works, and one of those demos is very close to what we are working on. GCP and IBMC have all the documentation and many code samples, but none of their samples are similar to what we are working on. This has proven more difficult than we imagined.
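To give a flavor of how little code the AZ SDK asks for, here is a minimal sketch of one-shot recognition with the microsoft-cognitiveservices-speech-sdk nodejs package. This is my own sketch, not our project’s code: the key and region are placeholders, the `recognizeOnce` name is mine, and the require is guarded so the file still loads on a machine without the SDK installed.

```javascript
// Minimal sketch of one-shot speech-to-text with the Azure Speech SDK.
// The subscription key and region are placeholders, not real credentials.
function recognizeOnce(subscriptionKey, region, onText) {
  let sdk;
  try {
    // Guarded so this file still loads where the SDK isn't installed.
    sdk = require('microsoft-cognitiveservices-speech-sdk');
  } catch (err) {
    return null;
  }

  const speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, region);
  speechConfig.speechRecognitionLanguage = 'en-US';

  // Default microphone input (a browser feature; on a server you would
  // feed the recognizer a WAV file or a push stream instead).
  const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
  const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

  // Recognize a single utterance and hand the transcript to the caller.
  recognizer.recognizeOnceAsync((result) => {
    onText(result.text);
    recognizer.close();
  });
  return recognizer;
}
```

The point of the comparison stands even in a sketch: the whole pipeline is config, audio source, recognizer, done.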

Modifying ASR samples/WebRTC samples

So with our basic knowledge of nodejs, various libraries like expressjs and socket, and the ASR APIs’ sample code, we started to realize that we needed to learn more of the basics, both in general areas and in the specific libraries, in order to understand how it all works as a whole. It’s tough to do this while we are researching and trying out various code samples, snippets, and things that we think would benefit our project’s goals. Especially when all we do is stitch code snippets together, knowing each is there for a purpose, but when it does not work, we can’t figure out what has gone wrong.


So with what I said in the previous paragraph in mind, I worked on the GCP ASR for two days this week, trying to implement the Web Speech API in the browser. The Web Speech API uses Google’s ASR as its backend engine. This is especially convenient because Google Chrome supports the Web Speech API straight out of the box. We suspect that the other browsers will support this feature in the near future. As I mentioned a couple paragraphs ago, it’s tough to work with cloud platforms when you do not know what you’re doing. One of the biggest problems I faced with the GCP ASR was authentication itself, as this is my first time venturing into a project that depends on both the server where our project resides AND the cloud API, which resides somewhere in the Eastern US. The Cloud API in general provides us access to the ASR, API keys, and other services. Getting an API key to work from a source file can be daunting at first. This was a struggle for me, but I eventually broke through and got it all working.
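For a concrete picture of the browser side, here is a minimal sketch of streaming captions with the Web Speech API. One nice property: unlike the GCP Cloud Speech API (whose client libraries typically authenticate via a service-account key, e.g. through the GOOGLE_APPLICATION_CREDENTIALS environment variable), this path needs no API key at all, because Chrome talks to Google’s ASR itself. The `startCaptions` name is my own, and the guard lets the file load outside a browser without crashing.

```javascript
// Minimal sketch of browser-side captioning with the Web Speech API.
// Chrome exposes it as the prefixed webkitSpeechRecognition and sends
// the audio to Google's ASR behind the scenes.
function startCaptions(onCaption) {
  // Guard so this file can also be loaded outside a browser.
  const Recognition =
    typeof window !== 'undefined' &&
    (window.SpeechRecognition || window.webkitSpeechRecognition);
  if (!Recognition) return null;

  const rec = new Recognition();
  rec.continuous = true;     // keep listening across pauses
  rec.interimResults = true; // stream partial hypotheses as the user talks
  rec.lang = 'en-US';

  rec.onresult = (event) => {
    // Concatenate the transcript pieces from this batch of results.
    let text = '';
    for (let i = event.resultIndex; i < event.results.length; i++) {
      text += event.results[i][0].transcript;
    }
    onCaption(text); // e.g. render into the caption <div>
  };

  rec.start(); // Chrome will prompt for microphone permission here
  return rec;
}
```

In a page you would call `startCaptions(text => captionDiv.textContent = text)` and Chrome handles the rest.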

The Web Speech API has ONE con

With Google authentication finally working for a nodejs backend that uses the Web Speech API, I broke through one obstacle and got hit with another. This time around, it is the HTTP and HTTPS protocols. I stumbled on this same problem last week without realizing that I would bump into this situation not once, but twice! The thing with the Web Speech API in a Chrome-based browser is that it requires the HTTPS protocol. If you initialize a nodejs server, access it via a Chrome browser, and want to use your microphone as an input, Chrome rejects the request if the protocol is HTTP and accepts it if it is HTTPS. In theory, this stops hackers on the same WiFi network from snooping on you. If you used WebRTC or the Web Speech API on a public WiFi network over plain HTTP, hackers could listen in, and that’s why the Web Speech API requires HTTPS. So I spent some time learning how to implement HTTPS in a nodejs server that uses the Web Speech API. It is still an ongoing process that I’ll resume in the fourth week.

Written on May 25, 2020