Week 6

WebRTC + Web Speech API

In the past couple of weeks, I spent some time getting the transcript page working with the Web Speech API. It went smoothly because there were guides I could follow, and the couple of obstacles I ran into were easy to overcome. With that bit of understanding of the Web Speech API, it was time for me to implement it in a WebRTC client. In short, WebRTC offers many different methods of communicating online, whether via video, audio, or instant messages.
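
For context, this is roughly what the Web Speech API setup on the transcript page looks like. It is a minimal sketch, and the handler here only logs what it hears; it is not our actual code.

    // Minimal sketch of setting up a Web Speech API recognizer in the browser.
    const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const recognition = new SpeechRecognition();
    recognition.continuous = true;      // keep listening instead of stopping after one phrase
    recognition.interimResults = true;  // surface partial results while the speaker is still talking
    recognition.lang = 'en-US';

    recognition.onresult = (event) => {
      // Each result holds one or more alternatives; take the top one.
      const latest = event.results[event.results.length - 1];
      console.log(latest[0].transcript, latest.isFinal ? '(final)' : '(interim)');
    };

    recognition.start();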

In our project, we focus on the video/audio side of WebRTC, using the custom WebRTC implementation that Norman worked on to display Real-Time Text (RTT) as a proof of concept. Getting RTT to work with WebRTC is complicated on its own, with a lot of mathematical calculations for positioning and sizing the caption bar. This is mainly because the videos are responsive and resize themselves based on the display resolution and screen size (mobile phones, desktops, and laptops, for example). Thankfully, we do not have to go through all of that trial and error ourselves just to get it right.
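
To give a feel for the kind of math involved, here is a simplified illustration of sizing a caption bar against a responsive video element. This is my own stand-in example, not Norman's implementation, and the element ids are placeholders.

    // Illustration only: sizing and positioning a caption bar over a responsive
    // <video> element. It assumes the caption bar uses position: fixed, so the
    // viewport coordinates from getBoundingClientRect() apply directly.
    function positionCaptionBar(video, captionBar) {
      const rect = video.getBoundingClientRect();           // rendered size of the video right now
      captionBar.style.width = rect.width + 'px';           // match the video width
      captionBar.style.left = rect.left + 'px';
      captionBar.style.top = (rect.bottom - rect.height * 0.15) + 'px';    // sit in the bottom ~15% of the video
      captionBar.style.fontSize = Math.max(16, rect.height * 0.05) + 'px'; // scale the text with the video
    }

    // Recompute whenever the layout changes (window resize on desktop, rotation on mobile).
    window.addEventListener('resize', () => {
      positionCaptionBar(document.querySelector('video'), document.getElementById('caption-bar'));
    });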

With that taken care of, Emelia and I could start focusing on getting our ASR engines (Web Speech API for me, MS Azure for her) to display the text they recognize from speech in the same area that RTT uses. This saves us a lot of time and lets us focus on the functionality of the ASRs. Said aloud, it sounds easy to get a WebRTC ASR up and running inside the RTT caption bar. It proved more difficult than I thought. Since Emelia had already gotten her ASR engine working inside the RTT caption bar, I studied her code to see what I could reuse and what I would have to do differently, because the two ASR engines are completely different. MS Azure seemed to work right out of the box; it did not take Emelia nearly as much time to get it working properly as the Web Speech API took me. Because the Web Speech API works differently from MS Azure, some bits of the MS AZ code would not work with it, and I had to customize the code to fit the Web Speech API's needs.
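
Roughly, routing the recognized text into the caption area looks like the sketch below, building on the recognition object from the earlier snippet. The updateCaptionBar() helper and the 'rtt-caption' id are placeholders, not the real identifiers in our client.

    // Rough sketch of feeding Web Speech API results into the caption area that RTT already uses.
    recognition.onresult = (event) => {
      let finalText = '';
      let interimText = '';
      for (let i = event.resultIndex; i < event.results.length; i++) {
        const result = event.results[i];
        if (result.isFinal) {
          finalText += result[0].transcript;
        } else {
          interimText += result[0].transcript;
        }
      }
      // The Azure Speech SDK delivers separate recognizing/recognized events instead,
      // so this per-result loop is one of the pieces I had to rewrite for Web Speech.
      updateCaptionBar(finalText, interimText);
    };

    function updateCaptionBar(finalText, interimText) {
      document.getElementById('rtt-caption').textContent = finalText + interimText;
    }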

It took me almost three work days to finally get it up and running. Along the way I hit some bugs, and the most notable one (thanks to Norman for finding it) was an invisible HTML element. If you read the source code, you would see the transcript div element with nested textarea, p, and various other elements inside it. But if you opened the page and inspected it with the browser's built-in developer tools, you would see only the transcript div element and NOTHING inside it; that empty div is what the browser had been interpreting the whole time. Once I finally caught this bug, everything worked properly.
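
A quick check in the developer console is enough to see that kind of mismatch between the source file and the live DOM; the 'transcript' id here is a placeholder for the real one.

    // The nested elements existed in the source file, but the live DOM had nothing inside the div.
    const container = document.getElementById('transcript');
    console.log(container.children.length);   // expected the nested textarea/p elements, got 0
    console.log(container.innerHTML);         // empty string, confirming nothing was rendered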

ASR switching

ASR switching is a concept that isn't widely adopted. The main objective is to switch between ASRs; there are a few real-world examples out there, but they are not as convenient and require additional steps to switch the ASR. In our WebRTC project, we already have a hamburger menu in place in the top-left corner of the web page, where hamburger menus are commonly located across web and mobile applications. Within the hamburger menu there are various options such as “Leave the room” (leave the video conference room), “Reset settings” (clear the cache, after which the browser asks for camera and microphone permissions again), and a couple of other things that are irrelevant to our project.

Emelia and I have added our buttons to the hamburger menu: a transcript pop-up button (Emelia’s task) and an ASR switching button (my task). At this point I knew the Web Speech API fairly well, but not much about MS Azure, which Emelia had mostly worked on. I decided I was not going to learn everything MS Azure can do, so I cloned her directory and worked off it, merging it to contain both the MS AZ ASR and the Web Speech ASR. Getting the two into the same directory was easy. The tough part was the button event listeners and actually getting the functions to work. I encountered several bugs, but most were fixed easily; Norman helped me troubleshoot a couple of them over a couple of hours. It took me about two days to get it all up and running. Now there are three buttons in the hamburger menu: Switch to AZ, Switch to Web Speech, and Stop all ASRs. They all work.
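
The wiring behind those three buttons looks roughly like the sketch below. The button ids, the helper names, and the commented-out Azure calls are placeholders rather than the exact identifiers in our merged directory.

    // Sketch of the ASR switching wiring behind the three hamburger-menu buttons.
    const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const webSpeech = new SpeechRecognition();
    webSpeech.continuous = true;

    function startWebSpeechAsr() { webSpeech.start(); }
    function stopWebSpeechAsr()  { webSpeech.stop(); }

    // Stubs standing in for Emelia's MS Azure Speech SDK calls.
    function startAzureAsr() { /* azureRecognizer.startContinuousRecognitionAsync(); */ }
    function stopAzureAsr()  { /* azureRecognizer.stopContinuousRecognitionAsync(); */ }

    document.getElementById('switch-to-azure').addEventListener('click', () => {
      stopWebSpeechAsr();   // only one engine should be listening at a time
      startAzureAsr();
    });

    document.getElementById('switch-to-webspeech').addEventListener('click', () => {
      stopAzureAsr();
      startWebSpeechAsr();
    });

    document.getElementById('stop-all-asr').addEventListener('click', () => {
      stopAzureAsr();
      stopWebSpeechAsr();
    });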

Web Speech API transcript pop-up

With ASR switching set aside, it was time for me to implement the transcript pop-up the way Emelia had it working with the MS AZ ASR. Again, since AZ and Web Speech are completely different ASR engines, I am working through her transcript code and trying to get it to work properly with Web Speech. It works well with the AZ ASR engine, but unfortunately it still doesn’t work for the Web Speech engine. I now have a bug to work around, and so far I have not found a fix for it. Hopefully I’ll find one next week with Norman’s help.

Written on June 15, 2020