Week 2
ASR SDKs
With the first week’s orientation and development environment settled, my co-intern and I continued working on the Automatic Speech Recognition (ASR) assets, which included using the vendors’ Software Development Kits (SDKs) to install both JavaScript (JS) and command-line (CLI) samples on our development server. I started the week focused on navigating the IBM Cloud (IBMC) console interface with a few audio files. At the time, I did not have any audio files to use with IBMC ASR, so I had to hunt for sample audio files online and make sure they were open source with no licensing issues.
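To give a sense of what those JS samples look like, here is a minimal sketch based on the IBM Watson Speech to Text Node SDK’s published examples, not our exact script; the API key, service URL, and file path are placeholders.

```js
// Minimal sketch: transcribe a local WAV file with the IBM Watson
// Speech to Text Node SDK. Assumes `npm install ibm-watson` and
// placeholder credentials/paths.
const fs = require('fs');
const SpeechToTextV1 = require('ibm-watson/speech-to-text/v1');
const { IamAuthenticator } = require('ibm-watson/auth');

const speechToText = new SpeechToTextV1({
  authenticator: new IamAuthenticator({ apikey: 'YOUR_API_KEY' }), // placeholder
  serviceUrl: 'YOUR_SERVICE_URL',                                  // placeholder
});

speechToText
  .recognize({
    audio: fs.createReadStream('samples/short/hello.wav'), // placeholder path
    contentType: 'audio/wav',
  })
  .then(({ result }) => {
    // Print the best transcript for each recognized segment.
    result.results.forEach((r) => console.log(r.alternatives[0].transcript));
  })
  .catch((err) => console.error(err));
```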
Sample audios
As the hunt continued, we learned a great deal about how audio files work, especially their containers and streams. Thanks to our mentor, Dr. Raja Kushalnagar, and another co-worker on the project team for supplying the information we needed to work with the sample audio files successfully. When I found a sample folder of various audio files, I couldn’t believe the total size was 34 gigabytes! Yes, thirty-four gigabytes worth of audio in one folder! That was a shock, and it made me realize we should categorize our audio files into three groups: Short, Medium, and Long.
Audio categories
The Short category consists of audio files under 1 minute in length; Medium covers 1 to 5 minutes; and Long is anything over 5 minutes. Even in a 34 GB sample folder, it wasn’t difficult to find audio files under a minute long, but medium and long open-source samples were much harder to find online. Thankfully, one of the co-workers on the team said he might be able to get internal audio files from previous projects at Gallaudet University; he asked for permission, and it was granted. The lengths of these recordings were exactly what we needed for our project. Each recording was a single file with two microphone inputs, so we still needed a way to split it into mono channels. We learned how to use Audacity to do that split, and the audio samples were then perfect for our development and testing purposes.
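To keep the cutoffs consistent, something like the tiny helper below captures them in code; this is a hypothetical sketch, not our actual tooling, and the duration would come from whatever metadata reader you prefer.

```js
// Hypothetical sketch: bucket an audio file by duration (in seconds)
// using the Short / Medium / Long cutoffs described above.
function categorizeByDuration(durationSeconds) {
  if (durationSeconds < 60) return 'Short';    // under 1 minute
  if (durationSeconds <= 300) return 'Medium'; // 1 to 5 minutes
  return 'Long';                               // longer than 5 minutes
}

console.log(categorizeByDuration(45));  // "Short"
console.log(categorizeByDuration(180)); // "Medium"
console.log(categorizeByDuration(600)); // "Long"
```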
WebRTC
In our project, the main goal is to provide automatic captions on a video conferencing platform. Many video conferencing platforms out there are proprietary, which also means they’re closed source, so we couldn’t just take a codebase and modify it to fit our project’s purposes. That’s the first reason we started working with WebRTC: it’s an open-source video technology that’s flexible enough to serve many needs, not just video conferencing. With the WebRTC code on our server, we use Node.js to serve it and run it in the browser, where it behaves like a fully functioning application inside the web browser client. Working with web browsers is another reason we work with JavaScript so heavily in our project. We made some progress learning how to set up a WebRTC server with another open-source platform called “EasyRTC”.
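As a rough illustration of how much of the work happens in the browser with WebRTC, here is the kind of client-side JS that captures the local camera and microphone; the element ID is a placeholder, and this is only a sketch, not our project’s actual code.

```js
// Browser-side sketch: capture the local camera and microphone with WebRTC's
// getUserMedia API and show the stream in a <video id="localVideo"> element.
async function startLocalStream() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  document.getElementById('localVideo').srcObject = stream;
  return stream; // this stream is what later gets sent over a peer connection
}

startLocalStream().catch((err) => console.error('Could not start camera/mic:', err));
```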
EasyRTC
It’s basically WebRTC with easy implementations and setup, so we didn’t have to code everything from scratch to get working video and audio. This is especially important since we only have 10 weeks in our internship; we can’t waste time building from scratch when we can take open-source software and modify it to meet our needs. With EasyRTC set up, we started looking into modifying the code to fit what we want for the project. Unfortunately, I had tough luck getting some things, like the CSS and other static files, to work properly under a Node.js environment. I only have a basic knowledge of how Node.js works, so I had no luck getting things to work the way we wanted. We started to shift our focus onto the WebRTC platform and work from there, since some static files could be loaded through it. This was a wall I faced for two days, trying to get Node.js + Express.js + CSS all to work together, and unfortunately it didn’t all work out. Gratefully, we were still able to do it all successfully on the WebRTC platform.
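For anyone hitting the same wall, this is roughly the shape of the Node.js + Express setup we were aiming for, modeled on EasyRTC’s published server example; the folder name and port are assumptions, and this is not the exact configuration that failed for us.

```js
// Sketch of a Node.js + Express + EasyRTC server that also serves static files
// (CSS, client JS) from a ./static folder. Folder name and port are assumptions.
const express = require('express');
const http = require('http');
const socketIo = require('socket.io');
const easyrtc = require('easyrtc');

const app = express();
app.use(express.static(__dirname + '/static')); // serve CSS and other static assets

const webServer = http.createServer(app);
const socketServer = socketIo.listen(webServer, { 'log level': 1 });

// Hand the Express app and socket.io server to EasyRTC for signaling.
easyrtc.listen(app, socketServer);

webServer.listen(8080, () => console.log('EasyRTC server listening on :8080'));
```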
Streams + transcripts
Our next focus was getting streams up and running. The problem we had to solve was that the stream couldn’t simply be live: for a properly controlled experiment, every ASR engine needs to receive the same video/audio files so we can compare the results of each engine and gather data. Emelia, my co-intern, spent most of her time getting the virtual webcam stream up and running, whereas I focused on getting a transcript page working to show the text transcribed by the ASR engine. With both of those successful after a couple of days, we started studying the ASRs and their Node.js counterparts to see how we could get them working seamlessly with the WebRTC platform. After studying and understanding some of the implementations, Emelia worked on getting some Node.js scripts to run properly in a CLI environment with the sample audio file, while I continued to work on the transcript webpage and modify a few things. Towards the end of the week, we shifted our focus to IBMC and Google Cloud Platform (GCP), since Microsoft Azure (AZ) worked out of the box for what we needed to see in a console environment. IBMC and GCP didn’t work out of the box, but we are pretty sure they have the features in their APIs; we just need to figure out how to get them working similarly to the AZ console environment. We are going to resume this focus next week with the GCP and IBMC environments.
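To show the general idea behind the transcript page, here is a browser-side sketch that appends caption text pushed from the server over socket.io; the 'transcript' event name and element ID are hypothetical, not the project’s actual names.

```js
// Browser-side sketch of a transcript page: listen for caption text pushed
// from the server over socket.io and append each line to the page.
const socket = io(); // assumes the socket.io client script is already loaded

socket.on('transcript', (data) => {
  const line = document.createElement('p');
  line.textContent = data.text; // recognized text from the ASR engine
  document.getElementById('transcript').appendChild(line);
});
```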