Speech-to-Text by Google Cloud
MIXI, Inc. (MIXI) is a social networking company that offers a variety of services for friends and family to enjoy together, including the social media platform mixi, the mobile game Monster Strike, and FamilyAlbum, a service for sharing family photos and videos. Romi, a social robot launched in April 2021 and one of our ongoing projects, uses Speech-to-Text by Google Cloud as its speech recognition engine.
The market for social robots has been booming since the late 2010s, and some models are becoming increasingly accessible to consumers. These range from robotic tutors that support children’s social and cognitive development to companion robots for elderly care. Romi stands apart from most social robots in the quality of its dialogue.
Romi’s main advantage is that its AI, which MIXI developed in-house, can produce natural conversation. Romi is small enough to hold in one hand, has a screen that displays a range of facial expressions, and can be placed anywhere in a room. It responds according to the context of the conversation. Romi goes beyond what AI has mostly been used for until now, which is interpreting the intent behind user speech. After all, Romi was created to offer uplifting communication to the people who seek it. Before the release of Romi, this type of speech recognition did not exist. We hope users enjoy talking to it, occasional surprising reactions included.
Speech recognition was one of Romi’s most important components. Most of Romi’s infrastructure runs on a major public cloud that we had already been using for other services. For speech recognition, we chose to test Speech-to-Text from Google Cloud, which was known for its remarkably high accuracy, and the results from the prototype were very promising. Before making a final decision we tried services from other companies, but our assessment of Speech-to-Text did not change.
The accuracy and responsiveness of Speech-to-Text made it a good fit for a social robot like Romi. Google Cloud also gave us a sense of security, thanks to its high stability, which has been demonstrated in supporting Romi’s workloads, and its capacity to support the long-term, continuous development of Romi’s services.
In June 2022, about a year after Romi was released, MIXI decided to reevaluate the speech recognition engine because speech recognition technology was advancing so quickly. We compared the Japanese-compatible speech recognition engines of roughly ten companies and found that Speech-to-Text produced the best results, so we ultimately chose to keep using it. Speech-to-Text also offers several transcription models, and we found that the latest short model, which is geared toward short utterances, suits Romi better than the default model.
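As an illustration of the model choice described above, the following is a minimal sketch of how a recognition request for Japanese audio might select a short-utterance model. It only builds the JSON body for a Speech-to-Text `recognize` call and does not contact the service. The model identifier `latest_short` is an assumption about which model the text refers to, and the function name and bucket path are hypothetical; this is not MIXI's actual configuration.

```python
import json


def build_recognize_request(audio_uri: str, model: str = "latest_short") -> dict:
    """Build the JSON body for a Speech-to-Text recognize request.

    Assumptions: "latest_short" stands in for the short model mentioned
    in the article, and the audio is Japanese ("ja-JP").
    """
    return {
        "config": {
            "languageCode": "ja-JP",  # Japanese speech
            "model": model,           # transcription model to use
        },
        "audio": {"uri": audio_uri},  # audio file stored in Cloud Storage
    }


# Hypothetical audio location, used for illustration only.
body = build_recognize_request("gs://example-bucket/utterance.flac")
print(json.dumps(body, indent=2))
```

Passing `model="default"` instead would select the default model, which makes it straightforward to compare the two on the same set of utterances.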
The cost savings that Speech-to-Text offers are also impressive. In November, the billing unit was changed from 15-second increments, rounded up, to one-second increments, and we anticipated significant cost savings for Romi. This matters to us because Romi, in order to produce more natural conversations, does not use trigger phrases like “OK Google.” As a result, it recognizes and interprets more speech than previous social robots. While this makes the experience more user-friendly, it also means more processing and can be more expensive than typical voice recognition usage. Thanks to Speech-to-Text’s revised pricing, however, we can keep costs down while improving Romi’s voice recognition accuracy.
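To see why the billing change matters for short utterances, here is a small worked sketch comparing billed seconds under 15-second and one-second increments. The utterance durations are invented for illustration, the sketch assumes each request is rounded up independently, and no actual prices are used.

```python
import math


def billed_seconds(duration_s: float, increment_s: int) -> int:
    """Round one request's audio duration up to the next billing increment."""
    if duration_s <= 0:
        return 0
    return math.ceil(duration_s / increment_s) * increment_s


# Hypothetical short utterances, in seconds (not real Romi data).
utterances = [2.0, 3.5, 1.2, 4.0]

old_total = sum(billed_seconds(d, 15) for d in utterances)  # 15-second increments
new_total = sum(billed_seconds(d, 1) for d in utterances)   # one-second increments

print(f"15-second increments: {old_total} s billed")
print(f"1-second increments:  {new_total} s billed")
print(f"Reduction in billed time: {1 - new_total / old_total:.0%}")
```

In this made-up sample every utterance is shorter than 15 seconds, so the old scheme bills each one as a full 15 seconds, while the new scheme bills close to the actual audio length. The real saving depends on the actual distribution of utterance lengths.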
BigQuery for improving data analysis
At first, only voice recognition ran on Google Cloud, but as Romi’s service offering grew, more of Romi’s functions were hosted there. One of the first was the machine learning platform for the AI, which was migrated to Google Cloud early on. Google Cloud is particularly attractive because it offers a cloud platform at a reasonable price. Technical Account Management and Premium Support also helped us make cost-saving decisions.
Last year, MIXI also began migrating Romi’s data analysis platform to BigQuery. We selected BigQuery because it excels at aggregating and analyzing massive amounts of data in a variety of formats, which will become critical as Romi’s services come to require in-depth data analysis. BigQuery’s support for structured query language (SQL), a language the MIXI development team is familiar with, also makes it an attractive option.
We also greatly appreciate being able to use tools like Looker. Writing sophisticated queries takes a lot of work even for engineers, but with Looker, even non-engineers can carry out very complex analyses intuitively. About half a year ago we started holding regular briefings for staff members interested in data analysis. Those staff members now run analyses on their own initiative, hold discussions based on the findings, and come up with new projects and ideas. For us, this has become standard practice.
The current trend in AI-based communication is the development of large language models (LLMs), which learn from enormous amounts of data and produce more natural responses than before.
We have been investigating relevant LLM technologies for some time to enhance Romi’s conversational experience. To run proofs of concept (PoCs) quickly, it is critical to be able to use high-performance GPUs as cheaply as possible. Google Cloud services such as Compute Engine and Vertex AI will continue to be our main focus.