Enhancing the naturalness via speech-to-text

August 11, 2023

Speech-to-Text by Google Cloud

MIXI, Inc. (MIXI) is a social networking company that offers a variety of services for friends and family to enjoy together, including the social media platform mixi, the mobile game Monster Strike, and the FamilyAlbum service for sharing family photos and videos. Romi, a social robot introduced in April 2021 and one of our ongoing projects, uses Speech-to-Text by Google Cloud as its speech recognition engine.

The market for social robots has been booming since the late 2010s, and some models are becoming increasingly accessible to consumers. These models range from robotic tutors that support children’s social and cognitive development to companion robots for elderly care. Romi stands apart from most social robots because of a noticeable difference in the quality of its dialogue.

The main advantage of Romi is that its AI, which MIXI developed in-house, can produce natural conversation. Romi is small enough to hold in one hand, has a screen that displays various facial expressions, and can be placed anywhere in a room. It responds according to the context of the dialogue. Until now, AI has mostly been used to interpret the intentions behind user speech; Romi is an AI-powered robot that goes beyond that. After all, the purpose of Romi was to offer uplifting communication to the people who seek it. Nothing of this kind existed before Romi was released. We hope users enjoy talking to it, including the occasional surprising reaction.

Speech recognition was one of the most important components of Romi. Most of Romi’s infrastructure runs on a major public cloud that we were already using for other services. For speech recognition, however, we chose to test Speech-to-Text from Google Cloud, which was acclaimed for its remarkably high accuracy, and the results from the prototype were very promising. Before making a final decision we also tried services from other companies, but our assessment of Speech-to-Text did not change.

The accuracy and responsiveness of Speech-to-Text made it a good fit for a social robot like Romi. Google Cloud also offered a feeling of security: it has proven highly stable in supporting Romi’s workloads, and it can support the long-term, continuous development of Romi’s services.

In June 2022, about a year after Romi was released, MIXI decided to reevaluate the speech recognition engine because speech recognition technology was advancing so quickly. We looked at the Japanese-compatible speech recognition engines of roughly ten vendors and found that Speech-to-Text produced the best results, so we ultimately chose to keep using it. Speech-to-Text also offers a number of transcription models, and we found that the most recent short model, which focuses on short utterances, is better suited to Romi than the default model.
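As a rough illustration of what selecting a short-utterance model can look like, the sketch below builds a JSON request body for the Speech-to-Text v1 `speech:recognize` REST method using only the Python standard library. The model identifier `latest_short`, the `ja-JP` language code, the LINEAR16 encoding, and the 16 kHz sample rate are assumptions chosen for the example; the article does not state which settings Romi actually uses, and the helper name `build_recognize_request` is hypothetical.

```python
import base64
import json


def build_recognize_request(
    audio_bytes: bytes,
    language_code: str = "ja-JP",   # assumption: Japanese speech
    model: str = "latest_short",    # assumption: the short-utterance model
    sample_rate_hz: int = 16000,    # assumption: 16 kHz mono PCM audio
) -> dict:
    """Build the JSON body for a synchronous recognize request.

    This only constructs the payload; it does not call the API.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hz,
            "languageCode": language_code,
            "model": model,
        },
        # The REST API expects raw audio as a base64-encoded string.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


# 10 ms of silence at 16 kHz, 16-bit samples, as stand-in audio.
body = build_recognize_request(b"\x00\x00" * 160)
print(json.dumps(body["config"], indent=2))
```

Switching the `model` argument to `"default"` would give the baseline configuration for a side-by-side comparison on the same audio.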

The cost savings that Speech-to-Text offers are also impressive. In November, the billing unit changed from 15-second increments, rounded up, to one-second increments, and we expected significant cost savings for Romi. This matters to us because Romi does not use trigger phrases like “OK Google,” so that conversations feel more natural. As a result, it recognizes and interprets more speech than earlier social robots. While this makes the experience more user-friendly, it also means more processing and can cost more than typical voice recognition usage. Thanks to Speech-to-Text’s revised pricing, we can keep expenses down while improving Romi’s voice recognition accuracy.
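To make the effect of the billing change concrete, here is a small worked example. It assumes each utterance is sent as its own request and that each request is rounded up to the billing increment; the utterance durations are invented for illustration and are not Romi data.

```python
import math


def billed_seconds(duration_s: float, increment_s: int) -> int:
    """Round one request's audio duration up to the next billing increment."""
    if duration_s <= 0:
        return 0
    return math.ceil(duration_s / increment_s) * increment_s


# Hypothetical short utterances, in seconds (not real usage data).
utterances = [2.4, 3.1, 1.8, 6.0]

old_billed = sum(billed_seconds(d, 15) for d in utterances)  # 15-second increments
new_billed = sum(billed_seconds(d, 1) for d in utterances)   # 1-second increments

print(f"15-second increments: {old_billed} s billed")
print(f"1-second increments:  {new_billed} s billed")
print(f"reduction: {1 - new_billed / old_billed:.0%}")
```

With these made-up durations, the billed time drops from 60 seconds to 15 seconds. The shorter the typical utterance, the larger the gap, which is why the change matters for a robot that listens without a trigger phrase.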

BigQuery for improving data analysis

At first, only voice recognition ran on Google Cloud, but as Romi’s service offering grew, more of Romi’s functions were hosted there. One of the first was the AI machine learning platform, which we migrated to Google Cloud early on. Google Cloud is particularly attractive because it offers a cloud platform at a reasonable price. With the help of Technical Account Management and Premium Support, we were able to make cost-saving decisions.

Last year, MIXI also began moving Romi’s data analysis platform to BigQuery. We selected BigQuery because it excels at compiling and analyzing massive amounts of data in a variety of forms, which will become critical as Romi’s services require more in-depth data analysis. BigQuery’s support for structured query language (SQL), a language the MIXI development team is familiar with, also makes it an appealing option.

We also greatly appreciate being able to use tools like Looker. Writing sophisticated queries takes a lot of work even for engineers, but with Looker, even non-engineers can carry out very complex analyses intuitively. About half a year ago we started holding regular briefings for staff members who were interested in data analysis. Those staff members now conduct analyses on their own initiative, hold discussions based on the findings, and come up with new projects and ideas. For us, this has become standard practice.

The current trend in AI-based communication is the development of large language models (LLMs), which learn from enormous amounts of data and give more natural responses than earlier systems.

For some time, we have been investigating relevant LLM technologies to enhance Romi’s conversational experience. To run proofs of concept (PoCs) quickly, it is critical to be able to use high-performance GPUs as cheaply as possible. Google Cloud services such as Compute Engine and Vertex AI will continue to be our main focus.
