The Most Versatile Sound Machine in the World Makes Its Debut
Fugatto, a new generative AI model from NVIDIA, can produce any mix of music, speech, and other sounds from text and audio inputs.
A team of generative AI researchers has built a Swiss Army knife for sound, one that lets users control the audio output using just words.
While some AI models can compose a song or modify a voice, none has the flexibility of the newcomer.
Fugatto (short for Foundational Generative Audio Transformer Opus 1) generates or transforms any mix of music, voices, and sounds described with prompts built from any combination of text and audio files.
For example, it can create a musical clip from a text prompt, add or remove instruments from an existing song, change the accent or emotion of a voice, and even let people produce sounds they have never heard before.
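NVIDIA has not published a public API for Fugatto, so as a purely hypothetical sketch of the text-plus-audio interface described above, the Python stub below shows the shape of such a call. The `FugattoLike` class, its `generate` method, and every parameter are illustrative stand-ins, not real library code.

```python
import numpy as np

class FugattoLike:
    """Toy stand-in for a model that maps (text, optional audio) -> audio."""

    def __init__(self, sample_rate: int = 48_000):
        self.sample_rate = sample_rate

    def generate(self, prompt: str, audio=None, seconds: float = 3.0):
        # A real model would condition a generative transformer on the
        # prompt and any input audio; this stub just returns silence of
        # the requested length so the shapes are concrete.
        num_samples = int(seconds * self.sample_rate)
        return np.zeros(num_samples, dtype=np.float32)

model = FugattoLike()
# Text only: synthesize a clip from a description.
clip = model.generate("an upbeat dance track with a saxophone that meows")
# Text plus audio: transform an existing clip (here, fake input audio).
remix = model.generate("remove the vocals and add rain sounds",
                       audio=np.zeros(48_000, dtype=np.float32))
print(clip.shape, remix.shape)  # (144000,) (144000,)
```

The point is the contract: prompts may be text alone or text plus audio, and the output is always audio.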
The idea struck a chord with a multi-platinum producer and songwriter who cofounded One Take Audio, a startup in NVIDIA's Inception program. "The motivation comes from sound; that's what inspires me to compose music," he said. "The idea that I can create completely original sounds in the studio, on the fly, is incredible."
A Sound Understanding of Audio
Fugatto is the first foundational generative AI model to show emergent properties, capabilities that arise from the interplay of its various trained abilities, and to combine free-form instructions. It supports a wide range of audio generation and transformation tasks.
Fugatto is a first step toward a future in which unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.
A Sample Playlist of Use Cases
Music producers, for example, could use Fugatto to quickly prototype or revise a song idea, trying out different instruments, vocals, and genres. They could also add effects and improve the overall audio quality of an existing track.
The history of music is also a history of technology. The electric guitar gave the world rock and roll. The sampler opened the door to hip-hop. "We're writing the next chapter of music with AI," the producer said. "It's really thrilling to have a new instrument, a new tool for making music."
An advertising agency could use Fugatto to quickly retarget an existing campaign for multiple regions or situations, giving the voiceovers different accents and emotions.
Language learning tools could be personalized to use any voice a speaker wants. Imagine an online course delivered in the voice of a friend or family member.
Video game developers could use the model to adapt prerecorded assets in a title to the shifting action as players progress. Or they could create new assets on the fly from text instructions and optional audio inputs.
Making a Joyful Noise
For example, Fugatto can make a saxophone meow or a trumpet bark. Anything users can describe, the model can create.
Through fine-tuning with small amounts of singing data, researchers found it could handle tasks it was not pretrained on, such as generating a high-quality singing voice from a text prompt.
Users Get Artistic Controls
Several capabilities add to Fugatto's novelty.
During inference, the model can combine instructions that were only seen separately during training, thanks to a technique called ComposableART. For example, a combination of prompts could ask for text spoken with a sad feeling in a French accent.
The model's ability to interpolate between instructions gives users fine-grained control over text prompts, such as the heaviness of an accent or the degree of sorrow.
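ComposableART's exact formulation is in the team's paper; the sketch below only illustrates the general classifier-free-guidance-style idea of weighting and blending instruction-conditioned outputs, with a toy `score` function standing in for the model. All names and numbers here are assumptions for illustration.

```python
import numpy as np

def score(latent: np.ndarray, instruction) -> np.ndarray:
    # Toy stand-in: a real model would run a transformer conditioned on
    # `instruction`; here a seeded RNG produces a pseudo-score instead.
    seed = abs(hash(instruction)) % (2**32)
    rng = np.random.default_rng(seed)
    return 0.9 * latent + 0.1 * rng.standard_normal(latent.shape)

def composed_score(latent, instructions, weights):
    """Blend several instructions, each with its own guidance weight.
    The weights double as interpolation knobs (e.g. 0.3 = a mild accent)."""
    unconditional = score(latent, None)
    out = unconditional.copy()
    for text, w in zip(instructions, weights):
        out += w * (score(latent, text) - unconditional)
    return out

latent = np.zeros(16)
blended = composed_score(
    latent,
    instructions=["speak with a French accent", "sound sorrowful"],
    weights=[0.7, 0.4],  # dial accent strength and emotion independently
)
print(blended[:4])
```

Raising or lowering a weight is what yields the fine-grained control described above, such as a heavier accent or a milder emotion.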
And unlike most models, which can only recreate the training data they have been exposed to, Fugatto lets users create soundscapes they have never heard before, such as a thunderstorm easing into a dawn filled with birdsong.
A Look Under the Hood
Fugatto is a foundational generative transformer model that builds on the team's earlier work in areas including speech modeling, audio vocoding, and audio understanding.
The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.
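As a rough sanity check on the 2.5-billion-parameter figure, the snippet below counts the weights of a plain decoder-style transformer. The depth, width, and vocabulary size are guesses for illustration, not Fugatto's published hyperparameters, and the count ignores layer norms, biases, and any audio-codec components.

```python
def transformer_params(layers: int, d_model: int, vocab: int) -> int:
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)     # up- and down-projections (4x width)
    embedding = vocab * d_model           # audio-token embedding table
    return layers * (attn + mlp) + embedding

# Assumed (not published) hyperparameters: 36 layers, width 2304, 8k vocab.
total = transformer_params(layers=36, d_model=2304, vocab=8192)
print(f"{total / 1e9:.2f}B parameters")   # ~2.31B, near the stated 2.5B
```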
Fugatto was made by a diverse group of people from around the globe, including India, Brazil, China, Jordan, and South Korea. Their collaboration strengthened Fugatto's multiaccent and multilingual capabilities.
One of the hardest parts of the effort was assembling a blended dataset containing millions of audio samples for training. The team employed a multifaceted strategy to generate data and instructions that considerably expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without requiring additional data.
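The article does not spell out the team's exact recipe, but one common way to generate data and instructions at scale is template-based instruction expansion, sketched below with made-up templates and attributes; none of these strings come from Fugatto's actual training set.

```python
import itertools

# Hypothetical instruction templates and attribute lists for illustration.
TEMPLATES = [
    "make this voice sound {emotion} with a {accent} accent",
    "convert the speaker to a {accent} accent with a {emotion} tone",
]
EMOTIONS = ["happy", "sorrowful", "calm"]
ACCENTS = ["French", "Brazilian", "Korean"]

def expand_instructions(sample_id: str) -> list[tuple[str, str]]:
    """Pair one audio sample with many synthetic instruction variants."""
    pairs = []
    for tpl, emo, acc in itertools.product(TEMPLATES, EMOTIONS, ACCENTS):
        pairs.append((sample_id, tpl.format(emotion=emo, accent=acc)))
    return pairs

pairs = expand_instructions("sample_0001.wav")
print(len(pairs))   # 18 instruction variants from a single clip
```

Multiplying a modest set of templates and attributes in this way widens task coverage without collecting new audio, which matches the goal described above.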
They also scrutinized existing datasets to uncover new relationships among them. The entire project took nearly a year to complete.
Given a prompt, the team then demonstrated Fugatto creating electronic music with dogs barking in time to the beat.
"It truly warmed my heart when the group broke up laughing," one of the researchers recalled.
Hear what Fugatto can do: