Google DeepMind’s Genie 2: 3D Worlds At Your Fingertips

December 12, 2024

Generating unlimited, diverse training environments for future general agents

Google DeepMind presents Genie 2, a foundation world model that can generate an endless variety of playable, action-controllable 3D environments for training and evaluating embodied agents. Starting from a single prompt image, the generated world can be played by a human or an AI agent using keyboard and mouse inputs.

Games play a crucial role in artificial intelligence (AI) research. Their engaging nature, unique blend of challenges, and measurable progress make them ideal settings for safely testing and developing AI capabilities.

Indeed, Google DeepMind has placed a high value on games since its founding. Games have always been at the forefront of its research, from its early work with Atari games, to breakthroughs such as AlphaGo and AlphaStar, to its research on generalist agents in partnership with game developers. However, training more general embodied agents has historically been bottlenecked by the availability of sufficiently rich and diverse training environments.

Google DeepMind shows how Genie 2 could allow future agents to be trained and evaluated in a limitless supply of new environments. Its research also opens the door to new creative workflows for prototyping interactive experiences.

New features of a foundational world model

Until now, world models have mostly been limited to modelling narrow domains. With Genie 1, Google DeepMind introduced a method for generating a wide variety of 2D worlds. Genie 2, which it presents today, is a significant step forward in generality: it can generate a wide variety of rich 3D worlds.

As a world model, Genie 2 can simulate virtual environments, including the consequences of taking an action (such as jumping or swimming). Like other generative models, it was trained on a large-scale video dataset and shows a range of emergent capabilities at scale, including physics, complex character animation, object interactions, and the ability to model and predict the behavior of other agents.

The examples below show people interacting with Genie 2. In each case, the model is prompted with a single image generated by Imagen 3, the state-of-the-art text-to-image model from Google DeepMind (GDM). This means that anyone can describe a world in text, pick their preferred rendering of that idea, and then step into and interact with the newly created world (or have an AI agent trained or evaluated in it). At each step, a person or agent provides a keyboard and mouse action, and Genie 2 simulates the next observation. Most of the examples shown last 10 to 20 seconds, although Genie 2 can generate consistent worlds for up to a minute.

Action Controls

Genie 2 responds to key presses on a keyboard by identifying the character in the scene and moving it correctly in response to the input.

Creating counterfactual scenarios

Genie 2 can generate several different trajectories from the same starting frame, which makes it possible to simulate counterfactual experiences for training agents.
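The idea of branching several trajectories from one starting frame can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "frame" is reduced to an (x, y) position, the `MOVES` key bindings are hypothetical, and the deterministic `step` function merely stands in for a learned world model. It is not how Genie 2 is implemented.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]  # toy (x, y) position standing in for a video frame

# Hypothetical key bindings for the toy world (not taken from Genie 2).
MOVES: Dict[str, State] = {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}


def step(state: State, key: str) -> State:
    """Toy stand-in for a world model: map (state, action) to the next state."""
    dx, dy = MOVES[key]
    return (state[0] + dx, state[1] + dy)


def rollout(start: State, keys: str) -> List[State]:
    """Apply a sequence of key presses, recording every intermediate state."""
    trajectory = [start]
    for key in keys:
        trajectory.append(step(trajectory[-1], key))
    return trajectory


# Three counterfactual branches that all begin from the same starting state.
start: State = (0, 0)
branches = {keys: rollout(start, keys) for keys in ("WW", "WD", "AA")}
```

Each entry in `branches` shares its first state and diverges only because of the actions taken, which is the sense in which the trajectories are counterfactual to one another.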

Long-term memory

Genie 2 can remember parts of the world that are no longer in view and render them accurately when they become observable again.

Generating long videos with newly generated content

Genie 2 generates new, plausible content on the fly and maintains a consistent world for up to a minute.

Diverse environments

Genie 2 can generate different perspectives, such as first-person views, isometric views, and third-person driving videos.

Three-dimensional structures

Genie 2 learned to create complex 3D visual scenes.

Object affordances and interactions

Genie 2 models a variety of object interactions, such as bursting balloons, opening doors, and shooting barrels of explosives.

Character animation

Genie 2 learned to animate a variety of characters performing different activities.

NPCs

Genie 2 models other agents, including complex interactions with them.

Physics

Genie 2 models water effects.

Smoke

Genie 2 models smoke effects.

Gravity

Genie 2 models gravity.

Lighting

Genie 2 models point and directional lighting.

Reflections

Genie 2 models reflections, bloom, and coloured lighting.

Playing from pictures of the real world

Genie 2 can also be prompted with real-world images, in which it can model effects such as grass blowing in the wind or water flowing in a river.

Genie 2 enables rapid prototyping

Genie 2 makes it easy to prototype a variety of interactive experiences rapidly, so researchers can quickly test new environments in which to train and evaluate embodied AI agents.

Thanks to Genie 2's out-of-distribution generalisation capabilities, concept art and drawings can be turned into fully interactive environments. This lets artists and designers prototype quickly, which can bootstrap the creative process for environment design and further accelerate research.

AI agents acting in the world model

By using Genie 2 to quickly create rich and diverse environments for AI agents, GDM's researchers can also generate evaluation tasks that agents have not seen during training.

Image generated by Imagen 3 — Image credit to Google

The SIMA agent is designed to follow natural-language instructions to complete tasks in a range of 3D game environments.

SIMA can also be used to evaluate Genie 2's capabilities. In this example, the researchers test Genie 2's ability to generate consistent environments by instructing SIMA to look around and explore behind the house.

This research is still at an early stage, with substantial room for improvement in both agent and environment generation capabilities. Even so, Google DeepMind believes Genie 2 is the path to solving a structural problem: training embodied agents safely while achieving the breadth and generality needed to progress toward AGI.

Diffusion world model

Genie 2 is an autoregressive latent diffusion model trained on a large video dataset. The video is first passed through an autoencoder, and the resulting latent frames are fed into a large transformer dynamics model, which is trained with a causal mask similar to the one used by large language models.
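As a rough illustration of what a causal mask does, the sketch below builds one over a short sequence of latent frames and applies it in a deliberately simplified way. This is an assumption-laden toy: `causal_mask` and `masked_average` are hypothetical helpers, each latent frame is reduced to a single number, and a plain average replaces real attention. It shows only the masking rule, not Genie 2's architecture.

```python
from typing import List


def causal_mask(num_frames: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j, i.e. j <= i."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_average(latents: List[float], mask: List[List[bool]]) -> List[float]:
    """Toy stand-in for attention: each frame averages the frames it may see."""
    out = []
    for row in mask:
        visible = [z for z, allowed in zip(latents, row) if allowed]
        out.append(sum(visible) / len(visible))
    return out


mask = causal_mask(4)
# Frame 0 sees only itself; frame 3 sees frames 0 through 3.
context = masked_average([1.0, 3.0, 5.0, 7.0], mask)
print(context)  # [1.0, 2.0, 3.0, 4.0]
```

Because no frame can draw on frames that come after it, a model trained this way can later be run one frame at a time, using only what has already been generated.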

At inference time, Genie 2 can be sampled autoregressively, frame by frame, conditioning each new frame on individual actions and the past latent frames. Classifier-free guidance is used to improve action controllability.
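The sketch below combines the two ideas in this paragraph in a toy setting: a frame-by-frame rollout, and the standard classifier-free guidance rule, which extrapolates from an unconditional prediction toward an action-conditioned one. The names `toy_model`, `cfg_combine`, and `rollout` are hypothetical, each latent is a single number, and the iterative denoising steps of a real diffusion model are omitted, so this should be read as an illustration of the guidance arithmetic rather than Genie 2's sampler.

```python
from typing import List, Optional


def cfg_combine(uncond: float, cond: float, scale: float) -> float:
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one. scale=1.0 returns the conditional prediction."""
    return uncond + scale * (cond - uncond)


def toy_model(history: List[float], action: Optional[int]) -> float:
    """Hypothetical stand-in for the dynamics model: predict the next latent
    from the last one. action=None means 'no action conditioning'."""
    drift = 0.0 if action is None else float(action)
    return history[-1] + drift


def rollout(first_latent: float, actions: List[int], scale: float) -> List[float]:
    """Generate latents one at a time, each conditioned on all previous ones."""
    history = [first_latent]
    for action in actions:
        uncond = toy_model(history, None)    # prediction ignoring the action
        cond = toy_model(history, action)    # prediction given the action
        history.append(cfg_combine(uncond, cond, scale))
    return history


plain = rollout(0.0, [1, 1, -1], scale=1.0)
guided = rollout(0.0, [1, 1, -1], scale=2.0)
print(plain)   # [0.0, 1.0, 2.0, 1.0]
print(guided)  # [0.0, 2.0, 4.0, 2.0]
```

With a guidance scale above 1, the effect of each action on the next latent is amplified, which is the intuition behind using guidance to strengthen action control.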

Diffusion world model — Image credit to Google

Developing its technologies responsibly

Genie 2 demonstrates how foundation world models can be used to create diverse 3D environments and accelerate agent research. Although this research direction is still in its early stages, GDM looks forward to further improving Genie's world-generation capabilities in terms of generality and consistency.

As with SIMA, the goal of this research is to develop more general AI systems and agents that can understand and safely carry out a wide range of tasks in ways that are helpful to people, both online and in the real world.