ZeroHSI is a new technique that uses neural rendering and video generation models to create lifelike 4D human-scene interactions without requiring large amounts of paired motion-scene training data. It generates interactions in a variety of 3D environments, including ones with dynamic objects, by combining pre-trained video models that have learnt from human movements with differentiable rendering. To assess the method's zero-shot synthesis abilities, the researchers also present the AnyInteraction dataset, which covers a variety of indoor and outdoor scenes and a range of interaction types. Their approach shows that it is possible to generate diverse and contextually relevant human-scene interactions without depending on specific training examples for every situation.
Without requiring paired motion-scene training data, ZeroHSI uses video generation models and neural rendering to create realistic 3D human motion that interacts with a variety of scenes (indoor and outdoor, real reconstructed and synthesised) and dynamic objects.
Abstract
Generating human-scene interaction (HSI) is essential for robotics, virtual reality, and embodied AI applications. Existing techniques can create plausible human-object interactions and synthesise realistic human motions in 3D scenes, but they mainly rely on datasets that contain paired 3D scene and motion capture data, which are costly and time-consuming to gather across a variety of environments and interactions. ZeroHSI is a new method that combines neural human rendering with video generation to enable zero-shot 4D human-scene interaction synthesis.
Its main contribution is to use differentiable rendering to reconstruct human-scene interactions from the rich motion priors learnt by state-of-the-art video generation models, which have been trained on a large number of natural human movements and interactions. Without the need for ground-truth motion data, ZeroHSI can synthesise realistic human movements in both static scenes and environments with dynamic objects. The authors evaluate ZeroHSI on a curated dataset of distinct indoor and outdoor scene types and interaction prompts, showing that it can produce a wide range of contextually relevant human-scene interactions.
Real Reconstructed Scenes with Generated Interactions
ZeroHSI can generate human-scene interactions in real 3D scenes that have been reconstructed. One example is a person lifting a vase.
For each of these interactions, ZeroHSI produces an HSI video and the corresponding 4D HSI. Each example is generated from a text prompt describing the intended interaction.
Generated Long-Term Interactions
By conditioning on a series of text prompts, ZeroHSI can also generate long-term interactions. One example is a person walking to a table, watering flowers with a watering can, setting the watering can down on the table, and then leaning on the table.
These examples demonstrate how ZeroHSI can use a sequence of text instructions to generate extended, coherent interactions inside a 3D scene.
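The article does not spell out how consecutive prompts are linked. One plausible reading, assumed here, is that each segment starts from the state in which the previous one ended. The sketch below illustrates only that chaining idea: `Segment`, `chain_segments`, and `dummy_generate` are hypothetical names, and `dummy_generate` is a stand-in that moves a 2D point rather than a real video or motion generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, float]  # placeholder for a full human/object state


@dataclass
class Segment:
    prompt: str
    frames: List[State]


def chain_segments(
    prompts: List[str],
    initial_state: State,
    generate: Callable[[str, State], List[State]],
) -> List[Segment]:
    """Generate one segment per prompt, starting each from the previous end state."""
    segments = []
    state = initial_state
    for prompt in prompts:
        frames = generate(prompt, state)
        segments.append(Segment(prompt, frames))
        state = frames[-1]  # assumption: the next segment resumes from here
    return segments


def dummy_generate(prompt: str, start: State, n_frames: int = 4) -> List[State]:
    """Hypothetical stand-in generator: moves one unit along x per frame."""
    x, y = start
    return [(x + i, y) for i in range(n_frames)]


prompts = [
    "walk to the table",
    "water the flowers with the watering can",
    "put the watering can on the table",
    "lean on the table",
]
segments = chain_segments(prompts, (0.0, 0.0), dummy_generate)
```

Because each segment begins exactly where the previous one stopped, the concatenated frames form one continuous trajectory, which is the property a long-term interaction needs.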
Object-Based Generated Interactions on Synthetic Scenes
ZeroHSI also handles synthetic 3D scenes that contain objects. Compared with earlier approaches such as LINGO and CHOIS, ZeroHSI is reported to perform well at generating these interactions. It can also generate interactions in fully synthetic scenes.
These results suggest that ZeroHSI can handle interactions with specific objects in synthetic scenes.
They also highlight its adaptability in producing contextually appropriate human motion across a range of synthetic environments.
Overview of the Method
ZeroHSI takes a 3D scene, an interactable object, a text description, and initial states as input, and produces motion sequences for both the human and the dynamic object. The method starts by generating an HSI video conditioned on the text prompt and a rendering of the initial state. Then, using differentiable neural rendering, ZeroHSI optimises the per-frame camera pose, human pose parameters, and object 6D pose by minimising the difference between the rendered video and the generated reference video. This optimisation grounds the generated video in a consistent 3D scene, which lets ZeroHSI produce realistic 4D human-scene interactions without depending on paired motion-scene data.
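The optimisation step can be illustrated with a deliberately tiny example. The sketch below is not ZeroHSI's renderer or loss: it assumes a toy "renderer" that draws a Gaussian blob at a 2D position, a sum-of-squared-differences loss, and plain gradient descent with an analytic gradient. The names `render`, `loss_and_grad`, and `fit_sequence` are hypothetical. What it does share with the description above is the structure: each frame's parameters are fitted by minimising the difference between a rendered frame and a reference frame, starting from a known initial state.

```python
import numpy as np

SIZE, SIGMA = 32, 3.0
YS, XS = np.mgrid[0:SIZE, 0:SIZE].astype(float)  # pixel coordinate grids


def render(pose):
    """Toy differentiable renderer: a Gaussian blob centred at pose = (x, y)."""
    d2 = (XS - pose[0]) ** 2 + (YS - pose[1]) ** 2
    return np.exp(-d2 / (2 * SIGMA**2))


def loss_and_grad(pose, reference):
    """Sum of squared pixel differences and its gradient w.r.t. the pose."""
    img = render(pose)
    residual = img - reference
    loss = np.sum(residual**2)
    # d(img)/d(px) = img * (XS - px) / SIGMA^2, and likewise for py
    gx = np.sum(2 * residual * img * (XS - pose[0])) / SIGMA**2
    gy = np.sum(2 * residual * img * (YS - pose[1])) / SIGMA**2
    return loss, np.array([gx, gy])


def fit_sequence(reference_frames, initial_pose, lr=0.3, steps=100):
    """Fit one pose per frame, warm-starting each frame from the previous fit."""
    pose = np.asarray(initial_pose, dtype=float).copy()
    fitted = []
    for reference in reference_frames:
        for _ in range(steps):
            _, grad = loss_and_grad(pose, reference)
            pose = pose - lr * grad
        fitted.append(pose.copy())
    return np.array(fitted)


# Stand-in "generated video": frames rendered from a known trajectory.
true_poses = np.array([[8.0 + t, 10.0 + 0.5 * t] for t in range(10)])
reference_video = [render(p) for p in true_poses]
recovered = fit_sequence(reference_video, initial_pose=true_poses[0])
```

Warm-starting each frame from the previous estimate keeps the optimiser close to the solution, which matters because a photometric loss of this kind only gives useful gradients when the rendered and reference content overlap. In the real method the optimised variables are the camera pose, human pose parameters, and object 6D pose rather than a single 2D point.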