3D Assets and World Building
In the Metaverse, world building and the creation of 3D assets and scenes will require the heaviest lifting if done with today's manual and procedural methods. With these classical methods alone, the realization of a Metaverse with global presence will be impossible. In my 2014 talk I spoke about both the potential and the necessity for generative AI to produce the 3D content required to build virtual worlds. Up until 2020 it felt like a long way off; then UC Berkeley students Ben Mildenhall, Pratul Srinivasan, Matthew Tancik, et al. released their paper on NeRF, titled "Representing Scenes as Neural Radiance Fields for View Synthesis". This was a game changer for synthesizing 3D scenes from 2D images. NeRF has inspired additional research, including NDDF and NELF, which sought to improve on the long rendering times that plagued NeRF. Work continued on improving the performance of NeRFs, and in 2022 Nvidia released Instant Neural Graphics Primitives (a.k.a. Instant NeRF), which drastically reduced the time needed to train a NeRF. It was a big enough innovation that Time magazine named it one of the best inventions of 2022.
Source: NeRF project page on Matthew Tancik’s website.
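For readers unfamiliar with the core idea: a NeRF is a small neural network that maps a 3D position and viewing direction to a color and a density, and an image is synthesized by compositing those predictions along each camera ray. The following is a minimal NumPy sketch of that standard volume-rendering step, written as my own simplified illustration rather than the authors' code:

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one camera ray.

    sigmas: (N,) densities predicted by the NeRF MLP at N samples
    colors: (N, 3) RGB colors predicted at the same samples
    deltas: (N,) distances between consecutive samples
    Returns the final RGB color for the ray.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)           # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                          # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)    # expected color along the ray

# Toy example: 64 samples of a ray through a made-up density field
n = 64
rgb = composite_ray(
    sigmas=np.linspace(0.0, 3.0, n),                  # density rising with depth
    colors=np.tile([0.8, 0.2, 0.1], (n, 1)),          # constant reddish color
    deltas=np.full(n, 0.05),
)
print(rgb)
```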
The transition from creation to generation is a necessary one if the dream of an Open Metaverse is to be fulfilled. Although it is a strong technical disruption to the status quo, and will change how 3D modelers, 3D artists, animators and other 3D-related artists work today, it will have a democratizing effect on the generation of 3D assets, scenes and worlds. The reduced barrier to entry will allow for lower-cost, more rapidly developed experiences and the commoditization of digital assets for all communities, especially economically challenged and marginalized communities who cannot afford high-powered computers, expensive software and extensive training.
The techniques I see evolving are audio-to-3D, audio-to-scene, and so forth. I don't say text-to-something because voice will be the input tool of the Metaverse, except for long prescriptive text prompts that are intended to be loaded like code chunks. There have also been recent developments in in-painting, out-painting, neural style transfer and sketch-to-image (which will evolve into sketch-to-3D), as well as text-guided editing, which could of course become voice-guided editing. These and novel methods yet to come will make 3D digital asset creation and world building a low-friction experience for consumers.
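A voice-driven pipeline can be pictured as a simple chain: transcribe the spoken request, then hand the transcript to whatever generator is available. The sketch below uses OpenAI's open-source Whisper model for the transcription step; the `generate_3d_asset` function and the file names are purely hypothetical placeholders for a future audio-to-3D or text-to-3D backend.

```python
import whisper  # open-source speech-to-text model (pip install openai-whisper)

def generate_3d_asset(prompt: str) -> str:
    """Hypothetical stand-in for a text-to-3D generator.

    In a real pipeline this would call something like a diffusion-guided
    NeRF or mesh generator and return a path to the produced asset.
    """
    print(f"Generating 3D asset for prompt: {prompt!r}")
    return "asset_placeholder.glb"

# Step 1: voice in. Whisper turns a spoken request into text.
model = whisper.load_model("base")
result = model.transcribe("user_request.wav")  # e.g. "a red medieval sword"

# Step 2: the transcript (or eventually the raw audio) goes to the generator.
asset_path = generate_3d_asset(result["text"])
print("Created:", asset_path)
```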
3D Avatar Generation
There have been applications of discriminative AI (used for classification tasks) in 3D avatar generation, where an image of a user's face is analyzed to estimate facial features like shape and skin tone and then matched against a library of avatar assets. Generative AI, however, will facilitate highly accurate avatar creation based on input images, from the most photo-realistic representation all the way down to cartoonish appearances or any preferred style. The great thing about generative AI is that you can very easily perform a style transfer to convert your photo-realistic avatar into a style that matches the virtual world you enter, so as to remain style-consistent and abide by the aesthetic and safety rules of that world. One could imagine entering the Roblox virtual world in the Metaverse and one's avatar instantly becoming a block-style avatar to fit the motif of the world.
Dr. Karoly Zsolnai-Feher’s Two Minute Papers on 3D Style Transfer (2019)
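In 2D, this kind of restyling is already possible with off-the-shelf diffusion models. The sketch below uses Hugging Face's `diffusers` image-to-image pipeline to push a photo-realistic avatar portrait toward a blocky aesthetic; treat it as an illustrative stand-in for true 3D style transfer, where the model choice, prompt and file names are my own assumptions.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a pretrained image-to-image diffusion pipeline (model choice is illustrative);
# assumes a CUDA-capable GPU is available.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A photo-realistic render of the user's avatar (hypothetical input file).
avatar = Image.open("avatar_photoreal.png").convert("RGB").resize((512, 512))

# Restyle the avatar to match a blocky, low-poly world; `strength` controls how
# far the output is allowed to drift from the original likeness.
styled = pipe(
    prompt="low-poly blocky video game character, voxel style",
    image=avatar,
    strength=0.6,
    guidance_scale=7.5,
).images[0]

styled.save("avatar_block_style.png")
```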
Avatar generation will certainly be a very popular feature of the Metaverse and a starting point for those entering it. Being able to modify one’s avatar quickly without having to select from a limited library or request that a 3D modeler custom build one will be a must to provide Metaverse users with higher degrees of autonomy and avatar uniqueness.
Rigging
Avatars are not avatars without rigging. Neural methods have already proven effective at predicting skeletal structure in order to rig 3D assets, such as the work by Xu et al. presented at SIGGRAPH 2020, entitled RigNet.
RigNet — Zhan Xu, Yang Zhou, Evangelos Kalogerakis, Chris Landreth, Karan Singh
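Once a skeleton and skinning weights exist, whether painted by hand or predicted by a network like RigNet, the avatar's mesh is deformed by its joints through linear blend skinning. A minimal NumPy sketch of that step (the toy data is arbitrary):

```python
import numpy as np

def linear_blend_skinning(vertices, weights, joint_transforms):
    """Deform a mesh by its skeleton using linear blend skinning.

    vertices:          (V, 3) rest-pose vertex positions
    weights:           (V, J) per-vertex skinning weights (rows sum to 1),
                       e.g. as predicted by a rigging network
    joint_transforms:  (J, 4, 4) homogeneous transforms of each joint
                       relative to its rest pose
    Returns (V, 3) deformed vertex positions.
    """
    V = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((V, 1))])             # (V, 4)
    # Transform every vertex by every joint: (J, V, 4)
    per_joint = np.einsum("jab,vb->jva", joint_transforms, homo)
    # Blend the per-joint results with the skinning weights: (V, 4)
    blended = np.einsum("vj,jva->va", weights, per_joint)
    return blended[:, :3]

# Toy example: two vertices, two joints, with the second joint lifted on Y.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.3, 0.7]])
T = np.stack([np.eye(4), np.eye(4)])
T[1, 1, 3] = 0.5                                              # translate joint 1 up by 0.5
print(linear_blend_skinning(verts, w, T))
```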
3D Asset Generation
Although everyone in the Metaverse will start out by generating their own avatar (or perhaps a multitude of them), most of the Metaverse will be constructed of scenes and 3D digital assets. Users will need to build their homes and to generate digital accessories like wearables and aesthetics, as well as utility assets such as a sword or a car. Recent advances, beginning with UC Berkeley PhD student Ajay Jain's work in 2022 on Dream Fields and later that year on DreamFusion with Ben Poole, Ben Mildenhall and Jon Barron from Google, demonstrated that 3D assets could be generated by optimizing a NeRF against a model trained on 2D images (CLIP in Dream Fields, a 2D diffusion model in DreamFusion), producing impressive results. Shortly after, Nvidia released a paper on Get3D, which was trained on 3D models rather than on 2D images and produced high-quality textured meshes.
Get3D model results produced by Berkeley Synthetic.
However, there is a lack of 3D data, and much of what exists is covered by licensing agreements that do not allow training models and redistributing the results, so most work has focused on 2D image inputs. Magic3D was released in late 2022 by Nvidia and improved upon the work done on DreamFusion. To date no code has been released to validate the claims made in the paper, but the quality of the outputs is impressive and will certainly lead to further innovations, with the eventual result being photo-realistic, high-resolution, high-poly-count 3D digital assets generated from prompts.
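The common thread in DreamFusion and Magic3D is score distillation: render a view of the 3D representation, add noise to it, and use a frozen 2D diffusion model's denoising error as the gradient signal for the 3D parameters. The PyTorch sketch below is my own simplified illustration of that loop; `render_view` and `diffusion_unet` are placeholders, not the published implementations.

```python
import torch

def sds_step(nerf_params, render_view, diffusion_unet, alphas_cumprod, text_emb,
             optimizer, w=1.0):
    """One score-distillation update of a differentiable 3D representation.

    render_view:    callable(nerf_params) -> (1, 3, H, W) rendered image (placeholder)
    diffusion_unet: frozen noise predictor eps(x_t, t, text) (placeholder)
    alphas_cumprod: (T,) diffusion schedule (cumulative product of alphas)
    """
    x = render_view(nerf_params)                        # differentiable render
    t = torch.randint(20, len(alphas_cumprod) - 1, (1,))
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * eps       # forward-diffuse the render

    with torch.no_grad():                               # the 2D diffusion model stays frozen
        eps_pred = diffusion_unet(x_t, t, text_emb)

    # Score distillation gradient: w(t) * (eps_pred - eps) * dx/dtheta
    grad = w * (eps_pred - eps)
    loss = (grad.detach() * x).sum()                    # surrogate whose gradient w.r.t. x is `grad`
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```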
Although resolution continues to be a computational challenge for 3D asset generation, neural methods can be applied to perform what is called super-resolution or upscaling (and, conversely, downscaling) to produce dynamic LODs (Levels of Detail) for avatars, assets and scenes. This will reduce the need for exorbitant amounts of asset storage; instead, real-time inference can be used to change the LOD for a particular asset based on its distance from the viewer. At the time of writing, however, it is still more costly to perform super-resolution than to simply store multiple copies of a 3D asset in the network.
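In practice the trade-off comes down to a simple policy decision: keep several pre-baked LODs on disk, or keep one and upscale on demand. Below is a toy sketch of the distance-based selection logic; the thresholds, file names and the `upscale` hook are arbitrary assumptions for illustration.

```python
from typing import Callable

# Pre-baked LODs, ordered from coarsest to finest (thresholds are illustrative).
LOD_LEVELS = [
    (50.0, "asset_lod2.glb"),   # far away: low-poly
    (15.0, "asset_lod1.glb"),   # mid-range
    (0.0,  "asset_lod0.glb"),   # close-up: full detail
]

def select_lod(distance: float) -> str:
    """Pick the stored LOD whose distance threshold the viewer has crossed."""
    for threshold, path in LOD_LEVELS:
        if distance >= threshold:
            return path
    return LOD_LEVELS[-1][1]

def select_lod_on_demand(distance: float, upscale: Callable[[str, int], str]) -> str:
    """Alternative: store only the coarse asset and super-resolve when needed.

    `upscale` is a placeholder for a neural super-resolution / mesh-refinement
    model, trading inference cost for storage as discussed above.
    """
    base = "asset_lod2.glb"
    if distance < 15.0:
        return upscale(base, 2 if distance >= 5.0 else 4)
    return base

print(select_lod(60.0))   # -> asset_lod2.glb
print(select_lod(2.0))    # -> asset_lod0.glb
```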
3D Scene Generation
Virtual worlds will consist of scenes, and these scenes can be thought of as the web pages of the Metaverse. Just as the Web is mostly comprised of web pages, the Metaverse will mostly be comprised of 3D scenes. It follows that to make the Metaverse a reality we need the ability to create 3D scenes with as little friction as possible: low cost, low time and low complexity. This is where generative AI comes in to save the day. Manual and procedurally generated content will still be produced for and in the Metaverse (and authentic human-generated content is likely to be highly sought after due to its scarcity), but there will be a need to rapidly generate scenes ranging from simple low-resolution scenes all the way up to highly complex, highly detailed photo-realistic ones. Recent research from Apple called GAUDI shows promising results in this space, with the ability to generate 3D scenes by disentangling radiance fields from camera poses, supporting image-conditioned (image prompt) and text-conditioned scene generation as well as unconditional generation.
University of Texas 3D scene generation via unsupervised object synthesis.
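Whatever model produces the individual pieces, a scene ultimately reduces to a collection of meshes plus transforms, which is exactly what today's interchange formats such as glTF encode. The sketch below uses the `trimesh` library, with primitive shapes standing in for generated assets, to show how a generated scene could be assembled and exported; the layout and file name are arbitrary.

```python
import numpy as np
import trimesh

# Primitives standing in for generatively produced assets.
house = trimesh.creation.box(extents=[4.0, 3.0, 4.0])
tree = trimesh.creation.icosphere(radius=1.0)
ground = trimesh.creation.box(extents=[20.0, 0.1, 20.0])

scene = trimesh.Scene()
scene.add_geometry(ground, node_name="ground")

# Place each asset with a 4x4 transform (translation only, for simplicity).
def place(mesh, name, position):
    transform = np.eye(4)
    transform[:3, 3] = position
    scene.add_geometry(mesh, node_name=name, transform=transform)

place(house, "house", [0.0, 1.5, 0.0])
place(tree, "tree_1", [5.0, 1.0, 3.0])
place(tree, "tree_2", [-4.0, 1.0, -2.0])

# Export to glTF binary, a common interchange format for virtual worlds.
scene.export("generated_scene.glb")
```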
3D Fashion
The global fashion industry is one of the most environmentally damaging industries: it accounts for up to 10% of global greenhouse gas emissions, about 85% of all textiles end up in landfills or in the ocean, and the industry has a terrible track record when it comes to human rights violations, including child labor. Moving fashion into virtual environments could only have a net positive effect, and the industry is ripe for innovation. There has been growing interest in fashion for use in virtual worlds; however, it has not seen wide adoption in online games. I would expect that to change with the realization of the Metaverse.
Text-guided textures are already a reality, and recent research has demonstrated the ability to create a 3D reconstruction of clothing from only a single input image. Advances in this methodology will allow users to replicate their favorite real-world clothing items for use on their avatars.
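As a small taste of where this is headed, a text-guided garment texture can already be generated with an off-the-shelf diffusion model and then applied to a clothing mesh through its UV map. The sketch below shows the generation half; the model choice, prompt and file name are assumptions, and applying the result to a specific garment mesh is left to the engine or DCC tool in use.

```python
import torch
from diffusers import StableDiffusionPipeline

# Generate a fabric texture from a text prompt (model choice is illustrative);
# assumes a CUDA-capable GPU is available.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

texture = pipe(
    prompt="seamless denim fabric texture with embroidered flowers, flat lighting",
    height=512,
    width=512,
).images[0]
texture.save("garment_texture.png")

# Applying the texture is then an ordinary asset-pipeline step: assign the image
# to the garment mesh's UV-mapped material in the chosen engine or 3D tool.
```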