As a super-fan of “The Matrix” movie, I’ve always been fascinated by the idea of a super-realistic virtual world: a place where the virtual and real are indistinguishable, where every detail mirrors our reality. Given my background working in the AI field since 2018, I’ve observed numerous advancements that inch us closer to making such virtual realities conceivable.

In this blog, I’ll discuss some of these recent advances in artificial intelligence and consider how far we are from achieving a world akin to “The Matrix”.

Matrix = Visual Part + Intelligence Part

I believe there are two core problems that need to be solved to build a Matrix-like world: the visual part and the intelligence part.

Rendering Techniques

One of the cornerstones of creating a Matrix-like world is tricking the human eye with super-realistic rendering. From today’s perspective, there are three possible paths towards achieving realistic rendering: Computer Graphics Rendering, Neural Rendering, and 2D Rendering.

Computer Graphics (CG): This field employs predefined human rules to render 3D explicit assets, often based on the principles of physics to simulate the movement of real light. To enhance computational efficiency, these simulations typically incorporate various approximations. Traditionally, computer graphics are not differentiable; however, making these processes differentiable is a burgeoning area of research within the CG community. CG-based rendering has proven to be highly effective and successful. Indeed, many spectacular movies, including “The Matrix,” and video games like “The Matrix Awakens,” leverage advanced CG techniques.
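
To make “predefined human rules” concrete, here is a minimal sketch of Lambertian diffuse shading, one of the simplest physics-inspired approximations that CG renderers build on. The function and values are purely illustrative, not taken from any real engine.

```python
import numpy as np

def lambertian_shade(normal, light_dir, albedo, light_color):
    """Classical diffuse shading: reflected light is proportional to the
    cosine between the surface normal and the light direction.
    A hand-written physical approximation, not a learned model."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    cos_theta = max(np.dot(n, l), 0.0)       # light behind the surface contributes nothing
    return albedo * light_color * cos_theta  # per-channel RGB reflectance

# A single surface point lit by a white light from above and to the right.
rgb = lambertian_shade(
    normal=np.array([0.0, 1.0, 0.0]),
    light_dir=np.array([1.0, 1.0, 0.0]),
    albedo=np.array([0.8, 0.3, 0.3]),        # reddish material
    light_color=np.array([1.0, 1.0, 1.0]),
)
print(rgb)  # ~[0.566, 0.212, 0.212]
```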

Neural Rendering: Since 2020, Neural Rendering has emerged as a groundbreaking approach to realistic visualization. This technique utilizes neural networks to learn and replicate the rendering process. Notable developments such as Neural Radiance Fields (NeRF) and its successors, such as Gaussian Splatting, exemplify this advancement. By integrating neural scene representations with differentiable rendering, NeRF has demonstrated exceptional visual quality in novel view synthesis, surpassing traditional CG-based methods in some aspects.
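
At the heart of NeRF is a volume-rendering integral: densities and colors sampled along a camera ray are alpha-composited into a single pixel. Below is a minimal, illustrative sketch of that compositing step; in a real NeRF the densities and colors come from an MLP, which I replace here with toy values.

```python
import torch

def composite_ray(densities, colors, deltas):
    """NeRF-style alpha compositing along a single ray.
    densities: (N,) non-negative sigma at each sample
    colors:    (N, 3) RGB at each sample
    deltas:    (N,) spacing between consecutive samples
    """
    alphas = 1.0 - torch.exp(-densities * deltas)       # per-segment opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)  # transmittance after each sample
    trans = torch.cat([torch.ones(1), trans[:-1]])      # light reaching each sample
    weights = alphas * trans                            # each sample's contribution
    return (weights[:, None] * colors).sum(dim=0)       # final pixel color

# Toy field standing in for an MLP: one dense "surface" halfway along the ray.
n = 64
densities = torch.zeros(n)
densities[32] = 50.0
colors = torch.rand(n, 3)
deltas = torch.full((n,), 0.05)
pixel = composite_ray(densities, colors, deltas)
print(pixel)  # dominated by the color at sample 32, and differentiable throughout
```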

2D Rendering: Traditional CG and Neural Rendering approaches primarily focus on converting 3D models into 2D images, regardless of whether the 3D models are explicit (e.g., CG, Gaussian Splatting) or implicit (e.g., NeRF). A novel method involves directly generating 3D-consistent 2D content without adhering to established 3D rules or rendering processes. An example of this is Sora, which, presumably without any 3D modeling, is capable of producing 3D-consistent visual content. In this approach, the conversion from 3D to 2D is implicitly managed by neural networks, eliminating the need for traditional rendering techniques.

Left: 3D Gaussian Splatting Shaders from Alexandre Devaux.
Right: Sora Video from Tansu Yegen.

All three rendering techniques offer compelling advantages alongside real drawbacks. A pivotal question arises: Which rendering method is most suitable for building “The Matrix”?

Computer Graphics: While CG has produced impressive visual content, integrating these systems with cutting-edge AI technologies proves challenging. Traditional CG systems are complex and often not fully differentiable, which hampers gradient propagation to neural networks, a crucial feature for learning from extensive data and understanding effective pixel arrangements to deceive human vision.
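
To see why differentiability matters, consider this toy contrast between hard z-buffer visibility (a discrete argmin that blocks gradients) and a soft, softmax-weighted blend in the spirit of differentiable rasterizers. Everything here is illustrative.

```python
import torch

depths = torch.tensor([2.0, 1.0, 3.0], requires_grad=True)  # three fragments on one pixel
colors = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

# Hard z-buffer: keep the nearest fragment. argmin is piecewise constant,
# so no gradient can flow from the pixel color back to the depths.
hard_pixel = colors[torch.argmin(depths)]

# Soft visibility, in the spirit of soft/differentiable rasterizers:
# nearer fragments get exponentially larger weights, and gradients flow.
weights = torch.softmax(-depths / 0.1, dim=0)   # 0.1 is a sharpness temperature
soft_pixel = (weights[:, None] * colors).sum(dim=0)

soft_pixel.sum().backward()
print(depths.grad)  # non-zero: the renderer is now trainable end-to-end
```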

Neural Rendering: This approach leverages neural networks to emulate real-world rendering by learning from data. However, current neural rendering methods are nascent and limited, predominantly focused on per-scene optimization for static scenes. While they can produce impressive novel views, broader applicability remains a significant hurdle. Challenges include adapting to dynamic scenes and learning from vast datasets rather than memorizing specific scenes.
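
Schematically, per-scene optimization looks like the loop below: a fresh model is fit to one scene’s observations from scratch, and nothing transfers to the next scene. The model, data, and hyperparameters are stand-ins for illustration.

```python
import torch

# A fresh model is optimized against ONE scene's observations from scratch;
# `xyz` and `target` stand in for sampled ray points and ground-truth pixels.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3)
)
opt = torch.optim.Adam(model.parameters(), lr=5e-4)

for step in range(2000):                        # thousands of steps per scene
    xyz = torch.rand(1024, 3)                   # stand-in: 3D sample positions
    target = torch.rand(1024, 3)                # stand-in: observed pixel colors
    loss = ((model(xyz) - target) ** 2).mean()  # photometric L2 loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# To show a *different* scene, this entire loop must be re-run from scratch:
# nothing learned here transfers, which is the limitation described above.
```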

2D Rendering: Techniques like Sora differ fundamentally from the other methods by lacking an explicit 3D data structure, complicating 3D control. For instance, accurately reflecting real-world movements in the generated scenes remains problematic without a 3D intermediate representation.

Hybrid Approach: I advocate for a hybrid model combining Neural Rendering and 2D Rendering, maintaining an underlying 3D structure to leverage optimization and other advanced techniques effectively. Yet, integrating large-scale data learning remains an unresolved challenge. Some suggest Gaussian Splatting as a bridge, though I believe this adds unnecessary complexity, as it does not lend itself to easy processing and generation by networks. A method that maintains the 3D structure while optimizing data and computational efficiency, in line with the insights of The Bitter Lesson, is essential.

Generative Modeling

Generative modeling is a transformative area in AI, crucial for the automatic generation of images, videos, and 3D assets. This field is pivotal in creating expansive virtual landscapes, where manually crafting or digitally scanning every detail of a large-scale environment is impractical. Generative AI can automate this process, populating extensive digital realms with complex, lifelike details.

Generation Target & Representation: The primary challenge in generative modeling is defining the generation target. Inspired by classical computer graphics, a system must understand the scene’s layout, geometry, textures, and materials. Additionally, for motion, information about skeleton binding and physical properties is required. The representation of these elements varies significantly; for instance, traditional geometry uses explicit meshes, whereas future geometries might utilize neural networks. This disparity necessitates different approaches in the design of generative pipelines.
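
As a concrete (and entirely hypothetical) illustration of what a generation target might contain, here is one way to spell out the classical decomposition above as a data structure. The field names are invented, and in practice the geometry slot could just as well hold neural-network weights instead of an explicit mesh.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneAsset:
    """One object in the scene, mirroring the classical CG decomposition.
    All field names here are invented for illustration."""
    vertices: np.ndarray          # (V, 3) explicit mesh geometry
    faces: np.ndarray             # (F, 3) triangle indices
    texture: np.ndarray           # (H, W, 3) albedo map
    material: dict                # e.g. {"roughness": 0.4, "metallic": 0.0}
    skeleton: list | None = None  # bone hierarchy, needed only for articulated motion
    physics: dict = field(default_factory=lambda: {"mass": 1.0, "friction": 0.5})

@dataclass
class Scene:
    """A scene is a layout plus assets; a generative model may target any level."""
    assets: list[SceneAsset]
    poses: list[np.ndarray]       # one 4x4 world transform per asset
```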

Generation Methods: Currently, Diffusion models and Autoregressive Transformers are the leading methods in generative modeling. Each has unique strengths: Transformers excel at modeling discrete data distributions such as language tokens, while Diffusion models are adept at handling continuous, complex distributions such as images. It is still debated which method is superior, but advancements in fundamental architecture are expected to enhance their modeling capabilities and efficiency. Utilizing AI for generation appears to be a definitive trend, capable of fitting more complex data distributions and producing more diverse content than traditional methods.
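
The two paradigms differ most visibly in their sampling loops. Here is a toy, untrained sketch of each, showing only the control flow: autoregressive sampling draws discrete tokens one at a time, while diffusion starts from noise and iteratively denoises a continuous signal.

```python
import torch

# Autoregressive: discrete tokens, drawn one at a time, conditioned on the past.
vocab_size = 512
ar_model = torch.nn.Linear(16, vocab_size)            # stub for a Transformer
h, seq = torch.zeros(16), []
for _ in range(8):
    logits = ar_model(h)
    token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    seq.append(token)                                 # a real model would re-encode seq into h

# Diffusion: start from Gaussian noise and iteratively denoise a continuous signal.
eps_model = torch.nn.Linear(64, 64)                   # stub noise predictor
x = torch.randn(64)                                   # pure-noise "image"
for t in range(50):
    eps_hat = eps_model(x)                            # predict the noise component
    x = x - 0.02 * eps_hat                            # crude, illustrative denoising step

print(seq, x.norm())
```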

Leveraging Prior Models: A significant challenge in AI generative modeling is the requirement for substantial ground-truth data, which is often unavailable for many generation targets. The predominant data modalities available are images, videos, and audio. Therefore, the development of generative models must build upon these foundational models. For instance, generating 3D meshes often involves using an image foundation model (employing techniques like SDS loss) or a video foundation model (such as SVD for 3DGEN).
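
As one concrete example of borrowing an image prior, here is a schematic sketch of an SDS-style update in the spirit of DreamFusion: a frozen diffusion model’s noise prediction supplies a gradient for whatever differentiable 3D representation sits behind `render`. The function names and the weighting choice are assumptions, not a faithful reproduction of any particular implementation.

```python
import torch

def sds_step(render, diffusion_eps, prompt_emb, alphas_cumprod):
    """One schematic Score Distillation Sampling (SDS) update.
    render:        differentiable renderer; closes over the 3D parameters
    diffusion_eps: frozen, pretrained noise predictor eps(x_t, t, y)
    """
    x = render()                                     # image from the current 3D params
    t = torch.randint(20, 980, (1,)).item()          # random diffusion timestep
    a = alphas_cumprod[t]
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps        # forward-noise the rendering
    with torch.no_grad():
        eps_hat = diffusion_eps(x_t, t, prompt_emb)  # the frozen prior's opinion
    w = 1.0 - a                                      # one common weighting choice
    # SDS gradient w * (eps_hat - eps) * dx/dtheta, via a detached residual:
    loss = (w * (eps_hat - eps).detach() * x).sum()
    loss.backward()                                  # accumulates into the 3D parameters
```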

The Role of GPTs

While the aforementioned generative tools can create a physical representation of a world, imbuing this world with a “soul” is a different challenge altogether. To elevate a virtual environment such as the Matrix to lifelike authenticity, we must consider how to infuse it with elements of intelligence and agency. Currently, text-based models like GPTs have demonstrated remarkable capabilities, suggesting their potential role in animating virtual worlds.

Left: DALL·E 3 generated AI Operation System.
Right: DALL·E 3 generated AI NPCs.

Matrix Operating System: Designing and constructing a world as complex as the Matrix is beyond human capabilities alone. Humans can establish initial parameters and rules, but building such a vast and intricate world necessitates the involvement of advanced AI systems. A GPT-like intelligence could serve as the foundation of this virtual world, functioning similarly to an operating system. It would not only comprehend the laws of the real world but also fully understand each generative tool at its disposal. This intelligence could then apply these tools strategically to construct and manage the virtual environment, making it akin to an advanced operating system that orchestrates all aspects of the virtual landscape.
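
One hypothetical shape this “operating system” could take is a planner loop in which a GPT-like model repeatedly picks a generative tool, inspects the result, and decides what to build next. The tool registry and the scripted `llm` stand-in below are invented for illustration; a real system would query an actual model.

```python
# All names here are invented placeholders, not a real API.
TOOLS = {
    "generate_terrain":  lambda spec: f"<terrain: {spec}>",
    "generate_building": lambda spec: f"<building: {spec}>",
    "place_asset":       lambda spec: f"<placed: {spec}>",
}

SCRIPT = iter([
    "generate_terrain: rolling hills with a river",
    "generate_building: small wooden mill by the river",
    "DONE",
])

def llm(prompt: str) -> str:
    """Toy stand-in for a GPT-style planner; a real system would call a model."""
    return next(SCRIPT)

def build_world(goal: str, max_steps: int = 20) -> list[str]:
    state: list[str] = []                                # construction log, fed back each turn
    for _ in range(max_steps):
        reply = llm(f"Goal: {goal}\nSo far: {state}\nTools: {list(TOOLS)}")
        if reply.strip() == "DONE":
            break
        name, _, spec = reply.partition(":")
        state.append(TOOLS[name.strip()](spec.strip()))  # dispatch to a generative tool
    return state

print(build_world("a quiet riverside village"))
```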

Intelligent Agent: Another critical role for GPTs within this framework is to serve as non-player characters (NPCs), including avatars and simulated animals, which enhance the realism of interactions within the virtual environment. Developing these agents involves complexities comparable to creating ‘real’ humans. These intelligent agents need to perceive their environment accurately and interact with it in a manner that is both responsive and realistic. They should possess a deep understanding of the virtual world’s dynamics, enabling them to perform actions and react in ways that are indistinguishable from real beings.
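
A minimal perceive-decide-act loop for such an NPC might look like the sketch below. The observation format and the `npc_llm` policy are assumptions for illustration, with the language model replaced by a trivial rule.

```python
import json

def npc_llm(observation: str, memory: list[str]) -> str:
    """Toy stand-in for an LLM policy; a real NPC would prompt a language model."""
    return "wave" if "player_nearby" in observation else "wander"

def npc_tick(world_state: dict, memory: list[str]) -> str:
    obs = json.dumps(world_state)               # perceive: serialize the surroundings
    action = npc_llm(obs, memory)               # decide: the policy picks an action
    memory.append(f"saw {obs}, did {action}")   # remember: persistent agent state
    return action                               # act: handed off to the game engine

memory: list[str] = []
print(npc_tick({"player_nearby": True, "time": "dusk"}, memory))  # -> "wave"
```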

The implementation of GPTs in these roles could dramatically transform the experience of virtual worlds, providing not only the backdrop but also the interactive components that make the digital environment truly immersive and lifelike.

Conclusion

An evolving principle in current research is the prioritization of scaling laws from the outset. Traditionally, experiments began on a small scale, testing datasets and architectures to validate basic assumptions and identify errors—a necessary step given the cost and time constraints associated with large-scale experiments. However, the research paradigm is shifting. The tremendous benefits of scaling up have become apparent: by using more data and computation, models demonstrate remarkable generalization capabilities for real-world applications. Today, a more resource-intensive yet effective approach involves starting with larger datasets that encompass diverse scenarios, followed by focused methodological research. For those with limited resources, it remains acceptable to start small. Nonetheless, it is crucial to design experiments with scalability in mind. As Albert Einstein reportedly advised, we should make our approaches as simple as possible, but no simpler, focusing only on the essential elements of design to ensure future scalability.
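
To make the scaling-law mindset concrete, here is a small sketch that fits the usual power-law form L(N) ≈ a · N^(−b) to a handful of made-up (model size, loss) points and extrapolates one order of magnitude further. The numbers are invented purely for illustration.

```python
import numpy as np

# Made-up (parameter count, validation loss) pairs, for illustration only.
N = np.array([1e6, 1e7, 1e8, 1e9])
L = np.array([4.2, 3.1, 2.3, 1.7])

# L = a * N^(-b) is linear in log space: log L = log a - b * log N.
slope, log_a = np.polyfit(np.log(N), np.log(L), 1)
a, b = np.exp(log_a), -slope
print(f"L(N) ~ {a:.1f} * N^(-{b:.3f})")
print(f"extrapolated loss at 1e10 params: {a * 1e10 ** -b:.2f}")
```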

I am convinced that we are closer than ever before to realizing a Matrix-like world, a concept once confined to the realm of science fiction. The advancements in generative modeling, neural rendering, and AI-driven interactive systems have brought us to the brink of creating virtual environments that are both vast and detailed, blending seamlessly with elements of intelligence and realism. Yet, numerous challenges remain. We must continue to refine these technologies, improve their integration, and address the myriad technical and ethical questions that such complex simulations raise. As researchers, it is our prerogative—and indeed our responsibility—to navigate these uncharted territories, pushing the boundaries of what is possible. The journey is fraught with difficulties, but the potential to revolutionize our interaction with digital worlds is an exhilarating prospect. I look forward to contributing to and witnessing the evolution of this transformative technology.