A joke is circulating among investors today: “I can finally get a good night’s sleep, because I no longer have to worry about the video generation companies I’ve invested in being overtaken by others.”

Last month, in an interview with Jiazi Light Year ("AI one day, human world one year: My 2023 with AI"), I predicted four major trends for 2024, the first of which was video generation. I didn't expect it to come true so quickly. (Of course, the videos Sora currently generates do not contain complex semantics, and it cannot generate in real time, so others still have a chance.)

  1. Multimodal large models can understand videos in real time and generate videos containing complex semantics in real time;
  2. Open-source large models reach the level of GPT-4;
  3. The inference cost of GPT-3.5 level open-source models drops to one percent of the GPT-3.5 API, alleviating cost concerns when integrating large models into applications;
  4. High-end phones support local large models and automatic App operations, making everyone’s life inseparable from large models.

Video Generation Models as World Simulators

The title of OpenAI’s technical report is also very meaningful: Video generation models as world simulators.

The last sentence of the technical report is also well written: We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards capable simulators of the physical and digital world, and of the objects, animals, and people that live within them.

In fact, as early as 2016, OpenAI explicitly stated that generative models are the most promising direction for computers to understand the world. They even quoted the physicist Feynman: “What I cannot create, I do not understand.”

At the end of last year, I listened to a panel where several experts discussed whether large models need to be world models, and surprisingly, a few of them thought that large models could generate well without understanding the world. So I feel that the biggest gap in domestic large models is vision, or research taste: knowing what can be done and what cannot, which technical paths are reliable and which are not. Many people’s research taste really differs greatly from OpenAI’s. They like to collect bits of news about OpenAI, saying, “GPT-4 probably used these tricks; we can follow them and avoid some detours.” This is like the late Qing Dynasty’s “learning from the West to control the West,” still stuck at the “material” stage.

In 2018, when I was looking for a job after my Ph.D., I also interviewed with several autonomous driving companies. When I found out that the autonomous driving technology of the time relied on tens of thousands of if-else rules, I had serious doubts about whether that technical route could ever achieve L4 autonomous driving. I said at the time that, for example, when something has fallen on the road, deciding whether it can be run over or must be avoided or braked for requires a world model that understands the properties of various objects, so that the right decision can be made with high probability. Unfortunately, in 2018, not many people thought a world model was possible, nor did they think it was necessary.

Some people saw the 4-second Sora video below and thought that RunwayML’s Gen2 seemed able to produce videos of similar quality. But looking at the consistency and adherence to physical laws demonstrated in the details of the other videos on the OpenAI Sora release page, it is clear that Sora’s capabilities far exceed those of all existing video generation models.

Many people have probably seen the stunning videos on the OpenAI Sora release page, but many may not have read to the last part of the technical report. I think the “Emerging simulation capabilities” section is the essence of Sora.

Simulating Virtual Worlds

I think what best demonstrates Sora’s strength is actually the second-to-last set of videos in the technical report: generating Minecraft game videos from a given text prompt. If it is not simply regurgitating lightly tweaked Minecraft videos from the training data, then it means Sora truly understands the Minecraft game, including the physical laws and common-sense knowledge embedded in the game’s physics engine.

The last set of videos in the Sora technical report is a failure case: the flow of liquid when the cup shatters clearly does not conform to physical laws. The technical report considers this a major limitation of Sora. This again shows that what the Sora team cares about most is whether the model’s simulation of the world is accurate.

The “Emerging simulation capabilities” section, in addition to simulating virtual worlds, also discusses three important features:

3D Consistency

Sora can generate videos with dynamic camera movement. As the camera moves and rotates, characters and scene elements move consistently in three-dimensional space.

Temporal Coherence and Object Permanence

A major challenge for video generation systems has always been maintaining temporal consistency when sampling long videos. Sora can often, though not always, effectively model both short-range and long-range dependencies. For example, it can keep characters, animals, and objects persistent even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character within a single sample, maintaining their appearance throughout the video.

Interaction with the World

Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new brush strokes on a canvas that persist over time.

Sora is a Data-Driven Physics Engine

I’ve long said that video generation can be trained on both real videos and game videos; the key is understanding the world model behind the physics engine. Similarly, many corner cases in autonomous driving are simulated in game-like simulators. Many people disagree with this Sim2Real approach, arguing that game scenes differ from the real world, that game videos are garbage data, and that training only on real-world videos is surely better. I believe the key to video generation is not whether the model’s textures are detailed, but whether it understands physical laws and the properties of various objects.

NVIDIA research scientist Jim Fan agrees with me. He said on Twitter:

If you think OpenAI Sora is just a creative toy like DALL·E, …then think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, “intuitive” physics, long-horizon reasoning, and semantic grounding, all through some denoising techniques and gradient maths.

I wouldn’t be surprised at all if Sora were trained on a large amount of synthetic data generated with Unreal Engine 5. It has to be!

Let’s analyze the following video. Text prompt: “A photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee.”

  • The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve the text-to-3D problem implicitly in its latent space.
  • The 3D objects are consistently animated as they sail and avoid each other’s paths.
  • The fluid dynamics of the coffee, even the foam that forms around the ships. Fluid simulation is an entire subfield of computer graphics, traditionally requiring very complex algorithms and equations.
  • Photorealism, almost comparable to ray-traced rendering.
  • The simulator takes into account the small size of the cup compared to the ocean, and applies tilt-shift photography to give a “miniature” feel.
  • The semantics of the scene do not exist in the real world, yet the engine still implements the physical rules we would expect.

Next: add more modalities and conditioning, and then we have a fully data-driven UE (Unreal Engine), which will replace all hand-crafted graphics pipelines.

Later, Jim Fan added:

Apparently, some people don’t quite understand what a “data-driven physics engine” is, so let me clarify. Sora is an end-to-end diffusion transformer model. It takes text/images as input and directly outputs video pixels. Sora learns a physics engine implicitly in the neural network’s parameters through gradient descent, all from a large amount of video data.

Sora is a learnable simulator, or a “world model”. Of course, it doesn’t explicitly call UE5 in the loop, but it’s likely that UE5-generated (text, video) pairs were added to the training set as synthetic data.
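
To make this description a bit more concrete, here is a minimal sketch of what “an end-to-end diffusion transformer that learns a physics engine implicitly through denoising and gradient descent” could look like in code. Everything below — the patching scheme, model sizes, conditioning, and training step — is an illustrative assumption of mine, not Sora’s actual architecture, which OpenAI has not released.

```python
# A minimal, heavily simplified sketch of a "diffusion transformer over
# spacetime video patches". NOT Sora's real architecture; every size and
# design choice here is an illustrative assumption. Positional embeddings,
# noise schedules, and unpatchify/decoding are omitted for brevity.
import torch
import torch.nn as nn

class TinyVideoDiT(nn.Module):
    def __init__(self, patch=(4, 8, 8), channels=3, dim=256, text_dim=256,
                 depth=4, heads=4):
        super().__init__()
        self.patch = patch
        patch_elems = channels * patch[0] * patch[1] * patch[2]
        self.embed = nn.Linear(patch_elems, dim)    # spacetime patch -> token
        self.text_proj = nn.Linear(text_dim, dim)   # text embedding -> conditioning token
        self.time_proj = nn.Linear(1, dim)          # diffusion timestep -> conditioning token
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.unembed = nn.Linear(dim, patch_elems)  # token -> predicted noise patch

    def patchify(self, video):
        # (B, C, T, H, W) -> (B, num_patches, patch_elems)
        B, C, T, H, W = video.shape
        pt, ph, pw = self.patch
        x = video.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
        x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)
        return x.reshape(B, -1, C * pt * ph * pw)

    def forward(self, noisy_video, t, text_emb):
        tokens = self.embed(self.patchify(noisy_video))
        cond = torch.stack([self.text_proj(text_emb),
                            self.time_proj(t[:, None])], dim=1)
        h = self.backbone(torch.cat([cond, tokens], dim=1))
        return self.unembed(h[:, cond.shape[1]:])   # predicted noise, per patch

# One denoising training step on a fake (text, video) pair: corrupt the video,
# ask the model to predict the noise, and follow the gradient.
model = TinyVideoDiT()
video = torch.randn(2, 3, 8, 32, 32)   # stand-in for a real or UE5-rendered clip
text_emb = torch.randn(2, 256)         # stand-in for a text-encoder output
t = torch.rand(2)                      # diffusion time in [0, 1]
noise = torch.randn_like(video)
noisy = (1 - t.view(-1, 1, 1, 1, 1)) * video + t.view(-1, 1, 1, 1, 1) * noise
loss = nn.functional.mse_loss(model(noisy, t, text_emb), model.patchify(noise))
loss.backward()
```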

Detailed Technical Analysis

For more technical analysis of Sora, see our co-founder @SIY.Z’s answer: “How do you view the latest release of Sora by OpenAI?”

The Cost of Sora and OpenAI’s $7 Trillion Gamble

I’ve noticed that few people mention the cost of generating videos with Sora. My preliminary estimate is that generating one minute of video with Sora costs tens of dollars, which is more expensive than Runway ML’s Gen2 (about $10 per minute).

Many people selectively ignore cost. For example, when GPT-4 began supporting a 128K context, few mentioned that filling that 128K context once costs $1.28 in input tokens alone. Today Gemini 1.5 claims to support a 10M context, and we still don’t know what that 10M context will cost. If video generation costs tens of dollars per minute, it will be limited to professional film and game producers and cannot be used to generate TikTok-style short videos.
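
As a back-of-the-envelope sanity check on these numbers, the arithmetic looks roughly like this. The prices are the ones cited above plus an assumed GPT-4 Turbo input price of $0.01 per 1K tokens; the daily clip count is a purely hypothetical scale for a short-video platform.

```python
# Back-of-the-envelope cost arithmetic. Prices are assumptions taken from the
# text (GPT-4 input pricing at the time, Gen2 ~$10/min, Sora estimated at
# "tens of dollars" per minute); the clip count is hypothetical.
gpt4_input_price_per_1k_tokens = 0.01   # USD, assumed
context_tokens = 128_000
print(f"Filling a 128K context once: "
      f"${context_tokens / 1000 * gpt4_input_price_per_1k_tokens:.2f}")  # -> $1.28

gen2_per_minute = 10.0            # USD per minute of video, approximate
sora_per_minute_estimate = 30.0   # USD per minute, my rough guess
clips_per_day = 1_000_000         # hypothetical short-video platform volume
print(f"Gen2 ~${gen2_per_minute}/min vs Sora estimate ~${sora_per_minute_estimate}/min")
print(f"One million one-minute clips per day with Sora: "
      f"${clips_per_day * sora_per_minute_estimate:,.0f} per day")
```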

This is why OpenAI is pursuing a reported $7 trillion investment in chip manufacturing. Many people think Sam Altman is crazy, but I think he sees the real bottleneck of AI: computing power.

Currently, most of the cost of AI training and inference is still on GPUs. Many companies have encountered a GPU shortage when training GPT-4 level models, not to mention AGI. Those familiar with chip manufacturing can easily calculate that the selling price of chips like A100/H100 is about ten times the cost of TSMC’s wafer production. This high markup is partly due to the huge R&D costs of chips and software ecosystems, and partly due to monopoly pricing.

There was a time when FPGAs were also very expensive. When Microsoft decided to deploy FPGAs in every server in its data centers, it placed orders for hundreds of thousands or even millions of units with Altera, directly driving the bulk price of that FPGA down to one-tenth of the retail price. Later, Altera itself was acquired by Intel at a high price. As we like to say, as long as chips are produced in large enough volume, the chips themselves are as cheap as sand.

Seven years ago, I rented a basement and assembled dozens of mining rigs of various kinds. The main cost of mining was electricity, not the ASICs or GPUs themselves. In an interview last May, I mentioned that computing power is the key constraint on AI: data centers currently account for about 1%~2% of humanity’s energy consumption, and there has been no major breakthrough in energy technology, so whether the computing power available under current energy and chip technology can support such a large demand is a challenging question.

Currently, the energy consumed by AI computing is only a small fraction of what data centers use. If AI computing were to account for 10% of humanity’s energy consumption, we might need roughly 100 times today’s AI chips and the energy to run them, which far exceeds the manufacturing capacity of all chip makers, TSMC included.
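
A rough version of the arithmetic behind that “100 times,” with the inputs spelled out as assumptions (only the 1%~2% data-center figure comes from the paragraph above):

```python
# Rough arithmetic behind the "100x" figure. Both inputs are assumptions for
# illustration: data centers use ~1%~2% of human energy (from the text), and
# AI is assumed to be only a small slice of that, ~0.1% of the total.
ai_share_of_human_energy_today = 0.001   # assumed: ~0.1% of all human energy use
target_ai_share = 0.10                   # scenario: AI consumes 10% of human energy
scale_up = target_ai_share / ai_share_of_human_energy_today
print(f"Implied scale-up in AI chips and the energy to run them: ~{scale_up:.0f}x")
```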

So someone might ask: if we spend $7 trillion on chips, can the AI we train create $7 trillion in value? If you think AI is merely about creating the next mobile internet, then you’re thinking too small.

The real value of AGI lies in creating new forms of life, creating more efficient ways to convert energy into intelligence.

Today, the human brain achieves, on less than 30 W of power, a level of intelligence greater than that of a 10-kilowatt 8-card H100 inference server. But I believe that as large models and chip technology progress, silicon-based life will eventually achieve higher energy efficiency than carbon-based life. In a universe with limited energy, AGI may not only use energy more efficiently and spread intelligence throughout the universe more conveniently in the form of information, but may also find the key to completely solving the energy problem. To create silicon-based life, AGI must be a world model, able to interact with the real world and continuously enhance its intelligence through autonomous learning.
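
For what it’s worth, the efficiency gap implied by that comparison is easy to put a number on (both wattages are the figures used above):

```python
# Energy-efficiency gap implied by the brain-vs-server comparison above.
h100_server_watts = 10_000   # 8-card H100 inference server (figure from the text)
brain_watts = 30             # human brain (figure from the text)
print(f"Power gap: ~{h100_server_watts / brain_watts:.0f}x "
      f"(the brain runs on roughly 1/{h100_server_watts // brain_watts} of the server's power)")
```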

Sora, This Familiar Name

The name Sora feels very familiar: MSRA’s software-defined radio project was also called Sora, and “Sora” means “sky” in Japanese. Back then, the project had a machine-translated Chinese webpage that even rendered “Sora” as “Sora Aoi.”

MSRA's Sora software radio project

The 12th floor of MSRA had a Sora Lab filled with all kinds of software radio equipment. A senior colleague who worked on wireless reminded us never to touch the antennas on the tables when entering the Sora Lab: once disturbed, they would take anywhere from a few days to a week to recalibrate. Sometimes, when the Sora Lab ran out of space, they had to temporarily use the large conference room on the 12th floor for software radio experiments; anyone who has spent time at MSRA will be very familiar with that conference room.

Adjusting the Sora software radio in MSRA's large conference room

Sora was the most advanced software radio platform of its time and had a significant impact, with many universities and research institutions using it for software radio research. My advisor, Bo Tan, even gave me a book he wrote, “Cognitive Software Radio Systems: Principles and Experiments,” which is about Sora.

I hope the name Sora can bring us a vast sky, spreading the seeds of civilization to every shining star in the sky.
