BAAI Chairman Huang Tiejun: The Path to AGI Has Been Found
In February this year, a paper by the Beijing Academy of Artificial Intelligence (BAAI) titled “Multimodal learning with next-token prediction for large multimodal models” was published in Nature.
This marks the second time a Chinese large-model research team has published work in Nature following DeepSeek’s cover feature, and it is also the first time a domestic research institute in China has had a paper published in the journal’s main edition.
Most current multimodal models rely on separate processing pathways for text, images, and video, leaving open the question of whether a single unified approach is possible.
A recent Nature study from the Beijing Academy of Artificial Intelligence (BAAI) suggests that autoregressive modeling—the same next-token prediction paradigm used in large language models—may provide such a unifying framework.
Built on the multimodal model Emu3, the research shows that a purely autoregressive architecture can handle both perception and generation tasks at performance levels comparable to specialized state-of-the-art models. The unified framework also extends naturally to applications such as robotic manipulation and multimodal interactive content generation.
In an interview with The Intellectual, Huang Tiejun, chairman of BAAI and professor at Peking University, explained how Emu3 enables multimodal unification and discussed the broader technical pathway toward AGI.
The Path to AGI Has Been Found — The Next Step Is to Fully Explore It
The Intellectual:
AI capabilities have advanced rapidly in recent years. BAAI has been closely following developments in the field. Looking back at the breakthroughs of the past few years, what do you think was the truly decisive turning point?

Huang Tiejun:
From 2018 to the present, people have discovered a technological path that actually works: the autoregressive route. This approach is based on the Transformer architecture and trains models by predicting the next token in a sequence. That was the most important breakthrough from zero to one, and this path leads toward artificial general intelligence (AGI).

BAAI has always held a conviction: if the combination of Transformers and next-token prediction has worked so completely for language models, can it be extended to all modalities of data—language, images, video, and even multimodal data such as vision–language–action (VLA)? From a methodological perspective, I believe this is entirely feasible.
When people talk about language, images, and video today, they are really referring only to the most common and easily understood forms of data. In fact, this method can accommodate any type of data, including data from different layers of the real world.
However, this is still just our belief. To truly realize it, we must continue testing it with real data. Technological innovation can only be filtered by time and proven by results.
The Intellectual:
You see 2018 as a turning point. What changed around that time?

Huang Tiejun:
Before 2018, artificial intelligence was mainly human-designed intelligence: humans themselves designed the systems. Whether knowledge bases or expert systems, designers controlled every piece of logic behind the system almost like gods. This reflected a traditional scientific mindset: first fully understand the underlying principles, and then manually design a system based on those principles.

After 2018, with the birth of the first generation of GPT, so-called generative AI emerged, and the methodology fundamentally changed. Many people interpret “generation” as the system producing text, images, or videos. But I prefer to interpret generation in a way similar to the emergence of life on Earth—namely, evolutionary generation.
On Earth, life evolved from non-life, and from simple forms to complex ones. Is there an incentive mechanism behind that process? Of course there is. But we still do not clearly understand what that mechanism is. Life science and brain science have been studying these questions for many years, yet overall they remain a kind of “dark forest.” We are only gradually discovering the principles behind them.
The changes after 2018 are similar. People have found a feasible technological route toward general artificial intelligence: training models through data-driven methods so that intelligence emerges. Yet the interactions that occur within this technological route remain unclear.
The Intellectual:
You said the autoregressive route is the only path toward AGI. But there is much debate about the definition of AGI. What is your view?

Huang Tiejun:
My view is that artificial general intelligence has already been realized to a certain extent.

According to the traditional way of thinking, people feel it has not been realized yet, because the underlying principles have not been fully understood. How can we say it has been achieved if we do not understand how it works? But today’s large models already exhibit strong general capabilities. You can test them. In terms of capability, they are stronger than many humans. Under such circumstances, insisting that they are not generally intelligent systems becomes somewhat unreasonable.
Changes in people’s understanding of AGI are also related to changes in AI concepts throughout history. The earliest definition of general intelligence focused on behavior, function, and performance—essentially the Turing Test. If a third-party evaluator cannot distinguish between a human and a machine during interaction, then the machine has passed the test. Today’s large models have already reached this level.
The term AGI emerged roughly in the late 1990s, only a little over twenty years ago. People generally interpret AGI simply as general artificial intelligence. But strictly speaking, the concept proposed in the 1990s was actually harder to realize—it assumed AI would need self-awareness.
If AGI means AI with self-awareness, then I believe it has not yet been achieved, or at least it remains an open question. But if we do not adopt such an overly strict definition, and instead define AGI as a system capable of performing a wide variety of tasks like humans—possessing generality—then I believe we already have it.
The Intellectual:
Why can the autoregressive route bring about such a transformation?

Huang Tiejun:
This method captures the essence of how intelligence evolves. “Predicting the next token” seems simple, but it actually touches the core problem of intelligence, because every intelligent system essentially does one thing: it uses the past to infer the future.

Animals rely on past experience to decide whether to flee. Humans infer economic trends from historical data. We read books in order to improve our ability to judge the future. The most basic function of intelligence is to increase the probability of making reasonable predictions in uncertain environments. The evolution of biological intelligence is essentially a process of increasing the probability of making correct choices.
This path contains two indispensable components. The first is the Transformer. If we draw an analogy with life sciences, it represents the “structural foundation.” In biology we say “structure determines function.” The kind of DNA an organism has determines its physiological form. In the AGI domain, the Transformer plays the role of that fundamental structure.
But structure alone is not enough. Intelligence evolves gradually through interaction with the surrounding world. The human brain works the same way: intelligence is not formed all at once but evolves as the environment changes. This is what we call function shaping structure—the pressure of the environment drives structural changes.
In artificial intelligence, this evolutionary process depends on data-driven learning. Large models learn patterns through autoregressive training, meaning they repeatedly predict the next token. Each prediction is an attempt. If the prediction is wrong, the model adjusts its internal parameters according to the data. If it is correct, those connections are reinforced. Under the influence of massive data, the model gradually learns the patterns of language, logic, and even multimodal information. The combination of Transformers and autoregressive training satisfies the fundamental conditions for the evolution of intelligence.
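The training loop described above can be caricatured with a deliberately tiny, gradient-free stand-in: a bigram counter that absorbs every observed "next token" from the data, where counting plays the role that backpropagation over billions of parameters plays in a real large model. The corpus and function names are purely illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Record, for every token, which tokens were observed to follow it.
    Each observed pair is one 'next-token prediction' the model absorbs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Predict the most frequently observed successor of `token`."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

The point of the toy is the shape of the loop, not its power: a real model replaces the counter with a Transformer whose parameters shift slightly after every prediction error.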
The Intellectual:
How exactly does next-token prediction work?

Huang Tiejun:
Tokens are the basic units of natural language processing. They can be words, phrases, or roots; they can also be punctuation or artificially defined markers. Essentially they are symbols. There are two ways to understand the meaning of symbols. One is direct sensory experience. But AI has no body, so it can only learn meaning through relationships among symbols.

Before 2018, early word-embedding methods used statistical co-occurrence relationships between words to map each word into a high-dimensional vector space. Words that frequently appeared together were placed closer together in this space.
But those representations were static. No matter what context a word appeared in, its vector representation remained largely unchanged. The model learned the average meaning of a word rather than its dynamic role in a particular context. In other words, it solved the question “what does this word mean?” but not “what does this word mean in this sentence?”
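A minimal sketch of that pre-2018 idea: each word gets one fixed vector built from neighbor counts, so the word carries the same representation no matter which sentence it later appears in. The corpus and window size here are made up for illustration.

```python
import math
from collections import Counter

def cooccurrence_vectors(tokens, window=1):
    """One fixed vector per word: counts of the words seen within `window`.
    The vector never changes with context -- that is the static limitation."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0] * len(vocab) for w in vocab}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vecs[w][index[tokens[j]]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity: words with similar neighbor profiles score near 1."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = "cats chase mice dogs chase cats mice fear cats".split()
vecs = cooccurrence_vectors(corpus)
# "cats" and "dogs" share the neighbor "chase", so their vectors end up close.
print(cosine(vecs["cats"], vecs["dogs"]))
```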
The emergence of Transformers changed this. Take the novel Dream of the Red Chamber as an example: clues and foreshadowing across dozens of chapters are interconnected. Understanding a character requires not only knowing their name but also examining their interactions and experiences. What Transformers can do is discover relationships between any two tokens within a given sequence. In the context of Dream of the Red Chamber, it means computing the correlations between any two characters in the entire book.
Human intelligence also works this way when reading a novel or long text: it builds relationships within context and repeatedly analyzes the logic. The model is essentially doing the same thing, only at a much larger scale and in much higher dimensions. It not only understands the content, but in many cases understands it more thoroughly than most human readers.
Therefore, when the model predicts the next token, it is not simply performing frequency statistics. It is invoking a highly complex structure that compresses the relationships within the entire context. Prediction is merely the surface manifestation; what actually happens is that the structure internalizes patterns and then infers future developments through those relationships.
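The "relationships between any two tokens" idea can be sketched as scaled dot-product attention, the computational core of the Transformer, here stripped to its bones: three toy 2-d token vectors, no learned projections, no multiple heads. The vectors are arbitrary illustrative values.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: every token scores its relation to
    every other token, then mixes the value vectors by softmax weights."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]      # softmax over all token pairs
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three 2-d token vectors; each output row blends all three inputs,
# weighted by pairwise similarity -- the "any token to any token" links.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(x, x, x)
```

In a real Transformer the same computation runs over thousands of tokens, in hundreds of dimensions, across many heads and layers, which is what makes whole-book relationship modeling feasible.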
Let AI Predict the Physical World the Way It Predicts Language
The Intellectual:
AI systems already demonstrate impressive capabilities. But many researchers argue that unless we can fully explain the internal mechanisms of these models, they cannot truly be considered general intelligence.

Huang Tiejun:
To be honest, that is a typical bookish mindset. After DeepSeek caused a global sensation, DeepMind CEO Demis Hassabis commented that “DeepSeek may be China’s best AI model, but it has not demonstrated any new scientific progress.” That kind of criticism looks at technological innovation through the lens of traditional science.

If we must draw an analogy, many major technological breakthroughs in human history followed the pattern of “technology first, scientific theory later.” When the Wright brothers built the airplane, aerodynamics was far from mature. The principles of flight had not yet been fully explained by theory. Yet airplanes still flew and changed the world.
AI development today is at a similar stage. Large models are closer to engineering innovations than to pure scientific exploration in the traditional sense. Through the methodology of “predicting the next token,” humans have already created intelligent systems with general capabilities. That practical success is undeniable.
Another point must be clear: intelligence itself is extremely complex. It cannot be reduced to a few rules or formulas. Simply because something does not conform to a particular theoretical framework does not mean we should deny the intelligence demonstrated by current models. That would be as absurd as refusing to acknowledge that airplanes can fly.
The Intellectual:
But if we never understand the principles behind large models, can this kind of technological innovation be considered rigorous science?

Huang Tiejun:
Understanding the principles is not necessary. When I say “not necessary,” I do not mean that principles are useless or undesirable. I mean they should not be a prerequisite. Once people start emphasizing necessity, many assume that we must first invent a theoretical framework before moving forward. I believe that mindset actually limits people’s ability to make larger contributions, because their thinking becomes too rigid. I used to think that way myself, but eventually I liberated myself from it.

We have discovered an effective methodology that can convert massive amounts of data into intelligence. That method works. The mechanisms behind it are questions for future scientific research, but they should not become a reason to stop technological innovation. We should not deny objective technological results simply because they do not fit within familiar scientific frameworks.
The priority now is engineering and scaling—pushing this path deeper and further. As for the scientific principles of artificial intelligence, future researchers will eventually uncover them.
The Intellectual:
If intelligence cannot be summarized into a few formulas, can we still establish benchmarks to measure its development?

Huang Tiejun:
We can set measurement metrics, but as intelligence becomes more complex, the measuring tools must also evolve. True intelligence has infinite complexity. We cannot force it to fit within static standards. Any finite measurement only provides a small window into understanding it, not the full picture.

The Intellectual:
You have repeatedly emphasized that large models are primarily a technological innovation. Yet top journals like Nature usually prioritize fundamental scientific contributions. Why did BAAI decide to submit the Emu3 research there?

Huang Tiejun:
I hope to correct the biases of traditional natural science. Many people trained in natural sciences are constrained by their own ways of thinking. They are accustomed to studying objects that already exist in nature and discovering the laws behind them.

Artificial intelligence is different. AI systems do not exist naturally; they must be created. They are technological inventions. In this sense, AI research is almost the opposite of traditional natural science. Applying the same mode of thinking from one direction to the other is fundamentally misguided.
Many people keep asking, “What are the laws behind artificial intelligence?” But laws can only be studied after the object itself exists. Life exists, so we can study the laws of life. But artificial intelligence systems are still being built. If we demand a complete theory before building them, we effectively block technological innovation.
If we wait until everything is theoretically understood before starting, we might not build AI even in 300 years. The history of technology never works that way. The normal pattern is technological breakthroughs first, scientific explanations later. First airplanes, then aerodynamics. First build artificial intelligence, then study the science of artificial intelligence. As the old saying goes: “When you understand what should come first and what should come later, you are close to the right path.” If the sequence itself is confused, using the standards of natural science to judge an entirely different direction is hardly something to be proud of.
Unifying Multimodality Through Autoregression
The Intellectual:
Your Nature paper argues that multimodal learning can be unified through the autoregressive approach. What do you see as the limitations of current mainstream multimodal models?

Huang Tiejun:
When people talk about multimodality today, they often think of “multiple modalities”—simply combining vision, audio, and text together.

For example, Transformers perform very well in text tasks but were not originally designed for multimodal problems. Image and video generation now largely rely on diffusion models, which generate high-resolution outputs through iterative denoising. For vision-language perception, many approaches combine CLIP encoders with large language models.
If the goal is to solve a specific problem within a single modality, designing specialized architectures can work quite well. But if every modality requires special patches and separate architectures, that cannot be called general intelligence. The real question is whether there exists a general route capable of handling intelligence across all modalities and all forms of data.
That is the value of the autoregressive route. It is also why we believe it is the core pathway toward AGI. Emu3 was developed under this philosophy. Our experiments show that even without diffusion models or hybrid architectures, a purely autoregressive model can achieve flagship-level performance in both perception and generation tasks.
The Intellectual:
Your paper mentions that Emu3 generates video purely through an autoregressive approach and performs comparably to diffusion models. What is the fundamental difference between these two approaches?

Huang Tiejun:
Diffusion does generate content, but it is not the evolutionary generation I mentioned earlier. The two are fundamentally different.

The autoregressive approach is suitable for all types of data. By predicting the next token, it can model any form of information—images, videos, even robot actions. That is why we insist on the autoregressive route. It has strong potential to unify all modalities.
Diffusion models are excellent for generating images and videos. Their core idea simulates a physical diffusion process—like ink dispersing in water. Starting from noise and reversing the diffusion process produces an image or video. This method excels at generating visually realistic outputs, but it does not focus on the underlying relationships between objects in the scene. It is suitable for the relatively narrow domain of image generation.
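The difference between the two paradigms is visible in their sampling control flow alone. Below is a hedged sketch: `predict_next` and `denoise_step` are hypothetical stand-ins for learned models, replaced here by trivial functions just so the loops run.

```python
import random

def sample_autoregressive(predict_next, length):
    """Autoregression: build the output one token at a time; every step
    conditions on everything generated so far (a growing sequence)."""
    seq = []
    for _ in range(length):
        seq.append(predict_next(seq))
    return seq

def sample_diffusion_like(denoise_step, size, steps):
    """Diffusion-style sampling: start from pure noise and refine the
    whole canvas at once, step by step -- no left-to-right ordering."""
    canvas = [random.gauss(0, 1) for _ in range(size)]
    for t in range(steps, 0, -1):
        canvas = denoise_step(canvas, t)
    return canvas

# Toy stand-ins for the learned models (purely illustrative):
tokens = sample_autoregressive(lambda seq: len(seq) % 2, 6)        # alternate 0/1
image = sample_diffusion_like(lambda c, t: [v * 0.5 for v in c], 4, steps=8)
```

The first loop naturally extends to any token stream (text, image patches, actions); the second is tied to iteratively refining a fixed-size continuous canvas, which is why it suits images but not symbolic sequences.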
When dealing with language or other abstract data, the situation changes. In language, words form complex semantic and structural relationships. Characters, events, and concepts in a novel form a massive interconnected network. This complexity far exceeds the interactions among molecules or pixels in the physical world. Diffusion methods cannot effectively model such relationships. They cannot capture deep logical connections between words or infer future developments.
The Intellectual:
Will future research extend this approach to other modalities?

Huang Tiejun:
The answer is already in the paper. We converted Emu3 into a vision-language-action (VLA) model and directly tested it on robotic manipulation tasks. On the CALVIN long-horizon benchmark, this general approach performs just as well as models specifically designed for robotics.

Another important point is that we directly use discrete encodings for vision, language, and action. Some other approaches require additional video training stages. This once again proves that autoregression is a universal logic. It does not require task-specific patches. Once the logic works, it naturally extends from perception and generation to embodied intelligence.
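One way to picture "discrete encodings for vision, language, and action in one sequence" is disjoint id ranges inside a single shared vocabulary, so one next-token predictor can emit any modality. The layout, base offsets, and example ids below are purely hypothetical, not Emu3's actual tokenizer.

```python
# Hypothetical token-id layout: each modality gets its own id range, so text,
# image patches, and robot actions all live in one shared vocabulary.
TEXT_BASE, IMAGE_BASE, ACTION_BASE = 0, 10_000, 20_000

def encode(modality, local_id):
    """Map a modality-local id into the shared vocabulary."""
    base = {"text": TEXT_BASE, "image": IMAGE_BASE, "action": ACTION_BASE}[modality]
    return base + local_id

def decode(token_id):
    """Recover (modality, local_id) from a shared-vocabulary id."""
    if token_id >= ACTION_BASE:
        return ("action", token_id - ACTION_BASE)
    if token_id >= IMAGE_BASE:
        return ("image", token_id - IMAGE_BASE)
    return ("text", token_id)

# An interleaved instruction: a few text ids, two image-patch ids, one action id.
sequence = ([encode("text", i) for i in (5, 17, 3)]
            + [encode("image", i) for i in (42, 7)]
            + [encode("action", 1)])
assert decode(sequence[-1]) == ("action", 1)
```

With everything flattened into one id stream, "perception", "generation", and "acting" all reduce to the same operation: predict the next id.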
The work published in Nature was completed in 2024 based on the initial version of Emu3. By 2025 we released Emu3.5.
With this new version, we made a deeper discovery: as model parameters, data, and compute scale increase, the model begins to exhibit emergent abilities to understand and predict the dynamics of the physical world—spatiotemporal relationships and causal logic. This suggests that scaling laws do not apply only to language. When extended to the real world, which is even more complex and governed by physical laws, the same path still works.
The Intellectual:
Although Emu3 shows the potential of the autoregressive approach in multimodal settings, this is still an experimental path. What is still missing for building a true “world model”?

Huang Tiejun:
Recently many people have argued that scaling laws are reaching their limits. I believe that is incorrect. What has reached its limit is simply language-related data.

People talk about “world models,” but what exactly is the “world”? For a robot, does recognizing the world mean entering a room without hitting the table, or grasping a cup with the right force? That is far from sufficient. The real world contains complex physical interactions. When you run into a wall, is it made of concrete or wood? If it is glass, can you pass through it? These kinds of physical and material properties are largely absent from today’s model training.
If we go deeper, interactions between atoms and molecules, or the hardness of concrete after it solidifies—are these not also part of the world? If they are, then scientific experimental data and molecular measurement data should also be used for training. Relying solely on the limited language and image data on the internet is not enough to support true general intelligence.
Even if we model everything humans currently know, we still have not exhausted the complexity of the world. The objective world is infinitely complex, and we can only continuously approach it. As long as that infinite complexity exists, and as long as we can introduce deeper scientific data, scaling laws will not reach an end.
The Intellectual:
BAAI has long supported scholars from both academia and industry. As an independent research institution, how does its work differ from research in universities or companies?

Huang Tiejun:
There are things that universities cannot do—not because they lack the capability, but because the necessary conditions are not in place. Building a systematic, operational project requires a team, funding, and time. In universities, professors can explore theoretical questions on their own, but developing a complete system requires first securing funding and assembling a team, which can take a long time. The pace of AI development simply does not wait for you to slowly go through the funding process. By the time you spend a year securing funding, the technological direction may already have shifted.

As for companies, they tend to be pragmatic. When a technical path has not yet been fully validated and remains more of a belief than a proven solution, companies are generally unwilling to invest heavily in trial and error. What businesses prefer is to take approaches that others have already tested and proven effective, and then quickly turn them into predictable products.
BAAI sits somewhere between universities and companies. We have relatively stable funding and teams. Once we reach a consensus that the autoregressive approach is the only universal path capable of handling all modalities, we simply move forward and build it. In engineering and technology, whether something works cannot be determined by persuasion—it has to be demonstrated through real results.
What we need to do is spend the time to build it. Once the path is proven to work, companies will naturally follow and invest much larger amounts of money to industrialize it.