A conversation between 5 founders of leading Chinese AI startups: GPT-4o, o1, FSD V12, scaling law, PMF and more
Are large language models hitting a plateau?
That question has been sparking heated debate in the global AI space lately. It’s a critical issue for rolling out AI applications and pushing towards the dream of Artificial General Intelligence (AGI).
OpenAI’s new o1 model has recently made waves, breathing new life into AGI’s development timeline. By merging reinforcement learning (RL) with large models, o1 takes AI's reasoning and problem-solving abilities to the next level.
Some see this as the biggest leap for large models since GPT-4’s 2023 debut. However, there’s a fair share of scepticism—some argue the progress isn’t as monumental as it’s being made out to be.
What’s certain is that this introduces new variables, leaving academia and industry players wondering: where do we go from here in the quest for AGI? And for AI startups, do the opportunities grow, or are they narrowing?
At the Apsara Conference (Alibaba Cloud's annual Yunqi Conference, 云栖大会), a few major voices joined the discussion. These included Zhang Peng (张鹏), creator of Founder Park, who moderated; Yang Zhilin (杨植麟), founder of Moonshot AI; Jiang Daxin (姜大昕), founder of Stepfun (阶跃星辰); and Zhu Jun (朱军), chief scientist of Shengshu AI (生数科技).
During this roundtable discussion, the participants focused on the following key topics:
What objectively happened in the AI field over the past two years?
What does the release of OpenAI’s o1 mean for the industry?
What new demands does the reinforcement learning paradigm behind o1 impose on computing power and data?
How should AI application-level entrepreneurship be approached today?
What is the development path for AI technology and applications over the next 18 months?
Jiang Daxin pointed out that GPT-4 helped demonstrate System 1—fast, instinctive thought processes—while o1 showcases System 2—deeper reasoning and thinking. This pushes AI’s potential even further.
Meanwhile, Yang Zhilin mentioned that interactions between different data types are getting more advanced, and overall AI progress is speeding up. He also emphasized that it’s crucial for AI startups to identify where general models like GPT-4 fall short and adapt. The introduction of o1 could shift how AI products are built, requiring entrepreneurs to find a new Product-Market Fit (PMF) that balances user experience with output quality.
Zhu Jun brought up the different stages of AGI development, known as L1 to L5. According to him, we’re still in the early L2 phase, but AI is advancing faster than many expected. He believes that an L4 breakthrough could happen in the next 18 months and argues that many predictions about AI’s future are overly conservative.
Full transcript of the roundtable:
Zhang Peng: Today, I’m really honored and excited that we have the opportunity to be here at the Yunqi Conference, and also to discuss the progress of model technology with several pioneers in the field of large models in China. Earlier, during Mr. Wu’s speech, I believe many people felt his strong confidence in large models and the development of AGI. He even explicitly pointed out that this is not just an extension of mobile internet but may actually signify a new transformation of the physical world.
Of course, in this session, I think we need to deconstruct his conclusions a bit. The first step is to examine the actual progress in model technology. What have been the advancements over the past 18 months, and what can we expect in the next 18?
Let's start by reflecting on the developments so far. It's been about 18 months since ChatGPT’s release, which brought global understanding of AGI to the forefront. What are your thoughts? Is the development of model technology accelerating or slowing down?
We’re typically the ones watching you all play the game, but today we’ve brought the players on stage to hear their thoughts.
How about we start with you, Daxin? Could you share your view on whether AGI development has accelerated or slowed down in the past 18 months?
Jiang Daxin: In the past 18 months, it feels like the pace has been accelerating rapidly. Looking back at the numerous significant events in AI, we can evaluate it from two dimensions: quantity and quality.
From a quantity perspective, almost every month we've seen new models, products, and applications emerge. For instance, OpenAI released Sora in February, which caused a stir, followed by GPT-4o in May, and then just last week, we saw o1.
From a quality standpoint, I would say three events left a deep impression on me:
The first significant event is the release of GPT-4o. This marked a major step forward in multi-modal integration. Prior to GPT-4o, there were separate models: GPT-4V for vision understanding, generation models such as DALL·E and Sora for visual content, and Whisper and Voice Engine for audio processing. GPT-4o brought these previously separate capabilities together in a unified framework.
Why is this integration crucial? Our physical world is inherently multi-modal, meaning it incorporates various types of data—visual, auditory, textual, and more. By merging these different modes, AI can better model and simulate the complexities of the physical world.
The second notable event is Tesla's release of FSD V12 (Full Self-Driving). This model stands out because it is an end-to-end large model that converts sensory input directly into control actions, representing a real-world application that bridges the digital and physical worlds. The success of FSD V12 not only demonstrates the potential of autonomous driving but also sets a direction for how future intelligent devices might interact with large models to explore and engage with the physical environment.
The third key event is OpenAI’s release of o1, which, for the first time, demonstrated that a language model can possess human-like System 2 (slow, deliberate reasoning) capabilities. This is a significant breakthrough because System 2 reasoning is a foundational requirement for summarizing and understanding the world.
We have always believed that the development path of AGI can be divided into three phases: simulating the world, exploring the world, and summarizing the world. Over the past few months, we have seen significant breakthroughs in each of these phases—GPT-4o, FSD V12, and o1 respectively represent advancements in these areas. More importantly, these developments provide a clear direction for future progress. Therefore, in terms of both quantity and quality, the progress made is truly remarkable.
Zhang Peng: We’ve seen significant breakthroughs and progress in the areas we were looking forward to. What’s your take, Zhilin? As someone deeply involved, your perspective is probably different from those observing from the outside.
Yang Zhilin: I believe AI is still in a phase of accelerated development, and we can look at this from two perspectives.
The first perspective is the vertical dimension, where the intelligence of AI continues to improve. This is mainly evident in how models respond and how well they perform in tasks like text processing.
The second perspective is the horizontal dimension, where AI is advancing beyond just text models. As you mentioned, Zhang, various other modalities are evolving as well. These modalities are broadening the capabilities of models, enabling them to perform more tasks. This horizontal growth complements the vertical improvement in AI intelligence.
From both dimensions, we've seen major progress. In the vertical dimension, for instance, AI’s intelligence has been consistently improving. Last year, AI still struggled with tasks such as image generation and math-contest problems, but this year it can score over 90% on some math benchmarks.
Coding capabilities have also significantly improved, with AI now outperforming many professional programmers. This advancement has also led to new applications, such as tools like Cursor, which allow users to write code directly through natural language commands. These developments are a result of the rapid progress in AI technology.
If you look at specific technical metrics, like the context length that language models can handle, last year, most models could only support 4-8K tokens. Today, 4-8K is considered very low, with 128K being the new standard. There are even models that can support 1M or even 10M tokens, which serves as a foundation for further improvement in AI intelligence.
Recent advancements haven't only been about scaling up models. While scaling will continue, many recent improvements have come from optimizing post-training algorithms and data. The shorter optimization cycles are accelerating the overall pace of AI development. For example, many of the recent breakthroughs in mathematics are largely due to these post-training advances.
Horizontally, we’ve also seen breakthroughs, with Sora being one of the most impactful in video generation. Many new products and technologies have emerged recently. Today, you can even generate realistic podcasts or two-person dialogues based on research papers, with the output being almost indistinguishable from real content.
This ability to translate, interact, and generate across different modalities is becoming increasingly sophisticated, and I believe AI is undeniably still accelerating.
Zhang Peng: It seems like these technologies are still accelerating in terms of the changes and innovations they bring. While we might not yet see the emergence of another “Super App,” if we step away from that mindset and focus more on the technology itself, we can better appreciate the real progress being made. This might be a more rational and objective perspective.
Professor Zhu Jun, how would you summarize the developments in AGI technology over the past 18 months? What milestones or key progress points stand out?
Zhu Jun: In the realm of AGI, the primary focus remains on large models. Both last year and this year, we’ve witnessed numerous significant changes. Overall, I agree that progress has been speeding up.
One aspect I’d like to add is that the pace of development is indeed accelerating. We’re seeing steeper learning curves. For example, language models started gaining attention around 2018, and Zhilin was among the earliest pioneers in this field. From then until now, we’ve seen about five or six years of rapid advancement.
Starting last year, in the first half, people were still primarily focused on language models. But in the second half of the year, attention shifted towards multi-modal systems—from understanding multiple modalities to generating them.
Looking back, the most significant developments have likely been in image and video generation, particularly in video. The release of Sora in February took many people by surprise, sparking debates about how non-public technology could push the boundaries so far.
Remarkably, it took the industry only about half a year to deliver usable products with solid temporal and spatial consistency. This half-year journey reflects the growing maturity of our technical understanding.
The acceleration, in my view, stems largely from how well-prepared we are in terms of technical knowledge and infrastructure. We now have better access to cloud infrastructure and computing resources, which wasn’t the case when ChatGPT first emerged. Back then, there was a learning curve as many struggled to adapt to the technology, and it took a long time for people to fully understand and embrace it.
Now, as we become more adept at handling these technologies, new problems are being tackled at an increasingly rapid pace. Of course, this speed varies across different sectors, and the extent to which these advancements reach end users also differs depending on the industry.
In broader terms, while some people may not yet perceive the extent of the change, from a purely technical standpoint, the development curve is becoming much steeper. Looking forward, I remain optimistic about even more advanced AGI developments, which could occur even faster than previously anticipated.
Zhang Peng: From your perspectives, if someone from outside says that AGI development is slowing down, I think the simple response would be, "What more do you want?" The past 18 months have been incredibly fast-paced for all of us. On that note, the recently released o1 model has created quite a buzz in the professional community, sparking rich discussions. Since we are all here, and each of you has your own insights on this, let's discuss it.
Daxin, how do you view o1? Many people believe this is a major leap forward in AGI development. How should we understand this stage?
Jiang Daxin: Indeed, I've seen differing views. Some believe it's highly significant, while others think it’s not such a big deal.
If you try o1, your first impression will likely be how impressive its reasoning abilities are. We tested numerous queries, and its reasoning clearly took a significant step forward.
In terms of its significance, I see two main points. First, it’s the first time a large language model (LLM) has demonstrated System 2 capabilities—slow, deliberate thinking, similar to human reasoning. Previously, models like GPT followed a paradigm of predicting the next token, which meant they were constrained to System 1—fast, instinctive thinking. However, o1 incorporates reinforcement learning (RL) into its training framework, enabling it to tap into System 2 abilities.
System 1 thinking is linear. While GPT-4 could break down complex problems into smaller steps, it still followed a straight path. In contrast, System 2 allows exploration of multiple pathways, self-reflection, error correction, and iterative trial-and-error until the right solution is found.
So, o1’s integration of imitation learning and reinforcement learning means the model now possesses both System 1 and System 2 thinking, which is a monumental breakthrough.
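The contrast between the two modes can be made concrete with a toy example (an illustration of the idea only, not how o1 actually works): a single greedy pass commits to its first answer, while a search procedure explores alternatives, verifies each candidate against the goal, and backtracks until one checks out.

```python
# Toy contrast between "System 1" (one greedy pass, no revision) and
# "System 2" (explore, verify, backtrack). Purely illustrative.
# Task: choose operators so that 6 ? 4 ? 2, evaluated left to right,
# equals a target value.

from itertools import product

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def evaluate(nums, ops):
    """Apply operators left to right: ((n0 op n1) op n2) ..."""
    acc = nums[0]
    for op, n in zip(ops, nums[1:]):
        acc = OPS[op](acc, n)
    return acc

def system1(nums, target):
    """Greedy single pass: commit to '+' everywhere, never revise."""
    ops = ["+"] * (len(nums) - 1)
    return ops, evaluate(nums, ops)

def system2(nums, target):
    """Search over operator choices, check each path, keep the first
    one that verifies against the target."""
    for ops in product(OPS, repeat=len(nums) - 1):
        if evaluate(nums, list(ops)) == target:
            return list(ops), target
    return None, None

print(system1([6, 4, 2], 20))  # -> (['+', '+'], 12): greedy answer is wrong
print(system2([6, 4, 2], 20))  # -> (['+', '*'], 20): search finds a valid path
```

Real systems replace the exhaustive loop with learned search guided by a reward model, but the structural difference, commit-once versus propose-verify-revise, is the same.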
Second, o1 introduces a new direction in scaling. It seeks to answer the question: how do we scale RL for general use? OpenAI isn’t the first to explore RL—DeepMind has used it for breakthroughs like AlphaGo and AlphaFold. However, previous RL applications were limited to specific tasks, such as playing Go or predicting protein structures. o1 has taken a huge step toward making RL more generalizable and scalable, which I consider a new paradigm in AI development.
Interestingly, o1 isn’t yet in a mature stage—it's just the beginning. But that’s what makes it exciting. It’s as if OpenAI is saying, “We’ve found a path with tremendous potential.” When you think through the methods behind it, you’ll believe this approach can indeed go further.
In terms of capabilities, o1 shows that LLMs can definitely exhibit System 2 reasoning. Technologically, it presents a new scaling paradigm, making it a significant milestone in AI.
Zhang Peng: It sounds like, despite some differing opinions, you’re very optimistic about its potential. What about you, Professor Zhu Jun? How would you evaluate o1 and its progress at this stage?
Zhu Jun: o1 represents a clear qualitative leap. In AGI development, we’ve categorized progress into levels from L1 to L5:
L1: Basic chatbot applications, like ChatGPT, which many are already familiar with.
L2: Advanced reasoning, capable of solving complex problems.
L3: Intelligent agents, which interact with and change the physical world.
L4: Innovators that can discover and create new knowledge.
L5: Organizers that can coordinate and optimize systems efficiently.
Each level has both narrow (task-specific) and general capabilities. In some specific tasks, o1 has already achieved L2-level reasoning akin to high-level human intelligence.
From a technical standpoint, o1 scales up what we’ve already been doing with reinforcement learning and integrates it into a large-scale base model. This is a breakthrough that has practical impacts across the industry.
Looking ahead, this will spark further exploration. I believe we’ll see rapid developments as researchers shift from narrow, task-specific capabilities to more general intelligence. The groundwork has been laid, and I expect o1 will push L2-level capabilities even further, and perhaps pave the way for higher levels of AGI.
Zhang Peng: You’ve set a high bar with your clear definition of AGI development stages at the L2 level. Of course, to reach Wu’s vision of embracing and transforming the physical world, we need to advance toward L3. That’s where things will become truly comprehensive and systematic.
Let’s turn to Zhilin. After the release of o1, Sam Altman enthusiastically called it a paradigm-shifting revolution. Sam is always good at delivering speeches, but how do you see it? Do you agree that it’s a paradigm shift?
Yang Zhilin: o1 is indeed significant, primarily because it raises the upper limits of AI. The core question is whether AI can improve productivity by 5-10% or lead to a 10x boost in GDP. The key issue is whether reinforcement learning can continue to scale up, and I believe o1 proves that it can push the boundaries of what’s possible with AI.
If we look back at the 70-80 year history of AI, the only consistent principle has been scaling—the idea that adding more computational power leads to breakthroughs. But before o1, many were researching reinforcement learning without a clear answer to whether it could continue to scale alongside pre-training and post-training in large language models. GPT-4 scaled predictably, but o1’s improvements aren’t as linear or certain.
Previously, people were concerned that the best quality data on the internet had already been used up. Even with more data, there seemed to be diminishing returns. The question was, how do you continue scaling? o1 goes a long way toward answering that question and proves it's at least feasible to continue. As more people get involved, there’s real potential for 10x productivity gains—a major breakthrough.
This shift will likely impact many industries, especially in creating new opportunities for startups. One key change is in the balance between training and inference. Training computation may continue to rise, but inference will likely grow faster, opening up new opportunities, especially for startups that can capitalize on this shift.
For companies with sufficient computational resources, this creates room for algorithmic innovations that could significantly improve base models. Even for those with fewer resources, post-training optimization in specialized domains allows for peak performance, driving the development of new products and solutions.
In short, o1 has expanded the imagination space for startups, especially in AI-related fields, by creating more potential opportunities and directions for growth.
Zhang Peng: This so-called paradigm shift revolves around resolving the question of what to scale next in the scaling law and how to expand capabilities. We’re seeing a new path emerge, and as you mentioned, the avenues for innovation and exploration are expanding, rather than being constrained or facing roadblocks.
Clearly, you’re all quite excited about the changes brought by o1, but I know everyone is also curious about one key question: Is this new paradigm leading to generalization? Right now, it excels in specific tasks, with very impressive improvements, but will it generalize across broader capabilities? Is there a clear path to this, or is it still uncertain?
Zhu Jun: That’s an excellent question, and it requires careful consideration. Typically, breakthroughs occur in specific tasks, and then we explore how to expand those breakthroughs to more general tasks and capabilities.
From an RL (reinforcement learning) perspective, we’ve seen progress in fields like AI for traffic systems, although it hasn’t fully solved the problem of generalization. Based on research and development so far, we can see where the technical path might lead.
In contrast, more open systems like ChatGPT still face challenges in generalizing beyond certain applications.
One significant challenge in RL is process supervision data collection. Unlike result-oriented supervision, where the final outcome is labeled, in process-oriented supervision, each step of thinking must be labeled, and that’s a lot more difficult. It requires expert annotators to provide high-quality labels.
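The difference in annotation burden can be sketched with two hypothetical record formats (the field names are illustrative, not any lab's actual schema): outcome supervision attaches a single label to the final answer, while process supervision needs an expert judgment on every intermediate step.

```python
# Illustrative data records contrasting outcome- and process-oriented
# supervision. The schema is a hypothetical sketch, not a real dataset.

# Outcome supervision: one label per complete solution.
outcome_example = {
    "question": "What is 12 * 15?",
    "answer": "170",
    "label": "incorrect",  # only the final result is judged
}

# Process supervision: every reasoning step is judged individually,
# which is where the expert annotation cost comes from.
process_example = {
    "question": "What is 12 * 15?",
    "steps": [
        {"text": "12 * 15 = 12 * 10 + 12 * 5", "label": "correct"},
        {"text": "12 * 10 = 120",              "label": "correct"},
        {"text": "12 * 5 = 50",                "label": "incorrect"},
        {"text": "120 + 50 = 170",             "label": "correct-given-premise"},
    ],
}

# Annotation cost grows with the number of steps, not just problems.
labels_needed_outcome = 1
labels_needed_process = len(process_example["steps"])
print(labels_needed_process)  # -> 4
```

Note how the process record localizes the error to the third step, information the outcome record cannot provide; that localization is exactly what makes the labels valuable, and expensive.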
Additionally, in cross-domain transfer, such as self-driving cars or other open-ended scenarios, the reward model becomes more complex. For well-defined tasks like theorem proving or coding, the reward function is straightforward, but for open-ended applications like autonomous driving or creative content generation, the standards for “good” or “bad” outcomes become ambiguous and highly subjective.
In such situations, several technical challenges arise, such as how to define reward models, how to collect high-quality data, and how to effectively scale these systems. However, we can already see early signs of success, and the direction of future exploration is becoming clearer.
Given our improved infrastructure and simulators, we might achieve cross-domain transfer faster than we did in the past. For example, while AlphaGo took years to apply its techniques to other fields, today, with better simulators and AGI methodologies, we can build environments faster and achieve progress more easily.
From my point of view, while we don’t yet have a clear generalization path, the potential for exploration is immense.
Zhang Peng: I want to follow up on that with Zhilin—it seems like you agree with this assessment. But from the perspective of an entrepreneur like you, is this uncertainty a good or bad thing? When you look at these developments, how do you feel about them and how do they influence your strategy moving forward?
Yang Zhilin: I actually see this as a great opportunity. There’s a new technical variable, a new dimension to explore. We’ve already made some investments in this space, and now it feels like we’re forming a new structure around it. Within that framework, I see plenty of new opportunities emerging.
On one hand, there’s the generalization issue that Professor Zhu mentioned. On the other hand, there are still fundamental technical problems that need solving.
For instance, we’re dealing with the scaling of both training and inference simultaneously, which presents unique challenges. Some aspects of these processes haven’t been fully explored yet, including the issues around process supervision. There’s also the risk of hallucinations affecting the model's performance, which requires attention and research.
But if we can address these challenges, we’ll likely push the current capabilities to a new level. As I mentioned earlier, fundamental innovations could provide us with the early breakthroughs we need, allowing us to move forward faster.
Zhang Peng: Uncertainty is actually a good thing, right? Having a clear direction but uncertain paths gives startups an edge. Otherwise, there wouldn’t be opportunities for startups.
Jiang Daxin, I’d like to shift back to something that Zhilin brought up. We often talk about the algorithm, computing power, and data triad as the key elements of AGI. This time, we’ve seen paradigm changes in the algorithm. How do you think these changes will impact computing power and data in this triangle?
Jiang Daxin: The relationship between algorithms, computing power, and data is still highly interconnected. RL has indeed introduced changes to the algorithmic aspect, but the impact on computing power can be viewed in two ways: one is certain, while the other remains uncertain.
First, what's certain is that, as both Professor Zhu and Zhilin mentioned, the demand for inference-time computing will skyrocket. This is what OpenAI highlighted in their blog post—inference-time scaling will become crucial.
At the same time, this will drive higher requirements for inference chips. o1 likely runs on H100 GPUs today, with each query potentially taking several seconds to process. If we want to speed up inference, chip performance will need to improve significantly.
The second certainty is that during RL training, the need for computational power won’t decrease—in fact, it will continue to grow non-linearly.
Why? Because during the RL phase, especially with self-play, data can be generated indefinitely. As mentioned earlier, self-play data can theoretically scale without limit. For instance, when OpenAI was training the model behind o1 (the project reportedly code-named “Strawberry”), they are said to have used tens of thousands of H100 GPUs over several months. Since o1 is still in its preview stage and training isn’t fully complete, the computational costs remain substantial.
If we aim for a generalizable inference model rather than one tailored to a specific use case, the computational cost for training will remain high.
An uncertain factor is whether we need to continue scaling the model parameters during self-play to achieve better inference paths.
Currently, there’s a common belief that, after reaching the trillion-parameter mark with GPT-4, scaling further yields diminishing returns. But if this method acts as an amplifier, potentially doubling the benefits, then the ROI (return on investment) could still be positive. This is something we need to verify further.
If this is proven true, the demand for computing power might return to a quadratic growth trajectory—where computational load equals parameter count multiplied by data volume. In my view, RL will drive increasing computational demands, both on the inference and training sides.
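The quadratic effect can be checked with the standard back-of-the-envelope rule from the scaling-law literature, FLOPs ≈ 6 × parameters × training tokens (the constant and the figures below are illustrative assumptions, not numbers cited by the panel):

```python
# Back-of-the-envelope training-compute estimate for a dense transformer.
# FLOPs ~= 6 * N * D is a common scaling-law heuristic; the parameter and
# token counts here are illustrative assumptions only.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs: ~6 floating-point ops per parameter
    per training token (forward + backward pass)."""
    return 6.0 * n_params * n_tokens

base = train_flops(1e12, 10e12)    # 1T params trained on 10T tokens
scaled = train_flops(2e12, 20e12)  # double both parameters and data

# Doubling BOTH axes quadruples compute: the "quadratic growth" above.
print(scaled / base)  # -> 4.0
```

The point of the sketch is simply that compute is a product of two quantities: if self-play keeps generating data while parameter counts also keep rising, the training bill grows multiplicatively, not additively.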
In terms of data, RL mainly relies on two types of data: small amounts of human-generated data and large amounts of self-play-generated machine data. While the quantity of data is important, quality is critical.
Therefore, how we design data generation algorithms, as well as the capabilities of the base model during self-play, will be key factors going forward.
Zhang Peng: I think we’ve dissected the paradigm shift brought by o1 quite well, but today all three of you are entrepreneurs, running your own companies and moving your teams forward. Let’s take it a step further and think ahead.
Professor Zhu Jun, with the recent technological advancements, do you see any concrete push towards turning these technologies into products or industry applications? Have there been any key stages or observations?
Zhu Jun: Right now, I believe large models, or what we call large-scale pre-training technology, represent a complete paradigm shift. This shift goes beyond just language to include multi-modal capabilities and spatial intelligence. The key lies in making intelligent agents interact with the world and learn in the process.
Zhang Peng: I’d like to ask Zhilin as well. Kimi has become quite the attention-grabbing product this year and is progressing well. How do you see these AI innovations impacting the product landscape? Could you walk us through your thought process on this?
Yang Zhilin: Great question. We’re still in the early stages of industry development, and during this phase, technology drives product development more directly. As a result, product development often revolves around maximizing the value of emerging technologies. So, we take stock of the latest advancements and reverse-engineer our product strategies accordingly.
There are several key points with current technological advancements:
First, there’s plenty of opportunity to explore new PMFs (product-market fits).
For instance, because System 2 thinking (slow, logical reasoning) introduces latency, this can be a negative experience for users who want immediate results. We need to find a balance between delayed user experience and better end outcomes.
Secondly, this new technology offers better output and can handle more complex tasks. So, finding the right PMF involves balancing quality improvement with the experience.
I also believe that product formats will evolve. Since System 2 thinking requires more time to complete tasks, AI products may shift from instant chat interfaces to something more like an assistant capable of taking longer, more deliberate actions. This shift opens up a world of new possibilities for product design.
Zhang Peng: Earlier, we discussed the changes o1 brings. However, AGI is also evolving in other areas, like embodied intelligence with autonomous driving and robots. Professor Zhu Jun, do you see any clear movement in these areas?
Zhu Jun: Yes, I do. The paradigm shift is visible in multi-modal learning, where AI agents can interact with and learn from the environment. Whether it’s autonomous driving or robotics, the core ability here is decision-making in a world full of unknown variables. This is a key component of intelligence.
All the advancements we’re seeing, whether it’s o1, video generation, or 3D modeling, ultimately point to two directions: 1) Consumer-facing digital content – products that look and feel natural, tell stories, and even enable interaction. 2) Real-world applications – boosting productivity, especially through robotics.
A great example is how pre-training models are enhancing robot capabilities. Take our work with quadruped robots, for instance. Previously, they required manual tuning for each environment. Now, with AI-generated synthetic data and large-scale training, we’ve developed strategies that allow these robots to adapt to various terrains with the same set of strategies, as if they’ve swapped in a new “brain.”
So, with L1 and L2 stages advancing, we’ll soon see robots progressing through the L3 phase, becoming more adept at planning, interacting, and completing complex tasks in real-world environments.
Zhang Peng: Now, as entrepreneurs, you’re all navigating the current industry landscape. Daxin, how has your mindset shifted in the past 18 months, especially with the launch of o1? Do you see more opportunities for innovation in the future?
Jiang Daxin: Looking at it from two perspectives, innovation with reinforcement learning (RL) is fundamentally different from the GPT paradigm. GPT has been about predicting the next token since GPT-1 launched in 2018. Other than the addition of mixture-of-experts (MoE) architectures, there hasn’t been much that’s fundamentally new in the GPT line.
However, o1 marks the beginning of a new era. Both Zhilin and Professor Zhu touched on key questions around RL—how do we combine it with large models and achieve generalization? These are uncharted territories, and I believe we’ll see rapid advancements in the near future.
For startups, this opens up many opportunities for innovation. At the same time, though, computing power presents a major challenge. On both the inference and training sides, the computing power required is enormous, especially when we’re striving for generalizable inference models.
We often joke internally, “Without GPUs, there’s no love. With GPUs, love becomes expensive.” But when your end goal is AGI, you have to stay the course, no matter the cost. As the scaling law continues to hold, fewer players will be able to compete, given the massive resources required.
Zhang Peng: Are resource barriers lowering at all, or are we still in a race to maximize computing resources? How are you managing and integrating resources at a large scale?
Jiang Daxin: There are two different approaches to innovation. One is building base models for AGI, which demands huge investments. The big players globally are investing tens of billions of dollars annually. The second approach focuses on applications.
With agents and reinforcement learning, the technology can now solve many of the problems we face today. As Zhilin said, the ceiling for AI capabilities has risen dramatically with o1, which means there’s still plenty of room for applications and innovation.
Zhang Peng: Zhilin, you’re working on consumer-facing products. I’ve recently heard investors focusing on metrics like DAU (daily active users) and retention rates when evaluating companies. If you weren’t an entrepreneur but rather an investor with a strong technical background in AI, what data would you look at to make an investment decision?
Yang Zhilin: First, I think metrics like DAU and retention are definitely important. But I’d break it down into several layers:
Value: Does the product solve a real need? This is fundamental and isn’t necessarily AI-specific.
Incremental Value: Beyond basic value, is the product offering incremental value over existing AI products? While I expect to see more general AI products like ChatGPT, there’s still room for opportunities outside of them. If your AI product can provide incremental value over something like ChatGPT, that’s a sign of promise. This incremental value could come from a different interaction model or a different resource allocation approach.
Market Growth: Lastly, the market for your product should expand as the technology evolves, not shrink. For example, demand for prompt engineering may diminish over time. By contrast, products with a proven PMF that haven’t yet reached a mainstream audience could grow as the underlying tech improves.
In summary: yes, data matters, but before looking at metrics, ensure the product logic is sound. If the logic holds and the data confirms it, you’ve got a strong investment opportunity.
Zhang Peng: What progress do you hope to see over the next 18 months?
Zhu Jun: Given how quickly AI is advancing, we tend to underestimate its progress. Within 18 months, I hope to see L3 AGI in action: world models, virtual-physical fusion, and decision-making capabilities all improving significantly. In certain scenarios, AI will move from being a Copilot to an Autopilot.
We’ll also likely see breakthroughs in L4, with AI advancing into scientific discovery and innovation. While these capabilities are still scattered, we need a system that can integrate them.
So, I’m optimistic that in the next 18 months, we’ll see significant advances, at least at L3, and perhaps even early signs of L4.
Zhang Peng: By the end of this year, do you have any updates or milestones you can share with us?
Zhu Jun: By the end of this year, we aim to provide a more efficient and controllable version of our video model. The goal is to allow users not only to animate a single sentence or image but also to tell a continuous story, maintaining consistency not just for characters but for objects and themes, with interaction capabilities as well.
Efficiency is key in managing the cost of computing power. If we want to serve a large user base, we need to reduce costs or it becomes unsustainable. Another priority is improving the user experience. Users want to express their creativity and need to interact with the system multiple times — to verify ideas and find inspiration. Our ultimate goal is real-time interaction, allowing users to experiment quickly.
When we reach this stage, I believe both user experience and user numbers will significantly improve. That’s our primary focus this year. Looking further ahead, over the next 18 months, we’ll likely enter the virtual-physical fusion space.
Zhang Peng: You’ve clearly set your goals for the next 3 and 18 months. Zhilin, what are your thoughts? You can talk about the next 18 months or even the upcoming three months.
Yang Zhilin: I think the most crucial near-term milestone is open-ended reinforcement learning: the system interacting with users in real environments, completing tasks and planning on its own. o1 has already shown that this direction has greater certainty, making it a key milestone on the path to AGI. It’s likely the last major problem to solve on that road.
Zhang Peng: So, the big question is: can we expect breakthroughs in the next 18 months?
Yang Zhilin: Absolutely. In AI, 18 months is a long time.
Zhang Peng: Indeed, looking back, the past 18 months have seen massive progress. Any updates you can share for the next three months?
Yang Zhilin: We’re focused on continuous innovation in product technology. Our goal is to become the world leader in at least one or two key areas. As for specific progress, we’ll share updates as they come.
Zhang Peng: You may not have revealed much, but I believe there are exciting developments ahead. Daxin, how do you view the next 18 months and the next three months?
Jiang Daxin: I’m eagerly anticipating further generalization of reinforcement learning. Another area I’ve long been excited about is the integration of visual understanding and generation.
In the text domain, GPT has already achieved integration between understanding and generation, but in the visual domain, it’s much more difficult. Currently, visual understanding and generation models are separate. Even with GPT-4o, while many modalities are integrated, it still can’t generate video, which remains an unresolved issue.
Why is this important? If we solve the integration of visual understanding and generation, we could build a truly multi-modal world model. This world model would enable the creation of long-form video content, addressing the current limitations of technologies like Sora.
Moreover, it could be combined with embodied intelligence to serve as the brain of intelligent agents, helping robots better explore the physical world. This is something I’m really looking forward to.
Zhang Peng: Before the end of the year, what progress can we look forward to from your side?
Jiang Daxin: I’m looking forward to both model advancements and bringing more and better user experiences through our products. For example, we have a product called LeapAsk, where users can experience our latest trillion-parameter MoE model. It’s not only strong in science but also in creative writing, frequently surprising users. LeapAsk also has a new feature called PhotoAsk, where users can take pictures and ask about things like food calories, pet moods, or historical artifacts.
With Meta glasses and Apple Intelligence launching this year, both emphasizing visual interaction, we’re reflecting those capabilities in LeapAsk as well, and we’re continuing to refine this feature.
Zhang Peng: We’ve gone slightly over time, but it feels like we’ve just scratched the surface. There’s so much more to dive into as AI continues its rapid advancement.
I’d like to thank all of you for sharing today, and thanks to everyone for listening!