DeepSeek-R1 has ignited a global frenzy in the AI community at an unexpected pace, yet there is a relative shortage of high-quality information about DeepSeek.
On January 26, 2025, a closed-door discussion on DeepSeek was held in China. Guests included dozens of top AI researchers, investors, and frontline AI practitioners who explored and studied DeepSeek’s technical details, organizational culture, and the short-, medium-, and long-term impacts following its breakout success.
With only limited information available, the meeting aimed to lift a corner of the veil on this “mysterious Eastern force.”
It is important to note that this discussion was a private technological exchange and does not represent the views or positions of any particular individual or institution.
As Silicon Valley venture capitalist Marc Andreessen commented on DeepSeek-R1:
“As open source, a profound gift to the world.”
Thus, following the same spirit of openness, all participants have decided to publicly share the collective thinking from this otherwise closed-door event.
Below is a summary of the meeting’s key points.
The Mysterious DeepSeek
“The most important thing for DeepSeek is to push intelligence forward.”
Founder and CEO Liang Wenfeng is the core figure at DeepSeek. Unlike Sam Altman, he has deep technical expertise.
DeepSeek’s strong reputation largely stems from being the first to release reproducible versions of MoE, o1, etc. Their advantage lies in early action, but whether they can reach the absolute top remains to be seen. There is still a lot of room for improvement. Moving forward, they face resource constraints and must allocate limited resources to the most crucial areas. Nonetheless, their research capability and team culture are quite good. If they had 100,000 or 200,000 GPUs, they might achieve even more.
From preview to formal release, DeepSeek’s long-context capabilities advanced very rapidly. DeepSeek’s 10K context length was achieved through quite standard methods.
Scale.ai’s CEO claimed DeepSeek has 50,000 GPUs, but in reality it definitely has fewer than that. Publicly available information suggests DeepSeek has around 10,000 older A100 cards and possibly 3,000 H800s obtained before the ban. DeepSeek places great emphasis on compliance, so they have purchased no non-compliant GPUs. Hence, their actual GPU count should be limited. In contrast, GPU usage in the US can be rather extravagant.
DeepSeek pours all its resources into a narrow focus, abandoning many other directions such as multimodality. It is not simply serving people but rather pursuing intelligence itself—likely a key reason for its success.
In a sense, quantitative trading could be considered DeepSeek’s “business model.” Huanfang (High-Flyer, the quantitative fund Liang Wenfeng founded earlier) was a product of an earlier wave of machine learning. The real priority for DeepSeek is pushing intelligence forward; financial returns and commercialization have lower priority. China needs leading AI labs to pursue breakthroughs that can beat OpenAI. The path toward intelligence remains long. In 2025 the field will diversify further, and new developments will surely emerge.
From a purely technical standpoint, DeepSeek, akin to a “Huangpu (Whampoa) Military Academy” of AI, greatly helps spread talent across the industry.
Even in the U.S., the AI lab business model is not particularly strong. Indeed, there are few solid commercial models in AI today, so solutions will likely appear later. Liang Wenfeng is ambitious: DeepSeek does not care about form; it is moving steadily toward AGI.
Reading the DeepSeek papers, one sees that many of their methods focus on cutting hardware costs. In key scaling directions, DeepSeek’s techniques can reduce overall expense.
In the long run, it may not drastically change the landscape of computing power, but in the short term everyone wants more efficient AI. Demand remains high, with most companies struggling to meet compute requirements.
On DeepSeek’s organization:
In investment, the typical strategy is to gather the very best talent, but observing DeepSeek’s approach—where a team of bright young minds from domestic universities gradually becomes “top-tier”—one wonders if simply poaching one or two of them would break their synergy. So far, it appears such a move might not greatly affect DeepSeek.
There is ample money in the market, but DeepSeek’s core is its organizational culture. This culture somewhat resembles ByteDance’s research culture—fundamental in orientation. Whether a culture is “good” depends on sufficient funding and a long-term outlook, which in turn requires a strong business model. Both ByteDance and DeepSeek have excellent business models.
Why has DeepSeek caught up so quickly?
Reasoning models demand higher-quality data and training. If they had to tackle long-text or multimodal data from scratch, it would be harder to catch up to a closed-source model. Yet the architecture for a purely reasoning-focused model has not dramatically changed. Reasoning is thus easier to chase.
R1’s rapid catch-up may be because the tasks themselves are not extremely difficult. RL (reinforcement learning) only helps the model refine its answers. R1 has not exceeded the efficiency of Consensus 32 (majority voting over 32 sampled answers): it used 32 times the resources, effectively replacing a parallel sampling process with a serial one. This doesn’t expand the intelligence boundary; it just makes reaching that level easier.
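For context, “Consensus 32” refers to self-consistency-style majority voting over 32 sampled answers. A minimal sketch of that baseline, with a hypothetical sample_answer stub standing in for a real model call:

```python
from collections import Counter
import random

def sample_answer(prompt: str) -> str:
    """Hypothetical stand-in for the final answer of one sampled completion."""
    return random.choice(["42", "42", "41"])  # toy distribution for illustration

def consensus_at_k(prompt: str, k: int = 32) -> str:
    """Majority vote over k independently sampled answers (the cons@k baseline)."""
    answers = [sample_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(consensus_at_k("What is 6 * 7?"))
```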
Explorer vs. Follower
“AI resembles a step function. Followers need 10x fewer resources.”
AI progresses in step functions: for followers, the required compute is an order of magnitude lower. Their costs remain relatively moderate, whereas explorers must train many models. Exploration will not cease, and plenty of people will invest in new directions such as productization. Beyond reasoning, numerous areas also demand large-scale GPU usage. Explorers spend heavily on GPUs, often out of sight, but without that significant investment, the next step might never arrive. Many feel existing architectures and RL methods are unsatisfactory and continue to push forward.
During exploration, having 10,000 GPUs does not necessarily guarantee better results than 1,000, but there is a threshold below which progress is implausible—with only 100 GPUs, the iteration cycle becomes too long to be practical.
Advancing physics, for instance, involves both university-based researchers (who freely explore without ROI constraints) and industry labs (focused on efficiency gains).
From explorer vs. follower perspectives, smaller companies must optimize efficiency due to limited GPUs, while larger firms focus on speed and are less concerned with specialized efficiency tricks that might be less stable at massive scale.
The CUDA ecosystem leads on the sheer variety of operators, while domestic companies like Huawei, as latecomers, target primarily the most commonly used operators. For a 100,000-GPU setup, balancing the cost of leading vs. following is no small matter. Meanwhile, China’s next avenue for catching up might be multimodality, especially since GPT-5 has not yet launched abroad.
Technical Detail 1: SFT
“At the reasoning level, SFT may no longer be necessary.”
The most surprising aspect DeepSeek has brought is not open source or low cost, but rather the possibility that SFT (Supervised Fine-Tuning) might not be needed for reasoning tasks. Tasks beyond reasoning may still require it. This raises questions about whether DeepSeek points to a new training paradigm or architecture that more efficiently leverages data and accelerates model iteration.
DeepSeek-R1 demonstrates that using SFT for distillation is highly effective. SFT is not entirely absent, though: in DeepSeek-R1’s third step only SFT is used, and the final alignment step still employs RLHF.
R1 is fundamentally SFT-trained. Uniquely, its training data was generated by a model that had undergone RLHF, showing that complex approaches are not strictly required—SFT distillation alone can work, provided the strategy is robust.
GRPO essentially relies on a sufficiently smart base model, employing 16 generations per prompt so it can attempt several times to find the correct answer. A decent base model plus verification is the R1 concept. This is particularly suited for math and coding tasks because they are easy to verify. Theoretically, a similar approach can extend to other tasks, culminating in a generalized RL model.
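A minimal sketch of the group-relative idea behind GRPO, under the assumptions stated above: 16 samples per prompt, a verifiable 0/1 reward, and advantages computed against the group’s own statistics rather than a learned critic. The verify function here is an illustrative stub, not DeepSeek’s implementation:

```python
import statistics

def verify(answer: str, reference: str) -> float:
    """Verifiable 0/1 reward: exact match against a known reference answer."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_relative_advantages(answers: list[str], reference: str) -> list[float]:
    """GRPO-style advantages: normalize each sample's reward by the group's
    mean and standard deviation, instead of training a separate value/critic model."""
    rewards = [verify(a, reference) for a in answers]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# 16 sampled answers for one prompt; only some reach the correct result.
group = ["42"] * 5 + ["41"] * 11
print(group_relative_advantages(group, reference="42"))
```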
R1-zero saw the emergence of Chain-of-Thought (CoT) without SFT, and the CoT grew progressively longer. This “emergence” is notable. SFT, in that sense, seems more like an accelerator. Even without it, the model can produce CoT; with it, the CoT capability matures faster.
This indicates that small-model startups can similarly do SFT distillation of larger models and achieve excellent results. However, that does not mean SFT has vanished from R1.
A single LLM with an unbounded CoT is theoretically akin to a Turing machine capable of solving extremely complex computational problems. CoT is the intermediate search result. One can keep sampling potential outputs until a correct one appears, then push the model toward more credible directions. Achieving this requires the model to perform internal computation; CoT is the necessary intermediate. The final correct output might be called “emergent,” but it is essentially the nature of the model as a computational system.
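A toy illustration of that “keep sampling until a correct output appears” loop; generate_cot is a hypothetical stub, and in practice the verified traces would be fed back (via SFT or RL) to push the model toward those more credible directions:

```python
import random

def generate_cot(prompt: str) -> tuple[str, str]:
    """Hypothetical stub: returns (chain_of_thought, final_answer) for one sample."""
    answer = random.choice(["42", "41", "40"])
    return f"reasoning steps for {prompt} ...", answer

def search_for_correct_trace(prompt: str, reference: str, max_samples: int = 64):
    """Keep sampling CoT traces until the final answer verifies, or give up."""
    for _ in range(max_samples):
        cot, answer = generate_cot(prompt)
        if answer == reference:
            return cot, answer  # a trace worth reinforcing or distilling
    return None

print(search_for_correct_trace("What is 6 * 7?", reference="42"))
```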
Although the DeepSeek paper does not explicitly discuss long context, between R1-preview and R1, there is a palpable improvement in context window. It is speculated they introduced some Long2Short CoT optimization. Possibly in the third-phase SFT they used CoT but later removed it at generation time. In the final release, they may have used a cleaner CoT dataset for SFT.
Types of SFT data vary: some is cold-start data providing the model with strong initial strategies so that RL exploration is more productive, while other data emerges from an RL-trained model, re-injected into the base model for SFT. Essentially, every domain has a data pipeline. The data’s strength is inherited from the base model, and distillation can be lossless. Combining different domains can lead to better generalization.
We are unsure about R1’s data efficiency. It is suspected OpenAI does something similar for fine-tuning. In R1’s third stage, they did not use the RL-trained model as the new base; rather, they generated data for SFT to produce R1. This dataset included 600K “reasoning” samples and 200K “non-reasoning” samples. Likely the second-phase model could also solve tasks beyond the official domain examples, producing reasoning data. The 200K non-reasoning set comes from the V3 SFT data. 800K total is small but evidently quite efficient.
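As a rough sketch of how such a mixture might be assembled (the field names and filtering step are assumptions, not DeepSeek’s published pipeline), reasoning candidates generated by the second-phase model are kept only if they verify, then combined with non-reasoning SFT samples:

```python
def build_stage3_sft_mix(reasoning_candidates, non_reasoning_samples, verify):
    """Keep only verified reasoning traces, then mix with non-reasoning SFT data.
    `reasoning_candidates` items are dicts like {"prompt", "cot", "answer", "reference"}."""
    reasoning = [
        {"prompt": c["prompt"], "response": c["cot"] + "\n" + c["answer"]}
        for c in reasoning_candidates
        if verify(c["answer"], c["reference"])
    ]
    # Reported proportions: roughly 600K reasoning vs. 200K non-reasoning samples.
    return reasoning + list(non_reasoning_samples)

demo = build_stage3_sft_mix(
    [{"prompt": "6*7?", "cot": "6*7=42", "answer": "42", "reference": "42"}],
    [{"prompt": "Say hi", "response": "Hello!"}],
    verify=lambda a, r: a == r,
)
print(len(demo), demo[0]["prompt"])
```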
Technical Detail 2: Data
“DeepSeek puts enormous emphasis on data labeling.”
Scale.AI will not necessarily fail in the short term. RL is still required in many domains, most commonly math and coding, and expert annotation remains essential even as data labeling grows more complex, so a market for it should remain.
For training, multimodal data so far has shown little proven effect or is prohibitively expensive. Currently there is no evidence of its clear benefit, though it might offer future potential.
DeepSeek is reputedly very serious about data labeling; word is that Liang Wenfeng personally participates in tagging. Beyond algorithms and tricks, precise data matters tremendously in AI. Tesla’s labeling costs are nearly 20 times those of Chinese autonomous driving firms. Chinese companies first tried large-scale general data, then refined it, only to discover the need for highly experienced drivers, which Tesla had focused on from the start. Tesla’s robotics data was labeled by people with exceptionally smooth motor control (steady “cerebellum” function), so its machines move exceptionally smoothly, whereas the Chinese-labeled counterparts were less smooth. DeepSeek’s heavy investment in labeling may explain its model’s higher efficiency.
Technical Detail 3: Distillation
“The downside of distillation is that it reduces model diversity.”
If one avoids tackling the main technical bottlenecks in model training by exclusively using distillation, one might miss crucial knowledge and fail to adapt when next-generation technologies arrive.
There is a capability mismatch between large and small models. Distilling from a large model to a small one is genuine teacher-to-student learning. If the teacher model is not proficient in Chinese, for instance, and you attempt to distill Chinese data from it, performance may degrade. In reality, however, small-model distillation does significantly raise performance. Once a distilled model undergoes RL, it can grow substantially—partly because it is being trained on data that originally mismatched its scale.
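A schematic of that teacher-to-student pipeline under the simplest interpretation: sequence-level distillation, where the student does SFT on teacher-generated responses and is then lifted further by RL on a verifiable reward. Every class and function here is a toy placeholder:

```python
class ToyStudent:
    """Minimal stand-in for a small student model (illustrative only)."""
    def __init__(self):
        self.memory = {}

    def fit(self, pairs):
        """SFT stand-in: imitate the teacher's responses."""
        self.memory.update(dict(pairs))

    def generate(self, prompt):
        return self.memory.get(prompt, "")

def distill_then_rl(teacher_generate, prompts, verify):
    """Sequence-level distillation: SFT on teacher outputs, then a (stubbed) RL signal."""
    student = ToyStudent()
    student.fit([(p, teacher_generate(p)) for p in prompts])      # distillation via SFT
    rewards = [verify(p, student.generate(p)) for p in prompts]   # verifiable reward for RL
    return student, rewards

student, rewards = distill_then_rl(
    teacher_generate=lambda p: "42" if "6*7" in p else "unknown",
    prompts=["What is 6*7?"],
    verify=lambda p, a: 1.0 if a == "42" else 0.0,
)
print(rewards)
```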
The clear drawback of distillation is diminished diversity, capping the model’s potential upper bound. In the near term, though, distillation remains viable.
Distillation can involve various hacks. Early RL was typically run on instruction-tuned models, which sometimes produce a series of irrelevant steps before suddenly generating the right answer—because RL hacks can be quite subtle, and the model might just be recalling memorized solutions from pretraining. This reveals the hidden risks of distillation. Without curated labels, a shift to RL with verifiable rewards (RLVR) might lead the model to adopt an even more trivial route rather than true reasoning. OpenAI has not fully solved this either—it’s a limitation of the current generation.
In the long term, if a team simply replicates another’s approach rather than devising its own architectural strategies, it could encounter unknown pitfalls. Given that we do not see a qualitative leap in “long context” yet, it may limit future performance. Potentially, R1-zero is a truer approach. Relying exclusively on existing solutions might be problematic; more varied exploration is desirable.
Other models can also achieve respectable results via distillation, so it is conceivable we will see teacher and student roles within future model ecosystems. Being a “good student” might itself be a profitable route.
In terms of distillation and broader technology trends, R1 is not as groundbreaking as AlphaGo, but in commercial or public impact, it has drawn more attention than AlphaGo did.
There are two distillation phases: if one only distills o1 or R1 (i.e., no home-grown system or verifiable reward), it leads to growing reliance on distillation alone. Yet truly general domains cannot be distilled—there is no suitable reward, and it is unclear how to obtain specialized CoT during distillation. Early-distilled models usually bear traces of their teacher (e.g., OpenAI’s “annealing”), and R1-zero’s reflective ability is closely tied to the base model’s prior training.
It’s hard to believe a model relying solely on Internet-scale data without “annealing” can exhibit these behaviors. The web lacks quality data at that depth.
Probably only a few top labs are systematically studying exactly how much annealing data is needed and in what ratio. Distillation is effectively a type of RL (with SFT representing behavioral imitation). However, SFT alone has a low performance ceiling and can harm diversity.
In venture circles, DeepSeek stirs excitement. If DeepSeek can maintain iteration momentum, smaller non-public firms will see great flexibility in adopting AI. They have also distilled smaller versions for mobile devices. If that direction proves viable, it greatly raises the ceiling for AI applications.
When employing distillation, the crucial question is the target. OpenAI does no data distillation; to surpass OpenAI, you certainly cannot rely exclusively on distillation.
In the future, we might need models to “jump” steps in their reasoning. With fixed context length, can the upper limit of model performance be pushed further?
Technical Detail 4: Process Reward
“Human supervision limits process-based reward; final-results supervision determines the model’s ceiling.”
Process Reward is not necessarily useless; however, it can be reward-hacked. A model might not genuinely learn but still score highly on the metric. In math problems, if the model produces 1,000 solution attempts and none are correct, RLVR is of no help. A decent process reward might guide it in a better direction, depending on the task’s difficulty and the reliability of the reward system.
If the process-based reward diverges from genuine correctness, it becomes easily hackable. Theoretically, process supervision could succeed if we manage reward assignment carefully. Currently, final-answer matching is the main approach. No one has a mature method to let the model self-evaluate without potential hacking, and letting the model iterate on its own is easiest to exploit. While enumerating all the process steps isn’t hard, it just has not been widely done. It may be promising.
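To make the contrast concrete, here is a toy pair of reward functions under obvious assumptions: an outcome reward that only checks the final answer (sparse but hard to hack) and a naive process reward that scores intermediate steps with a shallow heuristic (denser but easy to game):

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Result supervision: reward only if the final answer matches."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def naive_process_reward(steps: list[str]) -> float:
    """Toy process supervision: score each step by a superficial heuristic.
    A model can 'hack' this by producing many plausible-looking steps
    without ever reaching a correct result."""
    looks_ok = sum(1 for s in steps if "=" in s)
    return looks_ok / max(len(steps), 1)

steps = ["6*7 = 40", "40 + 1 = 41"]   # wrong but well-formatted
print(outcome_reward("41", "42"))      # 0.0 -- not hackable this way
print(naive_process_reward(steps))     # 1.0 -- rewarded despite being wrong
```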
The upper bound of process supervision is limited by humans, who may not imagine all possibilities. Result supervision truly governs the potential ceiling for the model.
AlphaZero thrives because the outcome (win or lose) is unambiguous and can be converted into a reward. In LLMs, the model cannot always tell if it should stop generating or if the final answer is correct. It’s a bit like a genetic algorithm. Though the potential ceiling may be higher, there is also a risk of failing to converge.
Moving from AlphaGo to AlphaZero was easier partly because Go’s rules are fixed. In LLMs, math and coding tasks are favored for RL because they are simpler to verify. But in less clearly defined domains, the model might follow the reward rules yet fail to produce the desired outcome.
Why Haven’t Other Companies Adopted DeepSeek’s Methods?
“Large companies keep a low profile.”
Perhaps OpenAI and Anthropic have not pursued DeepSeek-like approaches because they find other directions more valuable for their current compute resources.
In contrast to large tech companies, DeepSeek is not focusing on multimodality but purely on language, allowing them to deliver. Big firms also have strong capabilities but must stay low-profile and avoid revealing everything. Meanwhile, multimodality is not critical to boosting “intelligence” right now.
2025’s Technological Divergence and Bets
“Besides Transformers, can we discover new architectures?”
In 2025, models will further diverge. The most compelling goal is pushing intelligence boundaries, and there could be multiple breakthrough paths. Techniques might change, such as synthetic data or entirely new architectures.
In 2025, watch for research in architectures beyond Transformers that lower costs while pushing intelligence frontiers. Meanwhile, the full potential of RL has yet to be realized. Product-wise, the spotlight will be on “agents,” though at present they have limited large-scale application.
By 2025, multimodality may yield products that challenge ChatGPT’s current form.
R1 and V3’s approach—emphasizing lower costs while achieving high effectiveness—proves one viable path. This does not conflict with the alternative path of investing heavily in hardware and scaling parameters. Due to external constraints, China may be more inclined toward the former.
(1) DeepSeek still largely adheres to scaling laws given it started from a capable base model. (2) From a distillation perspective, DeepSeek’s “big-then-small” process actually benefits large closed-source models. (3) We still see no “anti-scaling” metrics that might refute scaling laws. If such metrics arise, they could strike at the heart of the scaling narrative. Also, anything open-sourced by DeepSeek could be replicated by closed-source models to reduce costs—potentially beneficial to them as well.
Reportedly, Meta is currently reproducing DeepSeek’s work. So far, it has not caused major changes to Meta’s infrastructure or long-term roadmap. In the longer view, cost will always matter in addition to pushing the frontiers of intelligence.
Will Developers Shift from Closed-Source Models to DeepSeek?
“Not so far.”
We have not yet seen major developer migrations away from closed-source to DeepSeek because the leading closed-source models provide strong coding instruction compliance. However, it is uncertain whether that advantage can be overcome in the future.
From a developer’s standpoint, Claude-3.5-Sonnet specifically trained for tool use is attractive for building agents. DeepSeek does not yet provide that, though it opens significant possibilities.
For users of large models, DeepSeek V2 already satisfies most requirements. R1’s faster speeds do not deliver drastically more value. In fact, for advanced reasoning tasks, it sometimes gets more wrong answers than before.
In practice, many adopt an engineering approach to simplify their problems. The year 2025 may mark widespread enterprise adoption with current capabilities, hitting a ceiling only later when existing intelligence levels prove insufficient.
RL currently handles tasks with clear-cut answers, not going beyond what AlphaZero did—arguably less so, as we rely heavily on solutions with known answers. Distillation can rapidly break new ground in such tasks.
Humanity’s need for intelligence is vastly underestimated—curing cancer, designing next-gen spacecraft heat shields, and so on remain unresolved. Current tasks focus on automation. The potential for significant growth in intelligence remains enormous; we are not close to a limit.
OpenAI’s “Stargate 500B” Narrative and Evolving Compute Demand
DeepSeek’s emergence has prompted questions about NVIDIA and OpenAI’s “500B” narrative. It remains unclear how OpenAI will secure training resources, and some believe the 500B story could be a lifeline for them.
There is suspicion around OpenAI’s 500B plan given that it is a commercial company, and massive debt could be risky.
500B is a daunting figure, likely spread over 4–5 years. The main partners are SoftBank (for funding) and OpenAI (for technology). But SoftBank may not have that much liquid capital; it might leverage assets instead. OpenAI itself is not flush with cash. Other participants mostly contribute technology rather than funding. Hence fully realizing 500B would be challenging.
Nonetheless, the 500B claim is not without logic: exploration is extremely expensive in both labor and capital, especially when the path is unclear. The route from o1 to R1 was not trivial. At least when replicating another approach, you know roughly what the end looks like, so you can skip certain steps. True pioneers pay the highest costs. If Google or Anthropic succeed first, they might become the new leaders.
Anthropic might in the future move all inference to TPU or AWS chips.
Chinese companies were once constrained by compute, but DeepSeek proves that large leaps in efficiency are possible. Perhaps we will see more efficient models that do not rely on enormous GPU clusters, or a shift toward alternative chips (AMD GPUs, ASICs). From an investment standpoint, while NVIDIA has a robust moat, demand for ASICs may grow.
DeepSeek’s success has limited direct bearing on GPU hardware, but it does highlight China’s engineering prowess. NVIDIA’s real risk would be if AI technology becomes standardized, akin to electricity, and many specialized ASIC chips emerge. While AI is in rapid expansion, NVIDIA’s ecosystem advantage still holds.
Impact on the Public Markets
“Short-term sentiment pressure, long-term narrative endures.”
DeepSeek has had a notable impact on the U.S. AI sector, potentially affecting stock prices. The need for pretraining might slow, while post-training and inference scaling have not risen enough to fill the gap. This leaves a short-term narrative void that could affect market activity.
DeepSeek mostly uses FP8 whereas the U.S. standard is FP16. DeepSeek’s claim to fame is efficient resource usage. Following its big splash last Friday, Mark Zuckerberg raised Meta’s capex forecast, but NVIDIA and TSMC both fell, and only Broadcom rose.
In the near term, DeepSeek might negatively influence share prices and valuations for compute-related or even energy companies. But the long-term AI story remains strong.
Market participants worry that NVIDIA’s transition from H-series to B-series GPUs may create a short-lived gap, and DeepSeek’s developments could add pressure. This might weigh on shares in the short term, but longer term, AI is still in its infancy. If CUDA remains the leading choice, hardware growth potential remains vast.
Open Source vs. Closed Source
“If performance is comparable, it’s a big challenge to closed-source.”
DeepSeek’s significance lies largely in the ongoing debate between open-source and closed-source approaches.
It could drive OpenAI and others to be more secretive with their top models. Currently, the best models are not fully public. If DeepSeek is willing to open its top model, it could pressure other AI firms to eventually reveal theirs.
DeepSeek invests heavily in cost optimization, whereas Amazon and others seem to keep following their own roadmaps. Ultimately, open-source and closed-source can coexist. Universities and smaller labs will prefer open-source. Cloud vendors support both, so the ecosystem is not drastically altered. DeepSeek currently lacks the advanced tool-usage and AI safety efforts that Anthropic has, so if it aims for long-term credibility in Western markets, it must address those aspects.
Open source serves as a margin check on the industry. If open-source approaches attain 95% of closed-source performance at a lower cost, they become very compelling. If performance is close, closed-source providers face considerable pressure.
DeepSeek’s Breakout Impact
“Vision matters more than technology.”
DeepSeek’s emergence has shown the world that China’s AI is formidable. Many assumed China trailed the U.S. by two years; DeepSeek indicates the gap is more like 3–9 months—or even less in some areas.
Historically, whenever the U.S. restricted technology exports to China and China eventually overcame the barrier, it triggered fierce competition. AI may follow a similar pattern. DeepSeek’s breakthrough is evidence of this.
DeepSeek did not appear overnight. Its R1 result is impressive and has caught the attention of U.S. tech circles at every level.
DeepSeek stands on the shoulders of giants, but front-line innovation remains time- and labor-intensive. R1’s success does not imply a simultaneous drop in costs for every other project.
Leading-edge explorers will continue to need significant computing. China, acting as a follower in some respects, can excel in engineering. With fewer GPUs, Chinese teams can still achieve breakthroughs, building resilience and potentially surpassing expectations. This may shape future Sino-U.S. AI dynamics.
Currently, China is largely replicating known solutions. For instance, reasoning was introduced in OpenAI’s o1. Each AI lab’s next goal is identifying the subsequent reasoning method. Possibly “infinite-length” reasoning is the next vision.
The core difference across AI labs is each lab’s vision rather than pure technology.
Ultimately, vision outweighs technology.
Source: https://mp.weixin.qq.com/s/a7C5NjHbMGh2CLYk1bhfYw