by Noumenal Labs
The tl;dr
A single monolithic generalist artificial intelligence (AI) model trained on publicly available data is easily dethroned. A business model that is based upon fitting a single AI model to enormous public datasets is brittle and risky.
Minor algorithmic tweaks can easily be incorporated into the architecture of contemporary AI models, and these can be developed and used by any party. Even a slight tweak to existing methods can destroy a company’s competitive edge.
The two main competitive advantages of the big players in AI, namely the extreme cost of training (as a barrier to entry) and closed source architectures, no longer ensure market superiority.
A new approach is needed, one that leads away from monolithic generalist models and large language models (LLMs): The future of AI is the rapid development of specialist models that are optimized for proprietary datasets.
Noumenal Labs is pioneering an approach to the development of AI systems that offers a sounder path to commercialization.
We are designing a new family of specialist energy-based AI models, as well as a model development and optimization platform that:
(1) rapidly incorporates the latest tricks and architectures without retraining from scratch,
(2) combines, composes, and makes use of multiple domain-specific models and data to inform its decision making, and
(3) makes full use of proprietary data in service of specific goals.
This new approach has lasting financial value.
General context
What is the public narrative around DeepSeek? On January 20th 2025, DeepSeek, a new artificial intelligence (AI) company focused on developing open-source large language models (LLMs), disrupted the AI industry — some would say profoundly so.
In the rapidly developing AI space, companies frequently claim to have “disrupted the industry”, and most such releases are hardly newsworthy. But the markets have clearly responded strongly to news of DeepSeek’s breakthrough: the leading American AI companies have lost hundreds of billions of USD in value on the stock market.
So, why was this model release so impactful?
The reason is that the newest LLM from DeepSeek, called DeepSeek-R1, has exposed deep cracks in the business model that is being pursued by the biggest players in the American AI industry. Arguably, they have lost their competitive edge, which was premised on controlling access to models that are extremely expensive to train. DeepSeek-R1 has seemingly undermined this premise.
Preliminary results suggest that DeepSeek-R1 performs similarly to leading models, such as market leader OpenAI's GPT-4o and o1, matching their functionality and their performance on a host of benchmarks. By now, it is a truism that standard state of the art LLMs are very costly to train — with direct costs stated at around 10M USD, and indirect costs estimated to be on the order of 100M USD.
Shocking the current AI monoculture, DeepSeek-R1 was, at least according to its developers, trained for a small fraction of the cost of state of the art LLMs in the US market: a comparatively minuscule 6M USD. Moreover, DeepSeek-R1 is entirely open source, meaning that anyone, including domestic and international competitors, can build on top of the initial success of R1 without restriction, via fine tuning or knowledge distillation.
As a result, it is now clear that the AI arms race is no longer limited to players with large institutional backing.
This is the narrative being spun in the industry. But does it hold up to scrutiny?
Is the tech overhyped?
DeepSeek-R1 is clearly quite technically impressive. It implements a few clever tricks, allowing it to work effectively on older hardware.
But DeepSeek did not develop all of the technology that went into R1, which means the headline figure likely obscures much of the cost that went into R&D. DeepSeek-R1 is firmly grounded in open source methods and in past research and engineering achievements. R1’s achievements rest squarely on “the shoulders of giants”, as it were: DeepSeek benefitted massively from the open source ecosystem and from what others in the field had already achieved.
The two most discussed technical aspects of the new model are the mixture of experts (MoE) approach and the latent attention mechanism. Neither of these are new.
Using a MoE approach means that, rather than training one monolithic model, multiple LLMs can be trained in parallel, each specialized for a given domain. This parallelizes the costly process of fine tuning, providing some of the core benefits of fine tuning while simultaneously creating multiple fine-tuned models. It also means that each expert sees only a subset of the data, preventing cross-contamination from topic-irrelevant data sources that are better handled by a different expert. The MoE approach is also computationally frugal, in that it effectively turns off parts of the model when they are not contributing to inference.
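To make the routing idea concrete, here is a minimal sketch of a sparsely gated mixture-of-experts layer in PyTorch. The names (SimpleMoE, num_experts, top_k) are illustrative placeholders: DeepSeek’s actual MoE layers use far more experts, shared experts, and load-balancing machinery.

```python
# Minimal sparsely gated mixture-of-experts layer (illustrative only).
# Each token is routed to its top-k experts; the remaining experts stay idle,
# which is what makes the approach computationally frugal.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])             # (num_tokens, d_model)
        scores = self.router(tokens)                    # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens assigned to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(tokens[mask])
        return out.reshape(x.shape)
```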
That said, while the implementation is impressive, the MoE approach is older than most AI researchers. Much like the transformer architecture itself, MoE is not a new research innovation. What is impressive is the engineering feat of combining LLMs and MoEs into a workable product.
Latent attention is also not fundamentally new. The KV cache (short for key and value cache) that it optimizes has been part of standard transformer inference from the start: it allows an AI system to retain (or cache) previously generated key and value matrices rather than recompute them at every step. Latent attention compresses this cache into a lower-dimensional latent space, saving significantly on memory and therefore on compute cost. As with the MoE approach, however, it is important to note that this is not a fundamental innovation.
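As a rough illustration of the principle (not DeepSeek’s actual multi-head latent attention, which is considerably more involved), the sketch below caches only a small latent vector per token and expands it back into keys and values on the fly. The class name LowRankKVAttention and all of the dimensions are hypothetical.

```python
# Illustrative sketch of low-rank KV-cache compression: cache a small latent
# per token and reconstruct keys/values from it at each decoding step.
# Single-head and simplified; not DeepSeek's multi-head latent attention.
import torch
import torch.nn as nn


class LowRankKVAttention(nn.Module):
    def __init__(self, d_model: int = 1024, d_latent: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)    # compress each token to a latent
        self.k_up = nn.Linear(d_latent, d_model)       # expand latent -> keys
        self.v_up = nn.Linear(d_latent, d_model)       # expand latent -> values
        self.cache = []                                # holds d_latent floats per token, not 2 * d_model

    def step(self, x_t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, d_model), the hidden state for one decoding step
        self.cache.append(self.kv_down(x_t))           # only the compressed latent is cached
        latents = torch.stack(self.cache, dim=1)       # (batch, seq_so_far, d_latent)
        q = self.q_proj(x_t).unsqueeze(1)              # (batch, 1, d_model)
        k, v = self.k_up(latents), self.v_up(latents)  # keys/values reconstructed on the fly
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                   # (batch, d_model)
```

The memory saving comes from caching d_latent numbers per token instead of two full d_model vectors; the trade-off is the extra up-projection at every step.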
DeepSeek also acknowledges reliance on knowledge distillation, a technique used to replicate the functionality of a large model with a smaller model.
But it has been known for a long time that modern deep neural networks trained via gradient descent must be over-parameterized during training; it is only after training that they can be pruned to reduce computational cost. Heavy reliance on knowledge distillation suggests that DeepSeek’s achievement should not be compared to that of companies that trained their models from scratch (de novo). Rather, DeepSeek’s approach is better interpreted as a form of unsupervised, parallel fine tuning of multiple models. While this is quite an accomplishment, it is not a fair comparison to de novo learning of an entire model. Indeed, this supports the notion that training the DeepSeek model de novo would not have reached the reported levels of performance, and that this performance was only possible because the starting point was a pretrained model. It is worth noting that the precise role of knowledge distillation in this work, and how exactly it was implemented, remains difficult to decipher. It seems likely that the KV cache compression was achieved via a low rank approximation of the KV computation extracted from the pretrained model on which DeepSeek was based, which again suggests that de novo learning with the low rank KV cache might not reach the same level of performance.
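For readers unfamiliar with the mechanics, the sketch below shows a standard soft-target distillation loss: the student is trained to match the teacher’s temperature-softened output distribution alongside the usual label loss. This is the textbook recipe, not a reconstruction of DeepSeek’s actual pipeline.

```python
# Generic soft-target knowledge distillation loss (Hinton-style), not DeepSeek's recipe.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL divergence between temperature-scaled teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```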
Finally, of note, this was not the first lightweight LLM. Meta’s LLaMA showed some time ago that functionality similar to ChatGPT’s could be achieved by a much smaller model; it has been known for a long time that this kind of compression is possible.
Is the market impact exaggerated?
All this calls into question the rationality of the market response to what the DeepSeek team has accomplished.
We agree with commentators like Siavash Alamouti and Martin Vechez that the economic advantage of DeepSeek is probably being overstated, and that the market tanking so spectacularly is an overreaction. As Alamouti points out, the direct training cost for OpenAI’s new o3 model was reportedly around 10M USD, which is not such a far cry from DeepSeek’s 6M USD. In fact, that figure (approximately half the cost) makes sense in light of one of the critical technical features of DeepSeek-R1: its KV cache compression uses latent attention, which roughly halves the compute and memory costs of that part of the model.
In our view, DeepSeek’s claim of 6M USD in direct training costs doesn’t tell the full story, in particular because it obscures the true total cost of training. Much of the gain in performance in R1 was likely the result of massive amounts of data curation, the cost of which is unknown. We agree with Vechez, who points out that a better approximation of the total cost of training R1 is probably in the 50-100M USD range, which puts it roughly on the same order of magnitude as OpenAI’s o1 model.
We also believe that it is unlikely that this will hurt hardware companies and compute providers like NVIDIA. An often overlooked aspect of increasing efficiency is that if you can do more with less, you can also do even more with the same. So even if architectures like DeepSeek-R1’s do lead to lower overall compute cost, there is little to no reason to think that anyone will do anything other than squeeze more compute out of each dollar, rather than spend less on compute.
But ultimately, these arguments are beside the point: this turn of events demonstrates that, in the AI space, a single company can take an action that completely undermines the business model adopted by the big LLM companies.
Is this the end of “scale is all you need”?
The success of the approach adopted by DeepSeek demonstrates the folly of relying on brute force scaling as a means of enhancing performance. Meaningful innovation requires novel architectures, but even a slight tweak to existing methods, like low rank attention, can destroy a company’s competitive edge. This is one key takeaway from the DeepSeek moment.
Another key takeaway is that a single jack-of-all-trades model is easily outperformed by a MoE, and that the future of AI will be built from the conjunction of small(er) domain-specific expert models and an expert identification algorithm that selects the right model for the task at hand. This is an explicit repudiation of the contemporary approach to building AI agents by simply taking a single massive model and prompting it to “be an expert in X”.
Perhaps the most crucial takeaway for the market is that a company built around a single model or algorithm is a risky investment. The market has learned this the hard way. If that model is stolen, replicated, or slightly improved upon, the whole business model falls apart. Spending a significant amount of money fitting a generalist model to publicly available data is therefore a recipe for obsolescence.
While this may come as an unwelcome surprise to the AI market, many of us saw this shift coming and have been building alternatives that are real solutions. The most promising path forward is to forgo generalist, monolithic, closed source model architectures.
Our approach at Noumenal Labs
Noumenal Labs is proposing an approach to AI and its commercialization that directly addresses the issues that we discussed in this blog post. Our approach overcomes technical shortcomings in state of the art architectures and the associated business model, allowing us to enhance existing technologies — while offering a genuine alternative to the AI monoculture.
We are developing families of energy-based models that are specialized domain experts: models designed to be fit for purpose within specific domains and able to discover the objects that populate a world model in an unsupervised, data driven way. Critically, these models cannot be mere word models. To design and build superintelligent machines that think like us and understand us and each other the way we do, we need to build machines that are endowed with the ability to represent discrete things in the world, and the abstract concepts distilled from the behavior of those objects, not just words. Cognitive science has shown that words are a poor representation of our thoughts, and our thoughts are themselves only a reflection of reality. Language is therefore at best two degrees removed from reality.
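For readers unfamiliar with the term, the sketch below shows a generic energy-based model: a network that assigns a scalar energy to each input, implicitly defining an unnormalized probability density, together with a single Langevin sampling step. This is a textbook illustration only, not a description of Noumenal Labs’ architecture, and every name and dimension in it is made up.

```python
# Generic energy-based model: low energy corresponds to high (unnormalized) probability,
# p(x) proportional to exp(-E(x)). Textbook sketch, not Noumenal Labs' architecture.
import torch
import torch.nn as nn


class EnergyModel(nn.Module):
    def __init__(self, d_in: int = 16, d_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, 1),
        )

    def energy(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)                 # scalar energy per input

    def sample_step(self, x: torch.Tensor, step_size: float = 0.01) -> torch.Tensor:
        # One step of Langevin dynamics: move downhill in energy, plus noise.
        x = x.clone().requires_grad_(True)
        grad = torch.autograd.grad(self.energy(x).sum(), x)[0]
        return (x - step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(x)).detach()
```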
An ideal machine intelligence models the world as it is, using the most powerful language available for object-centered causal reasoning. Similarly, an ideal agent would behave like a scientist: it must abstract or distill a world of objects from its sensory data and do so in a way that leverages what science has told us about how the world works. It must act in a way that tests hypotheses and continually learns from its interactions with its environment. This grounds the agent in the same domain in which our knowledge is grounded: the physical world. Language models grounded in imprecise descriptions of reality will never get us the kind of superintelligence that actually augments our understanding.
To achieve this vision of the future of AI, we are building an entire pipeline for model development, along with tools that enable a community of developers and researchers to use their proprietary data to build specialist models that develop a scientific understanding of that data. We combine these deep innovations with the best tricks in the current state of the art (such as MoE and knowledge distillation) and with the energy-based architectures that underwrite the most sophisticated models of reality used in contemporary science.
Moreover, inspired both by how the brain works and by how physical systems such as neuronal networks self-organize, we are also developing novel architectures that allow these models to cooperate with and learn from one another, enabling a true collective intelligence.
We believe that this research and development will yield intelligent systems that not only empower people but also enhance our understanding of the complex world in which we live. Additionally, this focus on collaborative agent design will result in AI systems that explicitly align with our intentions and values, because they directly model our beliefs and values. The result is agents that understand the world in the same way that we do and that work with us toward the common goal of a safe, sustainable society of human and artificial agents.
And we’re hiring! If you have the skills and interests to join us, please consider applying!
We look forward to sharing more about our approach to design, research, and development in the coming weeks.