December 28th, 2024

It’s that time of the year… the travelling, the endless family lunches/dinners, the social pressure to stop working - or, if you keep working, to absolutely not talk about it to anyone. There is no point even trying to do any meaningful work. But you can find many 10-minute windows to reflect on the year that has elapsed. And start thinking about next year.

It’s always useful to write this down because it provides a snapshot of your opinions. You can use it later to assess how much your positions and beliefs have evolved. Or how much the goal posts have moved. There is nothing wrong with this, on the contrary: if you’re not comfortable with the idea, you’re bound to be stuck in the past. Submitting yourself to a prior-reality check is healthy, and the prerequisite is to write things down. I wrote notes on tech and AI last year; not everything aged well, but these reference points provided valuable insights this year in terms of what did NOT happen. So let’s do it again this year.

Local LLMs | Other Modalities | Agents | Evaluations | Frontier AI | New Paradigms | Integrated Systems

Local LLMs

It’s hard to write about the progress of local LLMs because the goal posts have moved a lot in the past 12 months. Local models have definitely caught up with and exceeded the closed models of 2023 - at least those that were shared with the public. Yet there's still a clear gap between them and today's frontier models (OpenAI, Anthropic and Google). One can argue that frontier LLMs are not optimal tools for specific tasks but, if what you need is a Swiss Army knife for knowledge work, nothing gets close to the closed models. I am voluntarily omitting Llama-405B and Deepseek V3 here because they are not “truly local” models. Apart from these two, anyone claiming that open-source models match frontier models is misrepresenting reality. But that doesn’t mean there is nothing to be excited about with local LLMs.

I am convinced that local AI is the way forward in the long term, maybe not for consumer products but very likely for business solutions. We are just not there yet. I’ll park for now the question of evals (if you can’t wait, jump here) and only comment on my own experience of the last 12 months.

I’ve used them a lot. They are getting better and better and they are, without a doubt, better than the original ChatGPT that 'took the world by storm' two years ago. They can be very useful for automating simple tasks, and I have always had 2 or 3 go-to local models for my experiments. Interestingly, the turnover has been very high:

  • In January, for my semi-agentic knowledge base, the sub-10B-parameter models from Mistral and Meta (Llama) were my preferred choices for my MacBook, and Mixtral for my Mac Studio
  • In June, for nardai, I mainly used the most recent Llama and Gemma 2 models (in both cases, the version in the ~10B category)
  • At the end of the year, the Llama and Qwen2.5 models were my go-to options for all size ranges from 7B to 70B, and I’d add Gemma 2 in the 20-30B range. I can’t really explain why but I like that size bracket, which is only covered by the Qwen2.5 and Gemma 2 models.

While anecdotal, my case perfectly illustrates how short the life-cycle of these models is. They rarely remain relevant for more than 6 months. Data provided by Ramp reinforces my intuition that the ~6-month life-cycle applies to closed models too. In fact, LLMs are one of the shortest-lived products I know of, relative to the cost of developing them.

These models simply cannot be considered assets on frontier labs’ balance sheets given how quickly they depreciate. I’d be curious to learn about the accounting practices regarding the development costs of these models to understand how deep the hole really is. That's not an immediate concern though: it only matters when no one is willing to pour cash into the hole and try to fill it. And these days, there is always someone. In a sense, some of these start-ups are already too big to fail given the geopolitical implications of being perceived as a leader in AI.

Going back to the ACTUALLY open AI, it’s worth highlighting that local models only concern a very small fraction of the total addressable market: people who can code AND can afford the hardware. I, for one, have the perfect mix of wealth and neuroticism leading me to spend 5x as long launching a local model as opening a website or an app on my phone to access a private API. The old me would use this fact to question the actual size of the open-source market; the new me would just observe that local models are still far from accessible: there is a cost barrier, and there is a UX barrier. The barrier stands high between me and Llama-405B or, more recently, Deepseek V3[1], which are both in frontier territory.

So, at the end of 2024, despite being reasonably knowledgeable about the technical details of local LLMs, paying for my Claude subscription is a no-brainer. I pay for OpenAI too. And I pay for GitHub Copilot. And I use Gemini for free. I probably won’t be paying for all of this at the end of 2025, but I am almost certain that I’ll pay for at least one subscription to a frontier model. If you had asked me last year, I would have said zero.

Other modalities

More surprising: at the end of 2024, I am still paying a subscription for Midjourney. The image-generation landscape seems more open than the LLM landscape, and no one is too big to fail there: a glance at last year’s blogs is sufficient to get a sense of the number of casualties in image-gen. But new players always pop up and fill the void. Flux (Black Forest Labs) replaced the Stable Diffusion models as my go-to local image-generation solution… but Midjourney is always available on my phone via Discord, and I have never felt that I was wasting money (I'm not even sure how much) to retain the ability to generate images on my phone at any time. For context, I don’t generate more than 50 images per month. Everything points to the fact that I will still be paying for image generation at the end of 2025, and Midjourney is the best place to get my money… unless we FINALLY get true multimodality in 2025.

Given the expectations at the end of 2023, the lack of a true SOTA multimodal model is a major disappointment - and by 'true' I mean SOTA across all modalities. DALL-E is not a serious option; my assumption is that OpenAI deliberately sandbagged their model to avoid legal issues. So everyone will be looking at Google and xAI in 2025. I’ve not tested the Mistral solution - which I suspect is good, but I never felt compelled to try it. Maybe a sign that I'm not their target market. It's a bittersweet realization, given how highly I regarded them in 2023. Mistral is like the ex you still secretly love, but both parties have silently agreed to keep their distance.

At the end of last year, I was also convinced that, by the end of this year, we would generate videos like we generate images. When OpenAI presented Sora in February, everything looked ahead of schedule. But virtually nothing happened afterwards: no game-changing step-up in capabilities or accessibility. Kling emerged as a serious option for video generation, better than all existing players and better than Sora based on what I’ve seen (not a first-hand assessment). However, Google presented their own solution, Veo 2, which looks significantly ahead of the competition. I don’t really have any use for video generation, but I am curious to see whether, in 2025, this modality will be adopted like image generation was in 2023. That said, as a European, I am not sure I’ll ever have access to these tools.

Speaking of the European situation: all labs seem to have different reasons not to serve the continent. Passive lobbying, i.e. frustrating the users so THEY do the lobbying, is certainly part of the story for all of them, but there is more to it. The AI Act is unlikely to be the culprit given that its implementation will take years[2]. However, GDPR is in force and should trigger headaches for certain modalities. This struck me earlier this month when I was specifically asked NOT to use Gemini to produce a transcript (video-to-text). Gemini handles this extremely well, but the participants didn’t want their voices or images to be stored by Google and used for training purposes. It was a fair request in my opinion - and, by law, they have the right to ask for it - but I could not guarantee it. So I used Whisper locally instead. It was significantly more work for me, but the local model made everyone else happy. That’s also why I believe that local AI will be important in the future.
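
For what it’s worth, the local route is only a few lines of code. A minimal sketch, assuming the open-source openai-whisper package and ffmpeg are installed (the file name and model size are placeholders):

```python
# Local transcription: nothing leaves the machine, which was the whole point.
# Assumes `pip install openai-whisper` and ffmpeg available on the PATH.
import whisper

model = whisper.load_model("medium")      # pick the size/accuracy trade-off you can afford
result = model.transcribe("meeting.mp4")  # whisper extracts the audio track via ffmpeg

for segment in result["segments"]:
    print(f'[{segment["start"]:7.1f}s] {segment["text"]}')
```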

Agents

Where to start? I’m excited about agents and I want to see the glass half-full. But my experiences building agents - whatever ‘agents’ really means - have left me skeptical. I believe that most solutions will never successfully reach production because they are not stable enough. It’s not a technical hurdle, but a communication one. Many developers seem to ignore decades of established communication science. They're reinventing wheels that have been rolling smoothly in other fields for years. So my sentiment about agent companies at the end of 2024 is best summarized by the image below.

Looking ahead to 2025, I expect my sentiment to remain unchanged, though I'd love to be proven wrong. This is one area where moving goal posts could actually unlock tremendous value. I’m just not very optimistic about it.

Evaluations

Read: Artificial Intelligence: what everyone can agree on, an article on questionable commercial practices in the industry

2024 has seen the evaluation landscape deteriorate significantly. Test datasets are knowingly or unknowingly incorporated into training datasets; irrelevant benchmarks are presented as meaningful metrics; some players are simply making things up. When I wrote about this in September, the subsequent Reflection evals fiasco proved the point within hours. Everyone is complicit, everyone is guilty. No one's hands are clean in this game.

It is not just about the data. LLM-as-a-judge is problematic from a scientific standpoint, and humans-as-judges can be gamed relatively easily, as evidenced by the Elo scores in the lmsys arena. I’m not sure there is any other industry where Goodhart's law[3] is more apparent. It makes sense after all: learning to reach a target without truly understanding it is the whole point of deep learning.

There is a silver lining to the eval cloud: ARC and Frontier Math are highly visible, well-funded initiatives that try to push things forward. And at the other end of the spectrum, anonymous, down-to-earth users maintain lists of benchmarks that are still out of reach for LLMs. Addressing these blind spots is much more important than adding 1% to already-saturated benchmarks. But it won’t be enough.

There is an even more fundamental blind spot in evaluation strategies. Benchmarks/evals tend to focus on single-turn question answering. I believe that this is a critically flawed approach:

  • it frames helpfulness as performing better than humans on certain tasks and implicitly values AI models for their potential as swap-in replacements for humans - which they are objectively not in practice today, i.e. outside lab settings where we can assume unlimited resources and zero opportunity cost. Evals basically measure an AI's potential to do better than humans on a binary scale (true/false or better/worse), often with very questionable averaging methods.
  • it misses the bulk of the value that LLMs bring today, which essentially lies in their ability to assist a human through multiple iterations of a project. The back-and-forth of multi-turn conversations contributes to enhancing users' abilities, and evals should probably focus more on that dimension. If you use local coding assistants, such as Qwen-2.5-Coder-32B, which matches closed models on single-turn generations, you know exactly what I mean (see the sketch right after this list). It’s particularly visible for coding tasks because the context window fills up very quickly, but I believe it is true across the board. Unlike most users, I don’t believe that context-window extension will fundamentally change RAG, but I am convinced that it can significantly improve performance in multi-turn conversations. And a lot of value is hidden there.
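
To make the multi-turn point concrete, here is a minimal sketch of that usage pattern, assuming a local OpenAI-compatible endpoint (e.g. Ollama or a llama.cpp server) serving a Qwen2.5-Coder model; the URL and model name are illustrative:

```python
# A multi-turn coding session against a local model: the value - and the cost -
# is in the accumulated back-and-forth, which single-turn benchmarks never see.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
messages = [{"role": "system", "content": "You are a coding assistant."}]

def turn(user_msg: str) -> str:
    """One back-and-forth iteration; the whole history is resent every time."""
    messages.append({"role": "user", "content": user_msg})
    reply = client.chat.completions.create(
        model="qwen2.5-coder:32b",
        messages=messages,
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    # Rough proxy for context usage: the history grows with every turn, which is
    # why long coding sessions degrade well before single-turn scores would predict.
    approx_tokens = sum(len(m["content"]) for m in messages) // 4
    print(f"~{approx_tokens} tokens of context after {len(messages) // 2} turns")
    return reply

turn("Write a function that parses a CSV of trades into a dataframe.")
turn("Now add error handling for malformed rows without changing the column types.")
```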

For now, this industry is in its infancy: anyone can make any claim and get away with it because the information asymmetry between users and providers is extremely high. But this is no long-term strategy for start-ups. Deployed AI solutions will necessarily be thoroughly evaluated AI solutions. In terms of sub-segments of the AI industry, the evaluation space is probably the safest bet you can make right now as an investor.

Frontier AI

Read: the Not_too_fast blog of October 2024.

One of the worst bets you can make is to invest in the [large] foundation model pretraining space, as explained in the linked article.

This segment was always destined for Massive Tech dominance (Microsoft, Amazon, Meta, Google, Alibaba,...) but, for various reasons, the incumbents have been more reactive than proactive and left room for new frontier AI labs to emerge (e.g. OpenAI and Anthropic in the West). Don't be fooled: the real war of attrition hasn't even begun. To stay relevant, the young multibillion-dollar start-ups will need to strike deals, make compromises and … commit acts of betrayal.

The war metaphor is certainly appropriate when it comes to Google: they have turned their ship around and the acceleration became notable around April-May. It’s a VERY big ship that has accumulated a lot of momentum in the past 6 months. It simply cannot be stopped; standing in its way seems extremely risky.

As a matter of fact, I’ve used the Google API for Gemini more than I’ve used the OpenAI API in H2 2024. I don’t think I have sent a single API request to OpenAI since April. At least not directly: I use GitHub Copilot in VS Code, mostly with Sonnet, but I sometimes send simple requests to GPT-4o, either because I am on my MacBook or because I’m too lazy on my Mac Studio to switch between windows - I personally consider Qwen2.5-Coder-32B equivalent to GPT-4o for single-turn generation.

It’s an opportunity to highlight the great job Microsoft has done with GitHub Copilot and VS Code: another very-big-ship story is unfolding there, and Cursor is in a very uncomfortable spot right now…

One of the big ships still flying under the radar is xAI. They have compute, they have engineers, they have money, they have influence. In other words, they have momentum.

The barriers to entry in the frontier foundation model segment are incredibly high, arguably too high for any new entrant. But there are already enough existing players. In 2025, it will be harder to name the top-3 players than it was in 2024, and the focus should gradually shift from barriers to entry (building capital moats) to barriers to [customer] switching. There again, the Massive Tech players have a huge advantage because they can integrate their AI solutions into existing products. Anthropic may feel a bit lonely then - there is nothing surprising in seeing them getting closer to Microsoft[4].

New Paradigms

The focus so far has been on 2024's unfulfilled expectations, but there were some genuine surprises - and some of them are truly exciting.

Test-time scaling, or 'reasoners' as the models using this technique are often called, tops my list. Downplaying their significance would be disingenuous, but I must admit that I have struggled to find use cases where OpenAI's o1, currently the best of its kind, really shines. I just need to find a way to integrate it into my workflow: reasoners don’t work well in multi-turn conversations, but this is how I prefer working with LLMs. My assumption at this stage is that it’s either a bias on my part, which doesn’t matter in the grand scheme of things, or a broader UX issue that will be fixed in the future. Either way, I believe that 2025 will be about test-time scaling.

The promise of test-time scaling is particularly exciting because it offers a path around what seemed like an insurmountable wall for frontier labs. Despite what they say now, there WAS a wall. As the path forward has cleared, we are presented with yet another situation where the glass can be viewed as either half-full or half-empty.

Half-full: test-time scaling could revolutionize everything, including for open-source. Especially for open-source.

There are many use cases where reasoning traces constitute a valid mitigant to the black-box effect of LLMs. It’s obvious for doctors, but I can give another example that I discussed recently with European lawyers and judges: in common law (the Anglo-Saxon framework), applying AI is closer to RAG, whilst in civil law (the system in continental Europe), AI would need to rely on more conceptual reasoning. Assessing the validity of a model output requires auditing the reasoning, not just checking whether a precedent actually existed. That does not mean there is no jurisprudence you can use in civil law, but models specifically reinforced to produce auditable reasoning traces would be philosophically closer to the notion of Justice in Europe.

Putting aside domain-specific applications, I am also excited about the opportunity to directly affect the reasoning at the sampling stage, i.e. to intervene to steer the model or to take any other action. For this reason, adaptive sampling is another innovation that I found promising in 2024. ‘Adaptive sampling’ may not be the appropriate term for what I want to describe, because I use it as an umbrella covering entropix-style approaches and adaptive temperature, but also extending to very different types of interventions that switch back and forth between search or symbolic AI and reasoning. I have the intuition that test-time scaling bodes well for true neuro-symbolic AI, which should lead to much more robust and efficient AI. I’ll try to confirm or refute this intuition in 2025.
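
To make the idea more tangible, here is a toy sketch of entropy-aware sampling - a deliberate simplification of entropix-style logic, not its actual implementation; the thresholds are arbitrary and the hand-off to search or symbolic tooling is only hinted at in a comment:

```python
# Adapt the sampling decision to the model's own uncertainty at each step.
import torch
import torch.nn.functional as F

def adaptive_sample(logits: torch.Tensor, base_temp: float = 0.7) -> int:
    """Pick the next token using the entropy of the predictive distribution."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum().item()

    if entropy < 0.5:
        # Confident model: greedy decoding, no need to inject noise.
        return int(torch.argmax(logits))
    if entropy > 4.0:
        # Very uncertain model: this is where an entropix-style system might branch,
        # re-sample several candidates, or hand control to search / symbolic tooling
        # instead of guessing. Here we merely sharpen the distribution as a stand-in.
        return int(torch.multinomial(F.softmax(logits / 0.3, dim=-1), 1).item())
    # In-between: plain temperature sampling.
    return int(torch.multinomial(F.softmax(logits / base_temp, dim=-1), 1).item())
```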

Half-empty: the assumption of unlimited resources

Test-time scaling is promising in the sense that it offers a way around the [train-time] scaling wall. But it is by no means a scalable solution today. As of today, the public can use o1 (OpenAI), r1 (Deepseek) and QwQ (Qwen), none of which really meets the definition of a breakthrough in terms of performance. And we have seen very little information about how o3, OpenAI’s next-generation reasoner, completely destroyed the hardest benchmarks. For all we know, o3 generated millions and millions of tokens to get to a satisfactory answer, and the information was presented in such a creative way that it wasn’t immediately obvious that the cost was materially higher than what can be considered affordable, let alone cheaper than smart humans.

I won’t make the mistake of discarding test-time scaling because it is stupidly expensive compared to humans today. My grandfather used to tell me about the first computer in his company, an IBM 360-20 with 4K of memory, in 1966. The computer would process 8 punch cards and perform around 5,700 additions per second. It would literally not fit in my home today. Less than 60 years later, it is the norm to have a device in your pocket with 10,000 times more processing power and memory while connecting to virtually unlimited cloud resources. For reasoners too, cost will come down and capabilities will improve. Then all we will remember from the o3-ARC chart is that, in 2024, the average MTurker would get ~75% on this test, and the human intelligence that gets ~100% costs 3x-4x more. That is THE insight of the chart. The rest is marketing, and it is precisely why o3 is stupidly expensive.

The glass is half-empty because OpenAI burnt a stupid amount of money on Frontier Math and ARC as a PR stunt. And the whole point of PR is to send a message. So what is the message? Who is the target?

The message is that, with unlimited resources, you can achieve incredible things. It is true. This message can only resonate with two types of parties:

  • potential employees who want to achieve incredible things by working for the company with the most resources
  • potential clients who want to achieve incredible things and do not care about the cost.

It is very hard to see how individual users, or even SMBs, fit into the strategy here. That’s the black pill of the o3 announcement. This technology, if used - a big if, by the way - will be limited to clients who literally print money. And one in particular.

Integrated systems

Read: the Not_too_fast blog of November 2023.

The increasing integration between frontier AI players and their home governments was predictable - I've written about it multiple times, including in the linked article. So 2024's developments weren't particularly surprising. I should have placed bets on a “Manhattan project of AI” at the US federal level this year involving Palantir, amongst others.

There has been no deviation whatsoever from the anticipated trajectory. Yet I’ve been caught off guard numerous times this year by a feeling of being swallowed by a system. An odd sensation of being more controlled than in control.

I appreciate that not everyone will share that view and that it is a personal bias. Let’s take a step back to lay it out: I speak highly of Chinese models (Qwen, Deepseek…) but I will never seriously use these models behind an API. I have a hard stance on Chinese models: locally or nothing. I won't even allow remote code execution when using Hugging Face's transformers. I don’t know what would happen if I did, and I don’t want to know; I skipped the first generation of Qwen models for that reason. I want to be in total control of what happens to my prompt. I want to know if it is modified - if something is appended or if a part is deleted. I want to know if it goes through a classifier and, if so, what happens next. I want to know if the data is stored by the provider. And, of course, I want to be equally in control of the output.
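
In practice, that stance boils down to something like the sketch below, assuming the weights have already been downloaded and audited locally; the model name is only illustrative:

```python
# "Locally or nothing" with Hugging Face transformers: load the weights locally
# and refuse any repository-provided code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # illustrative; any locally audited checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=False,   # never execute code shipped with the checkpoint
    local_files_only=True,     # only use weights already downloaded and audited
)

# From here on, nothing leaves the machine: prompt, sampling and output stay local.
inputs = tokenizer("What happens to this prompt?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```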

Westerners typically nod along when I express these concerns about Chinese models. But then I ask how different the US players are. Does Anthropic modify your prompt in the API? Does OpenAI classify your prompt to check whether you are trying to access o1’s reasoning traces? And if you are flagged by the OpenAI classifier, does it trigger an automated process without any human supervision? When frontier labs collect voice and video samples, what happens to that data? Would you react differently if these practices came from Booz Allen Hamilton rather than Google or OpenAI?

This might sound like personal paranoia - or maybe just a European cultural bias - but it’s how I’ve always felt about AI, even at the time of classical ML. 2024 has only made this worse. I entertain the naive idea that 2025 will be different, but I don’t actually believe it. For me, real change would require open-source models under 50B parameters - maybe 30B - to truly match frontier capabilities. It's possible in the long term but unlikely next year. The beatings will continue until morale improves…


[1] This one is too recent to even comment on but, based on preliminary analysis - not first-hand - it is potentially a very big deal. Not for its performance but for the limited cost of its training run (~$6m as per their report), which would mandate an update of many beliefs in the AI industry, from businesses' economic models, to regulations based on compute, to US trade sanctions.
[2] The fact that the best resource to get information on the AI Act comes from a lobbying group and not from the original source says a lot about the European regulatory system: it cannot even explain in intelligible terms what it does, and lets lobbyists deliver the interpretations that top the Google search results. In that instance, nonetheless, the content is good for the implementation timeline of the AI Act.
[3] Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
[4] Sonnet is now supported by GitHub Copilot and there is this.