December 17th, 2023

2023 is not over yet, but everyone can already agree that it has been an eventful year on the tech front. The origin of the craze is a single product, ChatGPT, released approximately a year ago. It was just a preview of what would come next from OpenAI (GPT-4 was almost ready when ChatGPT was released), but it triggered a massive hype-wave that many LLM companies continue to surf a year later. Ironically, there was no major research breakthrough in the run-up to ChatGPT; ChatGPT was simply undeniable proof that, through exceptional execution around finetuning and reinforcement learning from human feedback, LLMs could constitute helpful assistants. Billions of dollars were raised on that premise in 2023. Millions were spent on lobbying and marketing to shape the future of the AI "industry". And thousands of models have been released since the leak of Meta's model parameters in March. As of December 2023, it is not clear whether there is a single demonstrated application of LLMs at scale inside a software stack (i.e. not prompting by individual users), but there seems to be a good line of sight on that as function calling improves.

So what should we expect for 2024? Progress in natural language processing will undoubtedly continue - maybe with new architectures (e.g. without transformers, see Mamba or StripedHyena) - but as far as the public is concerned, 2024 should be the year of truly multimodal AI. So, in this last blog post of the year, we focus on applications that are not purely text-to-text. This may look a bit messy (especially on desktop - apologies in advance) but it should help you set your expectations for 2024… and beyond.

Multimodality is nothing new. As a matter of fact, speech-to-text has been around for a while and it is now customary to download the transcript of Teams/Zoom meetings. Even outside video-call apps, the technology is accessible, either via API or by running models of various sizes on consumer devices. The go-to solution has long been OpenAI's Whisper models, but Meta recently launched a competing solution. For users who just want to play around, we would also recommend the smaller and faster models by Hugging Face, which are distilled versions of Whisper (English only for now).
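As a minimal sketch of what "running it on a consumer device" looks like, here is local transcription with one of those distilled Whisper checkpoints through the transformers library. The model name and audio file are assumptions for illustration; pick whatever checkpoint fits your hardware.

```python
# Minimal local transcription sketch (model name is an assumption).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",  # English-only distilled Whisper
    chunk_length_s=30,                       # process long recordings in 30s chunks
)

result = asr("meeting_recording.wav")        # any local audio file
print(result["text"])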

Before ChatGPT was released, text-to-image models had also started to break out, notably with Stable Diffusion and DALL-E. DALL-E mini (now known as Craiyon) probably contributed more to public awareness than DALL-E itself, as DALL-E was never truly opened up by OpenAI owing to safety concerns. The Stable Diffusion models, in contrast, were open-sourced in the summer of 2022 and gained traction in tech-savvy circles. But running these models is far from trivial, which remained a major obstacle to mainstream adoption until Midjourney came up with a turnkey solution. Text-to-image is here to stay; it will just get better, faster and likely cheaper. How fast? Check out this recent demo from Stability AI to get an idea.
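To make this concrete, here is a minimal text-to-image sketch with the diffusers library. The "stabilityai/sdxl-turbo" checkpoint is an assumption on our side (it is the few-step model behind the recent real-time demos); any Stable Diffusion checkpoint works with the same pipeline.

```python
# Minimal text-to-image sketch with diffusers (checkpoint is an assumption).
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    prompt="an isometric illustration of a tiny futuristic city, soft lighting",
    num_inference_steps=1,   # turbo models are designed for 1-4 steps
    guidance_scale=0.0,      # sdxl-turbo is trained without classifier-free guidance
).images[0]
image.save("city.png")
```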

Inference speed may not be the main hurdle, though. The bigger frustration can be not being able to express in plain language what you want to depict. Have you ever felt that a sketch would be a better way to prompt? Well, more than one party has built a solution for that: Adobe showcased it in June, and it has now become widely accessible through latent consistency models. You can run them locally, or you can use SaaS offerings. Check out the selection below; the Krea demos are amazing.
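Running this locally is within reach too. Below is a rough sketch-to-image example using a latent consistency LoRA so that generation is fast enough for interactive editing. The base model and LoRA names ("runwayml/stable-diffusion-v1-5", "latent-consistency/lcm-lora-sdv1-5") are assumptions, not the stack used by the products mentioned above.

```python
# Sketch-to-image with a latent consistency LoRA (model names are assumptions).
import torch
from diffusers import AutoPipelineForImage2Image, LCMScheduler
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

sketch = load_image("my_rough_sketch.png")   # your hand-drawn input
image = pipe(
    prompt="a watercolor landscape with mountains and a lake",
    image=sketch,
    strength=0.6,            # how far to move away from the sketch
    num_inference_steps=4,   # LCM needs only a handful of steps
    guidance_scale=1.0,
).images[0]
image.save("painted.png")
```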

Source: freepik.com

With these solutions, we are very much in the image-to-image paradigm. The main use-case is not so much generating a new image as editing an existing one. The ability to segment anything, to remove parts of a picture, to add new items and, more generally, to magnify any picture is going to fundamentally change the way we use illustrations. How can we be so sure? Because these models are not kept secret by big AI labs: both Meta and Google DeepMind develop powerful models and release some of the building blocks for third parties to develop their own apps. It would be an exaggeration to say that anyone can build the next Photoshop, but there are already plenty of promising challengers out there… and many more will come.
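For instance, here is a minimal sketch of point-based segmentation with Meta's segment-anything package, one of those released building blocks. The checkpoint path and click coordinates are assumptions; download the weights from the official repository first.

```python
# Point-based segmentation with segment-anything (checkpoint path is an assumption).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click roughly on the object we want to cut out
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]          # keep the highest-scoring mask
cv2.imwrite("mask.png", best_mask.astype(np.uint8) * 255)
```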

EMU demo (Meta)
Imagen 2 (Google DeepMind)

Given what happened in the second half of 2023 in terms of video generation using diffusion models, you had better be ready for a paradigm shift in video similar to the one image generation has gone through over the last 15 months. As with images, the story starts with basic text-to-video generation: today, multiple options exist, although public solutions are limited to clips of a few seconds. The first two videos below are official videos from Runway and Stability AI, so the showcased examples may have been cherry-picked. However, looking at published research on the topic, such as the recent W.A.L.T, there is little doubt about where the tech will be just 12 months from now.

W.A.L.T text-to-video examples

And as with images, solutions are starting to emerge to generate videos from a seed image rather than from text. These are not just API-only solutions from established players in this nascent industry; there are actual open-weight options, often coming from Asia.
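To give an idea of what image-to-video looks like in practice, here is a minimal sketch using the open-weight Stable Video Diffusion checkpoint through diffusers. The specific checkpoint is an assumption on our side; other open options exist and the API will likely evolve.

```python
# Image-to-video sketch with Stable Video Diffusion (checkpoint is an assumption).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

seed_image = load_image("seed.png").resize((1024, 576))
frames = pipe(seed_image, decode_chunk_size=4).frames[0]   # a short clip of ~25 frames
export_to_video(frames, "clip.mp4", fps=7)
```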

And of course, the ultimate use-case would just be video editing via video-to-video models, which Pika now advertises in a very impressive way.

Now imagine that everything we presented above can be reversed: audio-to-text becomes text-to-audio, computer vision lets you describe images in plain words (image-to-text), you can analyze videos (video-to-text), and you can combine all of this to achieve… basically anything you want. Let's have a look at more examples, starting with the audio modality.

There are no truly open-source solutions for text-to-speech, but there are good models available on a non-commercial basis. Outside the tech heavyweights, Coqui probably deserves a mention.

As you can imagine, once you have 1/ speech-to-text, 2/ powerful LLMs and 3/ text-to-speech, you can basically have a vocal conversation with the LLM… using anyone's voice. It is not yet something that anyone can build, as latency can make the exchange feel very unnatural, but some pretty convincing demos have been shared lately.
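Here is a bare-bones sketch of how the three pieces can be wired together. All model names (a distilled Whisper for transcription, a small instruct LLM, Coqui's XTTS-v2 for synthesis) are assumptions for illustration, not the recipe behind any of the demos above.

```python
# Toy voice-chat loop: speech-to-text -> LLM -> text-to-speech (models are assumptions).
from transformers import pipeline
from TTS.api import TTS  # Coqui TTS

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")
llm = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def voice_reply(audio_path: str, voice_sample: str) -> str:
    question = asr(audio_path)["text"]                            # 1/ speech-to-text
    answer = llm(question, max_new_tokens=200,
                 return_full_text=False)[0]["generated_text"]     # 2/ LLM
    tts.tts_to_file(text=answer, speaker_wav=voice_sample,        # 3/ text-to-speech,
                    language="en", file_path="reply.wav")         #    cloning voice_sample
    return answer
```

In a real product the hard part is not this wiring but streaming the three stages so that latency stays conversational.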

Computer vision is another area expected to be a game-changer next year. OpenAI already has a powerful solution up and running, and Google DeepMind can be expected to catch up soon, even though the mind-blowing demo of their Gemini model at the beginning of December was totally staged, which led to a major controversy. Besides OpenAI and Google, there are interesting open-source models that can process images and return text. We mentioned Qwen-VL and CogVLM in our last post on Chinese models. A major contribution in the space has been LLaVA 1.5, released in October. LLaVA was derived from Llama-2-13B but, soon after, other teams replicated the architecture with Mistral-7B instead. BakLLaVA and CogVLM seem to be, at the time of writing, the go-to open-weight models for image analysis. Find here some basic instructions to run BakLLaVA locally. And the toolbox for visual analysis expands every day, as demonstrated by the CLIP as RNN approach released a few days ago.
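As a minimal illustration, here is image-to-text with a LLaVA-style model served through the transformers pipeline. The checkpoint name and prompt template below are assumptions; check the model card of whichever open-weight model you pick.

```python
# Image-to-text sketch with a LLaVA-style model (checkpoint and prompt are assumptions).
from transformers import pipeline

vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nDescribe this picture and list any visible text. ASSISTANT:"
out = vlm("screenshot.png", prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(out[0]["generated_text"])
```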

If potential use-cases are not obvious to you yet, consider the examples below. Of course, whatever applies to still images also applies to videos.

For now, no model seems to perfectly integrate all modalities. GPT-4 sort of does, but reinforcement learning makes it refuse so many requests that it ends up not being that helpful. However, since the building blocks are available separately, they can be hooked up or trained together to address very complex use-cases. Video translation (including lip tracking) is another great example. The best solutions in the space at the time of writing seem to be ElevenLabs and HeyGen. The results are genuinely incredible. Watch the video below, and another example here.

We will soon be able to generate content that does not represent anything real, but that will instead be sufficiently inspired by reality that it could be considered real. It is already happening, and it is not necessarily a good or a bad thing, but it will affect the way we consume information as well as the way we entertain ourselves. And very likely the way we work.

Movies based on fully AI-generated characters are probably a long way off given the resistance of key stakeholders, but TV ads are not. See below an example from Japan that was widely shared on social media. And then consider the possibility of an AI news channel. The second video below probably greatly exaggerates the capabilities of the firm in question in order to create buzz, but the fact is that the tech is here. Outside TV, the music industry is also expected to integrate AI tools. As usual, hackers go for end-to-end generation, whilst established players like Google DeepMind position themselves as assistants helping artists in their creative process.

For real-life examples of AI characters generating substantial revenues, you should turn to social media platforms or dating platforms… or adult content platforms. Illustrating everything in this blog might not be appropriate, so we'll stick to safe-for-work platforms here. First, three virtual influencers "managed" by digital agencies, which charge several thousand dollars for sponsored posts.

AI model Milla Sofia, 125k followers on Instagram
AI model Sudu, 240k followers on Instagram
AI model Aitana Lopez, 230k followers on Instagram, who reportedly makes $6k in monthly revenue on average

The AI dating scene is also thriving, as demonstrated by Aella - a very real person (a former sex worker turned data scientist, with 190k followers on Twitter/X) - who built an AI chatbot, or by Enias Caillau, who reportedly makes money with "AI companionship". The recent demo by Digi shows that these virtual companions, commonly referred to as waifus, can take any shape or form. As developers build a bridge over the proverbial uncanny valley, the ethical dimension probably deserves more attention.

Source: Digi

The above stories point to another ramification of recent progress in computer vision and pattern recognition: photorealistic and/or real-time avatars are now within reach.

Both videos above rely on 3D Gaussian splatting. The applications are not limited to rendering "3D-consistent sub-millimeter details such as hair strands and pores on dynamic face sequences". The technique can also be used for simultaneous localization and mapping (SLAM), photorealistically reconstructing real-world scenes from multiple posed cameras. For all of these applications, there are multiple concurrent projects building on each other's progress to push the entire vertical forward. Healthy competition between teams that do not gatekeep their research is the catalyst that should make VR/AR a theme in 2024.

Source: SplaTAM

Approaches other than 3D Gaussians can also have very promising applications for virtual/augmented reality… and not everyone shares their secret recipe, but the direction of travel seems clear. And once you have photorealistic avatars in 3D virtual worlds, all that is left to do is to… jump in.

Before delving into complex, immersive worlds, let's pause for a moment to discuss what AI means for the video game industry. If you have followed everything from the top of this post, it should not surprise you that, with the help of code assistants, diffusion models to generate images and potentially computer vision, you can build your own [basic] game. If you want to be really smart about it, do not omit the laws of physics.
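To make the point tangible, here is a toy example of the kind of code an assistant will happily produce for you: a ball bouncing under gravity in pygame. It is entirely illustrative and not taken from any of the demos linked above.

```python
# Toy "basic game" sketch: a ball bouncing under gravity (illustrative only).
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()

pos_y, vel_y = 100.0, 0.0
GRAVITY, BOUNCE = 900.0, 0.8           # px/s^2 and energy kept per bounce

running = True
while running:
    dt = clock.tick(60) / 1000.0       # seconds since last frame
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    vel_y += GRAVITY * dt              # the "laws of physics" part
    pos_y += vel_y * dt
    if pos_y > 440:                    # floor collision
        pos_y, vel_y = 440, -vel_y * BOUNCE

    screen.fill((20, 20, 30))
    pygame.draw.circle(screen, (240, 180, 60), (320, int(pos_y)), 20)
    pygame.display.flip()

pygame.quit()
```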

Powering non-player characters (NPCs) with LLMs in video games used to be prohibitively expensive when only GPT-4 was up to the task. That could change very soon and become a reality across the industry. In the example below, inspired by Generative Agents: Interactive Simulacra of Human Behavior, GPT-4 still powers the characters (called agents). The associated GitHub repository gives a good idea of how involved building a complex game still is… and that one is not particularly complex.
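A bare-bones version of such an NPC "agent" is easy to sketch: a system prompt holding the character's persona plus a rolling memory of past exchanges, sent to a chat model. The model name and persona below are placeholders, not the setup used in the repository mentioned above.

```python
# Minimal LLM-powered NPC sketch (model and persona are placeholders).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

class NPC:
    def __init__(self, persona: str, model: str = "gpt-4"):
        self.model = model
        self.memory = [{"role": "system", "content": persona}]

    def talk(self, player_line: str) -> str:
        self.memory.append({"role": "user", "content": player_line})
        reply = client.chat.completions.create(
            model=self.model, messages=self.memory, max_tokens=150
        ).choices[0].message.content
        self.memory.append({"role": "assistant", "content": reply})
        return reply

blacksmith = NPC("You are Brunhild, a gruff blacksmith in a medieval village. "
                 "Answer in one or two short sentences and stay in character.")
print(blacksmith.talk("Can you repair my sword before nightfall?"))
```

The cost problem comes from running this for dozens of characters, every few seconds, across millions of players, which is why cheaper models matter so much here.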

Irrespective of whether future video games incorporate the models discussed above, it is only fair to expect a revolution in the industry given the technical leaps made on the hardware side. This may take a while though: gaming is now a multi-billion-dollar industry and the previous investment cycle has yet to completely wash through, so more than 12 months will probably be necessary before a brand-new generation of blockbuster games appears.

Against this backdrop, Roblox should be a player to watch. They are one of the few with a long track record in virtual worlds (a.k.a. the metaverse), and they seem to have addressed the moderation issues that are typical of these platforms. As of mid-2023, they had 65m daily active users and 300m monthly active users, of which over 60% were under 16 years old. Over the summer they signed a partnership with Meta so that Roblox games can be available on Meta Quest devices. They have also invested heavily in an AI chatbot that lets users build virtual worlds by just typing prompts. In practice, the bot writes the code to create the virtual world. We have not tested it directly but, looking at the second video below, which is not an official Roblox demo for marketing purposes, it looks solid.

As we look further into the future, downstream effects of recent AI developments are expected in robotics. 2023 was already remarkable in many ways. The mere fact that two robotaxi companies (self-driving cabs) were allowed to operate in US cities is mind-blowing, even though Cruise lost its license in October after a series of incidents. Waymo remains under scrutiny, but insurance data seems to indicate that its accident rate is 75% lower than that of the average human driver. Autonomous vehicles may be an extreme example, but all sorts of wild things are happening right now, from robodogs equipped with LLMs and acting as tour guides, to "cute" robo-toys.

When trying to mentally gauge recent progress in robotics, comparing it with what happened around reasoning in 2022-23 would likely be a mistake. That is at least what Moravec's paradox implies: contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. Another reason is that training data for unsupervised learning is insufficient. One way to get around these issues is to focus on a smaller set of applications, like the evoBot below. And to address the lack of training data, you can count on Big Tech. Google DeepMind used vision-language models trained on Internet-scale data and combined it with data from 21 partner institutions to provide datasets in standardized formats, along with models, to explore the training of "generalist" robots. Nvidia, on the other hand, has the ambition to produce the next trillion tokens synthetically. The second video below is an example of a fully learning-based approach to humanoid locomotion: reinforcement learning took place in simulation and the robot was deployed to the real world "zero-shot", as per the research paper.

Although one might think that human anatomy includes evolutionary responses to threats that no longer exist (e.g. bipedal posture with long legs), the race to build the first humanoid is on. After all, virtually all of our environment has been shaped to be compatible with our anatomy…

Further still into the future, but leveraging today's innovations: the brain-computer interface (BCI). In the context of curative treatments, the prospect of BCI is relatively uncontroversial. Notable proofs of concept this year include the story of Keith Thomas, who suffers from tetraplegia and was able to regain full strength in both arms, even experiencing a 110 percent recovery in his right arm. He also recovered his sense of touch. Read the full story here. Another heartwarming story is that of Ann, who suffered a brainstem stroke that left her severely paralyzed: a brain implant now allows her to control an animated avatar that speaks with her "old" voice.

New ways to "read" someone's mind without invasive devices are also emerging. Until recently, the least invasive approach was functional magnetic resonance imaging (fMRI). If you have ever undergone an MRI, you probably know that it is not a particularly simple procedure, and certainly not a real-life setting. In the second half of the year, separate teams seem to have obtained promising results using electroencephalograms (EEG) to decode speech on the one hand, and thoughts on the other.

But the BCI industry still primarily banks on invasive tech to do much more than read thoughts: Neuralink, which was authorized to run human clinical trials this year, raised $323m in the second half of the year. Synchron raised $75m at the end of 2022 in a Series C round including Bill Gates and Jeff Bezos. Blackrock Neurotech is the other major player in the field (with a 20-year track record). The FDA has granted breakthrough device designation to a number of companies working on chips this year, including InBrain Neuroelectronics and Precision Neuroscience. If you are interested in following that space, we highly recommend massdevice.com.

This long blog post hopefully gave a good overview of the big trends around multimodality and of the trajectory we are on for the next few years. Many LLM releases are expected before the end of the year, so don't turn the page just yet. And more importantly, enjoy the holiday period. We will be (so) back in 2024 to check whether everything we have discussed here holds true.
