December 17th, 2023

2023 is not over yet, but everyone can already agree that it has been an eventful year on the tech front. The origin of the craze is a single product, ChatGPT, released approximately a year ago. It was just a preview of what would come next from OpenAI (GPT-4 was almost ready when ChatGPT was released), but it triggered a massive hype-wave that many LLM companies continue to surf a year later. Ironically, there was no major research breakthrough in the run-up to ChatGPT; ChatGPT was simply undeniable proof that, through exceptional execution around finetuning and reinforcement learning from human feedback, LLMs could constitute helpful assistants. Billions of dollars were raised on that premise in 2023. Millions were spent on lobbying and marketing to shape the future of the AI "industry". And thousands of models have been released since the leak of Meta's model parameters in March. As of December 2023, it is not clear whether there is a single demonstrated application of LLMs at scale inside a software stack (i.e. not prompting by individual users), but there seems to be a good line of sight on that as function calling improves.

So what should we expect for 2024? Progress in natural language processing will undoubtedly continue - maybe with new architectures (e.g. without transformers, see Mamba or StripedHyena) - but as far as the public is concerned, 2024 should be the year of truly multimodal AI. So, in this last blog post of the year, we focus on applications that are not purely text-to-text. This may look a bit messy (especially on desktop - apologies in advance) but it should help you set your expectations for 2024… and beyond.

Multimodality is nothing new. As a matter of fact, speech-to-text has been around for a while and it is now customary to download the transcript of Teams/Zoom meetings. Even outside video-call apps, the technology is accessible, either via API or by running models of various sizes on consumer devices. The go-to solution has long been OpenAI's Whisper models, but Meta recently launched a competing solution. For users who just want to play around, we would also recommend the smaller and faster models by Hugging Face, which are distilled versions of Whisper (English only for now).
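As a minimal sketch of what "running it on a consumer device" looks like, here is local transcription with one of those distilled Whisper checkpoints through the transformers library. The model name and audio file are assumptions for illustration; pick whatever checkpoint fits your hardware.

```python
# Minimal local transcription sketch (model name is an assumption).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",  # English-only distilled Whisper
    chunk_length_s=30,                       # process long recordings in 30s chunks
)

result = asr("meeting_recording.wav")        # any local audio file
print(result["text"])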

Before ChatGPT was released, text-to-image models had also started to break out, notably with Stable Diffusion and DALL-E. DALL-E mini (now known as Craiyon) probably contributed more to public awareness than DALL-E itself, as DALL-E was never truly opened up by OpenAI owing to safety concerns. The Stable Diffusion models, in contrast, were open-sourced in the summer of 2022 and gained traction in tech-savvy circles. But running these models is far from trivial, which remained a major obstacle to mainstream adoption until Midjourney came up with a turnkey solution. Text-to-image is here to stay; it will just get better, faster and likely cheaper. How fast? Check out this recent demo from Stability AI to get an idea.
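To make this concrete, here is a minimal text-to-image sketch with the diffusers library. The "stabilityai/sdxl-turbo" checkpoint is an assumption on our side (it is the few-step model behind the recent real-time demos); any Stable Diffusion checkpoint works with the same pipeline.

```python
# Minimal text-to-image sketch with diffusers (checkpoint is an assumption).
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    prompt="an isometric illustration of a tiny futuristic city, soft lighting",
    num_inference_steps=1,   # turbo models are designed for 1-4 steps
    guidance_scale=0.0,      # sdxl-turbo is trained without classifier-free guidance
).images[0]
image.save("city.png")
```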

Inference speed may not be the main hurdle, though. The bigger frustration can be not being able to express in plain language what you want to depict. Have you ever felt that a sketch would be a better way to prompt? Well, more than one party has built a solution for that: Adobe showcased it in June, and it has now become widely accessible through latent consistency models. You can run them locally, or you can use SaaS offerings. Check out the selection below; the Krea demos are amazing.
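Running this locally is within reach too. Below is a rough sketch-to-image example using a latent consistency LoRA so that generation is fast enough for interactive editing. The base model and LoRA names ("runwayml/stable-diffusion-v1-5", "latent-consistency/lcm-lora-sdv1-5") are assumptions, not the stack used by the products mentioned above.

```python
# Sketch-to-image with a latent consistency LoRA (model names are assumptions).
import torch
from diffusers import AutoPipelineForImage2Image, LCMScheduler
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

sketch = load_image("my_rough_sketch.png")   # your hand-drawn input
image = pipe(
    prompt="a watercolor landscape with mountains and a lake",
    image=sketch,
    strength=0.6,            # how far to move away from the sketch
    num_inference_steps=4,   # LCM needs only a handful of steps
    guidance_scale=1.0,
).images[0]
image.save("painted.png")
```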

Source: freepik.com

With these solutions, we are very much in the image-to-image paradigm. The main use-case is not so much generating a new image as editing an existing one. The ability to segment anything, to remove parts of a picture, to add new items and, more generally, to magnify any picture is going to fundamentally change the way we use illustrations. How can we be so sure? Because these models are not kept secret by big AI labs: both Meta and Google DeepMind develop powerful models and release some of the building blocks for third parties to develop their own apps. It would be an exaggeration to say that anyone can build the next Photoshop, but there are already plenty of promising challengers out there… and many more will come.
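For instance, here is a minimal sketch of point-based segmentation with Meta's segment-anything package, one of those released building blocks. The checkpoint path and click coordinates are assumptions; download the weights from the official repository first.

```python
# Point-based segmentation with segment-anything (checkpoint path is an assumption).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click roughly on the object we want to cut out
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]          # keep the highest-scoring mask
cv2.imwrite("mask.png", best_mask.astype(np.uint8) * 255)
```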

EMU demo (Meta)
Imagen 2 (Google DeepMind)

Given what happened in the second half of 2023 in terms of video generation using diffusion models, you had better be ready for a paradigm shift in video similar to the one image generation has gone through over the last 15 months. As with images, the story starts with basic text-to-video generation: today, multiple options exist, although public solutions are limited to clips of a few seconds. The first two videos below are official videos from Runway and Stability AI, so the showcased examples may have been cherry-picked. However, looking at published research on the topic, such as the recent W.A.L.T, there is little doubt about where the tech will be just 12 months from now.

W.A.L.T text-to-video examples

And as with images, solutions are starting to emerge to generate videos from a seed image rather than from text. These are not just API-only solutions from established players in this nascent industry; there are actual open-weight options, often coming from Asia.
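To give an idea of what image-to-video looks like in practice, here is a minimal sketch using the open-weight Stable Video Diffusion checkpoint through diffusers. The specific checkpoint is an assumption on our side; other open options exist and the API will likely evolve.

```python
# Image-to-video sketch with Stable Video Diffusion (checkpoint is an assumption).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

seed_image = load_image("seed.png").resize((1024, 576))
frames = pipe(seed_image, decode_chunk_size=4).frames[0]   # a short clip of ~25 frames
export_to_video(frames, "clip.mp4", fps=7)
```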

And of course, the ultimate use-case would just be video editing via video-to-video models, which Pika now advertises in a very impressive way.

Now imagine that everything we presented above can be reversed: audio-to-text becomes text-to-audio, computer vision lets you describe images in plain words (image-to-text), you can analyze videos (video-to-text), and you can combine all of this to achieve… basically anything you want. Let's have a look at more examples, starting with the audio modality.

There are no truly open-source solutions for text-to-speech, but there are good models available on a non-commercial basis. Outside the tech heavyweights, Coqui probably deserves a mention.

As you can imagine, once you have 1/ speech-to-text, 2/ powerful LLMs and 3/ text-to-speech, you can basically have a vocal conversation with the LLM… using anyone's voice. It is not yet something that anyone can build, as latency can make the exchange feel very unnatural, but some pretty convincing demos have been shared lately.
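Here is a bare-bones sketch of how the three pieces can be wired together. All model names (a distilled Whisper for transcription, a small instruct LLM, Coqui's XTTS-v2 for synthesis) are assumptions for illustration, not the recipe behind any of the demos above.

```python
# Toy voice-chat loop: speech-to-text -> LLM -> text-to-speech (models are assumptions).
from transformers import pipeline
from TTS.api import TTS  # Coqui TTS

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")
llm = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def voice_reply(audio_path: str, voice_sample: str) -> str:
    question = asr(audio_path)["text"]                            # 1/ speech-to-text
    answer = llm(question, max_new_tokens=200,
                 return_full_text=False)[0]["generated_text"]     # 2/ LLM
    tts.tts_to_file(text=answer, speaker_wav=voice_sample,        # 3/ text-to-speech,
                    language="en", file_path="reply.wav")         #    cloning voice_sample
    return answer
```

In a real product the hard part is not this wiring but streaming the three stages so that latency stays conversational.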

Computer vision is another area expected to be a game-changer next year. OpenAI already has a powerful solution up and running, and Google DeepMind can be expected to catch up soon, even though the mind-blowing demo of their Gemini model at the beginning of December was totally staged, which led to a major controversy. Besides OpenAI and Google, there are interesting open-source models that can process images and return text. We mentioned Qwen-VL and CogVLM in our last post on Chinese models. A major contribution in the space has been LLaVA 1.5, released in October. LLaVA was derived from Llama-2-13B but, soon after, other teams replicated the architecture with Mistral-7B instead. BakLLaVA and CogVLM seem to be, at the time of writing, the go-to open-weight models for image analysis. Find here some basic instructions to run BakLLaVA locally. And the toolbox for visual analysis expands every day, as demonstrated by the CLIP as RNN approach released a few days ago.
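As a minimal illustration, here is image-to-text with a LLaVA-style model served through the transformers pipeline. The checkpoint name and prompt template below are assumptions; check the model card of whichever open-weight model you pick.

```python
# Image-to-text sketch with a LLaVA-style model (checkpoint and prompt are assumptions).
from transformers import pipeline

vlm = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nDescribe this picture and list any visible text. ASSISTANT:"
out = vlm("screenshot.png", prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(out[0]["generated_text"])
```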

If potential use-cases are not obvious to you yet, consider the examples below. Of course, whatever applies to still images also applies to videos.

For now, no model seems to perfectly integrate all modalities. GPT-4 sort of does, but reinforcement learning makes it refuse so many requests that it ends up not being that helpful. However, since the building blocks are available separately, they can be hooked up or trained together to address very complex use-cases. Video translation (including lip tracking) is another great example. The best solutions in the space at the time of writing seem to be ElevenLabs and HeyGen. The results are genuinely incredible. Watch the video below, and another example here.

We will soon be able to generate content that does not represent anything real, but that will instead be sufficiently inspired by reality that it could be considered real. It is already happening, and it is not necessarily a good or a bad thing, but it will affect the way we consume information as well as the way we entertain ourselves. And very likely the way we work.

Movies based on fully AI-generated characters are probably a long way off given the resistance of key stakeholders, but TV ads are not. See below an example from Japan that was widely shared on social media. And then consider the possibility of an AI news channel. The second video below probably greatly exaggerates the capabilities of the firm in question in order to create buzz, but the fact is that the tech is here. Outside TV, the music industry is also expected to integrate AI tools. As usual, hackers go for end-to-end generation, whilst established players like Google DeepMind position themselves as assistants helping artists in their creative process.

For real-life examples of AI characters generating substantial revenues, you should turn to social media platforms or dating platforms… or adult content platforms. Illustrating everything in this blog might not be appropriate, so we'll stick to safe-for-work platforms here. First, three virtual influencers "managed" by digital agencies, which charge several thousand dollars for sponsored posts.

AI model Milla Sofia, 125k followers on Instagram
AI model Sudu, 240k followers on Instagram
AI model Aitana Lopez, 230k followers on Instagram, who reportedly makes $6k in monthly revenue on average

The AI dating scene is also thriving, as demonstrated by Aella - a very real person (a former sex worker turned data scientist, with 190k followers on Twitter/X) - who built an AI chatbot, or by Enias Caillau, who reportedly makes money with "AI companionship". The recent demo by Digi shows that these virtual companions, commonly referred to as waifus, can take any shape or form. As developers build a bridge over the proverbial uncanny valley, the ethical dimension probably deserves more attention.

Source: Digi

The above stories point to another ramification of recent progress in computer vision and pattern recognition: photorealistic and/or real-time avatars are now within reach.

Both videos above rely on 3D Gaussian splatting. The applications are not limited to rendering "3D-consistent sub-millimeter details such as hair strands and pores on dynamic face sequences". The technique can also be used for simultaneous localization and mapping (SLAM), photorealistically reconstructing real-world scenes from multiple posed cameras. For all of these applications, there are multiple concurrent projects building on each other's progress to push the entire vertical forward. Healthy competition between teams that do not gatekeep their research is the catalyst that should make VR/AR a theme in 2024.

Source: SplaTAM

Approaches other than 3D Gaussians can also have very promising applications for virtual/augmented reality… and not everyone shares their secret recipe, but the direction of travel seems clear. And once you have photorealistic avatars in 3D virtual worlds, all that is left to do is to… jump in.

Before delving into complex, immersive worlds, let's pause for a moment to discuss what AI means for the video game industry. If you have followed everything from the top of this post, it should not surprise you that, with the help of code assistants, diffusion models to generate images and potentially computer vision, you can build your own [basic] game. If you want to be really smart about it, do not omit the laws of physics.
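To make the point tangible, here is a toy example of the kind of code an assistant will happily produce for you: a ball bouncing under gravity in pygame. It is entirely illustrative and not taken from any of the demos linked above.

```python
# Toy "basic game" sketch: a ball bouncing under gravity (illustrative only).
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()

pos_y, vel_y = 100.0, 0.0
GRAVITY, BOUNCE = 900.0, 0.8           # px/s^2 and energy kept per bounce

running = True
while running:
    dt = clock.tick(60) / 1000.0       # seconds since last frame
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    vel_y += GRAVITY * dt              # the "laws of physics" part
    pos_y += vel_y * dt
    if pos_y > 440:                    # floor collision
        pos_y, vel_y = 440, -vel_y * BOUNCE

    screen.fill((20, 20, 30))
    pygame.draw.circle(screen, (240, 180, 60), (320, int(pos_y)), 20)
    pygame.display.flip()

pygame.quit()
```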

Powering non-player characters (NPCs) with LLMs in video games used to be prohibitively expensive when only GPT-4 was up to the task. That could change very soon and become a reality across the industry. In the example below, inspired by Generative Agents: Interactive Simulacra of Human Behavior, GPT-4 still powers the characters (called agents). The associated GitHub repository gives a good idea of how involved building a complex game still is… and that one is not particularly complex.
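A bare-bones version of such an NPC "agent" is easy to sketch: a system prompt holding the character's persona plus a rolling memory of past exchanges, sent to a chat model. The model name and persona below are placeholders, not the setup used in the repository mentioned above.

```python
# Minimal LLM-powered NPC sketch (model and persona are placeholders).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

class NPC:
    def __init__(self, persona: str, model: str = "gpt-4"):
        self.model = model
        self.memory = [{"role": "system", "content": persona}]

    def talk(self, player_line: str) -> str:
        self.memory.append({"role": "user", "content": player_line})
        reply = client.chat.completions.create(
            model=self.model, messages=self.memory, max_tokens=150
        ).choices[0].message.content
        self.memory.append({"role": "assistant", "content": reply})
        return reply

blacksmith = NPC("You are Brunhild, a gruff blacksmith in a medieval village. "
                 "Answer in one or two short sentences and stay in character.")
print(blacksmith.talk("Can you repair my sword before nightfall?"))
```

The cost problem comes from running this for dozens of characters, every few seconds, across millions of players, which is why cheaper models matter so much here.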

Irrespective of whether future video games incorporate the models discussed above, it is only fair to expect a revolution in the industry given the technical leaps made on the hardware side. This may take a while though: gaming is now a multi-billion-dollar industry and the previous investment cycle has yet to completely wash through, so more than 12 months will probably be necessary before a brand-new generation of blockbuster games appears.

Against this backdrop, Roblox should be a player to watch. They are one of the few with a long track record in virtual worlds (a.k.a. the metaverse), and they seem to have addressed the moderation issues that are typical of these platforms. As of mid-2023, they had 65m daily active users and 300m monthly active users, of which over 60% were under 16 years old. Over the summer they signed a partnership with Meta so that Roblox games can be available on Meta Quest devices. They have also invested heavily in an AI chatbot that lets users build virtual worlds by just typing prompts. In practice, the bot writes the code to create the virtual world. We have not tested it directly but, looking at the second video below, which is not an official Roblox demo for marketing purposes, it looks solid.

As we look further into the future, downstream effects of recent AI developments are expected in robotics. 2023 was already remarkable in many ways. The mere fact that two robotaxi companies (self-driving cabs) were allowed to operate in US cities is mind-blowing, even though Cruise lost its license in October after a series of incidents. Waymo remains under scrutiny, but insurance data seems to indicate that its accident rate is 75% lower than that of the average human driver. Autonomous vehicles may be an extreme example, but all sorts of wild things are happening right now, from robodogs equipped with LLMs and acting as tour guides, to "cute" robo-toys.

When trying to mentally gauge recent progress in robotics, comparing it with what happened around reasoning in 2022-23 would likely be a mistake. That is at least what Moravec's paradox implies: contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. Another reason is that training data for unsupervised learning is insufficient. One way to get around these issues is to focus on a smaller set of applications, like the evoBot below. And to address the lack of training data, you can count on Big Tech. Google DeepMind used vision-language models trained on Internet-scale data and combined it with data from 21 partner institutions to provide datasets in standardized formats, along with models, to explore the training of "generalist" robots. Nvidia, on the other hand, has the ambition to produce the next trillion tokens synthetically. The second video below is an example of a fully learning-based approach to humanoid locomotion: reinforcement learning took place in simulation and the robot was deployed to the real world "zero-shot", as per the research paper.

Although one might think that human anatomy includes evolutionary responses to threats that no longer exist (e.g. bipedal posture with long legs), the race to build the first humanoid is on. After all, virtually all of our environment has been shaped to be compatible with our anatomy…

Further still into the future, but leveraging today's innovations: the brain-computer interface (BCI). In the context of curative treatments, the prospect of BCI is relatively uncontroversial. Notable proofs of concept this year include the story of Keith Thomas, who suffers from tetraplegia and was able to regain full strength in both arms, even experiencing a 110 percent recovery in his right arm. He also recovered his sense of touch. Read the full story here. Another heartwarming story is that of Ann, who suffered a brainstem stroke that left her severely paralyzed: a brain implant now allows her to control an animated avatar that speaks with her "old" voice.

New ways to "read" someone's mind without invasive devices are also emerging. Until recently, the least invasive approach was functional magnetic resonance imaging (fMRI). If you have ever undergone an MRI, you probably know that it is not a particularly simple procedure, and certainly not a real-life setting. In the second half of the year, separate teams seem to have obtained promising results using electroencephalograms (EEG) to decode speech on the one hand, and thoughts on the other.

But the BCI industry still primarily banks on invasive tech to do much more than read thoughts: Neuralink, which was authorized to run human clinical trials this year, raised $323m in the second half of the year. Synchron raised $75m at the end of 2022 in a Series C round including Bill Gates and Jeff Bezos. Blackrock Neurotech is the other major player in the field (with a 20-year track record). The FDA has granted breakthrough device designation to a number of companies working on chips this year, including InBrain Neuroelectronics and Precision Neuroscience. If you are interested in following that space, we highly recommend massdevice.com.

This long blog post hopefully gave a good overview of the big trends around multimodality and of the trajectory we are on for the next few years. Many LLM releases are expected before the end of the year, so don't turn the page just yet. And more importantly, enjoy the holiday period. We will be (so) back in 2024 to check whether everything we have discussed here holds true.
