November 10th, 2023

The rise of Large Language Models (LLMs) corresponds to a major paradigm shift, and not only for computer science. For over 150 years, the use of symbols instead of words has been the backbone of scientific progress as mathematical abstraction permeated every field. Make no mistake, symbols still underpin LLMs, but the fact that you can now use sentences as an interface with software revives languages, and therefore countries and regions. This occurs after decades of globalization, which had convinced everyone that borders and governments were irrelevant. Covid revealed important flaws that had been ignored for many years in several areas (public health, information, supply chains), and AI looks like a real opportunity to regain control for parties whose influence had been on the decline. Today, we examine the role that governments, regulatory bodies or cultural leaders want to play in the AI revolution. Without any guarantee of success.

Last time, we touched on a number of models of non-American origin: Mistral for Europe, Qwen for China and Falcon for Abu Dhabi. Given the appetite of the open-source community for new models to play with, whether for finetuning or distillation, any new model with state-of-the-art capabilities benefits from an echo chamber with global reach. For this reason, these tools can constitute relatively cheap levers of soft power, which is very clear in the case of Falcon.

Falcon-40B is one of the few truly open-source models. As mentioned last time, there are nuances of openness in AI, and Abu Dhabi's Technology Innovation Institute opted for one of the most permissive licenses, Apache 2.0. That wasn't the case initially though: the Falcon models released in March were not open for commercial use, and the transition to Apache 2.0 only took place at the end of May. Openness grew on TII, to the point that, in September, they released the largest open-source model on the market with Falcon-180B, i.e. with 180 billion parameters. Like the Llama-2 models, Falcon-180B is not "truly" open-source, but this level of openness is unparalleled for such a large model (the largest Llama-2 model has 70 billion parameters). Model size only loosely correlates with performance, so Falcon-180B is not necessarily better than Llama-2-70B, but in terms of marketing, its effect is undeniable; Abu Dhabi sent a message.

Under the hood, the main difference is how inputs are processed: how they are tokenized and how/when neurons activate, which largely depends on training. There can also be architectural differences and/or reinforcement learning steps that influence the final product. The bottom line is that, upon release, no one can be sure of what information can be processed reliably.
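To make the tokenization point concrete, here is a minimal sketch using the Hugging Face transformers library to compare how two open models split the same sentence. The repo ids are only examples (availability and gating on the Hub may vary), and the sentence is an arbitrary French one; this is an illustration, not a benchmark.

```python
from transformers import AutoTokenizer

# Illustrative sketch: the same sentence can be split very differently
# depending on the model's tokenizer. Repo ids are examples only.
text = "Les grands modèles de langage parlent-ils vraiment toutes les langues ?"

for repo in ["tiiuae/falcon-7b", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(repo)
    pieces = tok.tokenize(text)
    print(f"{repo}: {len(pieces)} tokens -> {pieces[:10]}")
```

The token counts and splits differ from model to model, which is one reason two models of similar size can handle the same language very differently.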

This is true for any model, even if a detailed research paper is published. And this raises the question of whether certain concepts can be censored or promoted.

Although these concerns seem most legitimate for models sponsored by States where basic human rights are not upheld, all models are potentially exposed to these issues. It is something we first touched on in March, when we commented on this Quanta Magazine article on LLMs and cryptography. In June, new research showcased some of these scenarios, both for content injection (promotion) and over-refusal (censorship).

Source: arXiv

Chinese models

Now let's address the elephant in the room: Chinese models. Even though we know the control that Chinese authorities have over Tech players - we'll come back to this later - models originating from China are not State-sponsored. They can hardly be compared to Falcon. These models are particularly interesting because they are developed by blue-chip Tech companies and are recognized as some of the best options on the market.

Before covering LLMs, let's talk about embeddings. Embeddings convert objects like text, images or audio into vectors (arrays of numbers) representing the meaning or context of the object. When we said in the introduction that LLMs relied on symbols, this is what we meant. But embeddings do not need to be used in the context of LLMs; they can be used on a stand-alone basis, for semantic search for example. To that end, dedicated embedding models exist and, at the time of writing, 5 of the top 9 embedding models on Hugging Face's Massive Text Embedding Benchmark (MTEB) leaderboard were Chinese. BGE models are developed by the Beijing Academy of Artificial Intelligence, GTE models are developed by the Alibaba DAMO Academy, whilst the origin of stella-large-zh is not entirely clear to us (all the information about Infgrad is in Chinese). Cohere models only work via API, which is not a deal-breaker for everyone, but the Chinese models are the go-to solutions for embeddings these days.
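For the curious, here is a minimal sketch of what stand-alone semantic search with one of these embedding models looks like, using the sentence-transformers library. The BGE checkpoint and the toy corpus are placeholders chosen for illustration, not a recommendation.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal semantic-search sketch (model id and corpus are illustrative).
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

corpus = [
    "Falcon-180B was released by Abu Dhabi's Technology Innovation Institute.",
    "The Qwen models were developed by Alibaba Cloud.",
    "DORA is an EU regulation on digital operational resilience.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

query = "Who built the Falcon models?"
query_emb = model.encode(query, normalize_embeddings=True)

# Cosine similarity between the query and each document; highest score wins.
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = scores.argmax().item()
print(corpus[best], float(scores[best]))
```

No LLM is involved anywhere in this pipeline: the embedding model alone turns text into vectors, and retrieval is just a similarity ranking over those vectors.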

The Qwen suite is one of the most praised families of small LLMs in the open-source landscape. It was developed by Alibaba Cloud and includes a 7B model and a 14B model, which are typically compared to the Llama-2 and Mistral models. There is also a multimodal model, Qwen-VL, derived from the 7B model. It was one of the first small models to support computer vision, something that will be the norm by the end of the year.

On that note, Alibaba and Tencent recently invested in Zhipu AI, which just released CogVLM, an extremely promising open-source vision model. Back in March, we pointed out that multimodal models would force us to re-think many aspects of cyber-security as, despite the claims in the GPT-4 system card, it seemed obvious that the model could crack captchas. Seven months later, you can just go to the CogVLM GitHub repository to train a model to solve captchas…

Note that, at the time of writing, CogVLM is not available on Hugging Face, and neither is the latest generation of Qwen models announced on October 31st. It may have something to do with the blocking of Hugging Face in China. It is notable that GitHub is still accessible, which shows how critical this service is.

It is fair to assume that the increasing constraints around data usage and model training in China (see below) will eventually affect the momentum of Chinese players but, for the moment, they seem to be thriving. As a matter of fact, as we write this post, 4 of the top 5 small open-source base LLMs (just pretrained, not finetuned) according to the Hugging Face leaderboard are Chinese. This includes models from Qwen (Alibaba), InternLM (the Shanghai Artificial Intelligence Laboratory, in collaboration with SenseTime Technology, the Chinese University of Hong Kong, and Fudan University) and 01.AI (a startup from Chinese computer scientist and former Google exec Kai-Fu Lee, introducing another strange variation of "open-source").

Source: Yi license

It is relatively easy to "game" evaluations by training the models on the test data sets. But self-disclosed benchmarks are always useful to find out which models are considered "comparable" by the developers themselves. The Chinese models mentioned above are often compared to Baichuan2, another open-source family of models developed by Baichuan Intelligence Inc. (again backed by Alibaba and Tencent).

And here is something interesting uncovered by Susan Zhang: the model's vocabulary has unusual special tokens. In simplified terms, a token is a part of a word, but sometimes, in order to process certain sequences of words that have a specific meaning, the entire sequence is represented as one token. According to Susan Zhang, in Chinese, special tokens are easy to spot given that few "normal" tokens have more than three characters. And the list of tokens with more than three characters in Baichuan2's vocabulary includes "Epidemic prevention", "Coronavirus disease", "Committee", "Xi Jinping", "Coronavirus", "Nucleic acid amplification testing", "New coronary virus", "wear mask", "Communist Party", "People's Republic of China", "Communist Party of China", "General Secretary Xi Jinping", "Copyright belongs to the original author" and "The copyright belongs to the original author". In an extended version of the vocabulary released mid-September, she even noticed tokens meaning "Guided by Xi Jinping's thoughts of socialism with Chinese characteristics in the new era" or "On our journey to achieve beautiful ideals and big goals".
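This kind of inspection is easy to reproduce on any open tokenizer. Below is a rough sketch of how one might scan a vocabulary for unusually long Chinese tokens; the repo id and the "more than three characters" heuristic are assumptions taken from the observation above, not an exact reproduction of Susan Zhang's analysis.

```python
from transformers import AutoTokenizer

# Sketch: list vocabulary entries containing more than three Chinese characters.
# The repo id is an assumption; Baichuan tokenizers need trust_remote_code.
tok = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base", trust_remote_code=True
)

def is_cjk(ch: str) -> bool:
    # Basic CJK Unified Ideographs range.
    return "\u4e00" <= ch <= "\u9fff"

suspicious = [
    token
    for token in tok.get_vocab()
    if sum(is_cjk(ch) for ch in token) > 3
]
print(len(suspicious), suspicious[:20])
```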

One can look at the glass half empty, but when it comes to open-source models the glass is actually half full, because third parties can perform these analyses and form their own opinion as to whether they should use the model. Open-source has downsides, but imagine the scenario where the model is a complete black box for users: who even knows whether other actions could be initiated when a neuron gets activated? You may not even need such a granular level of monitoring if you trust this recent work on explainability by Anthropic, which extracts more interpretable, monosemantic units of analysis from model activations.

For the moment the science-fiction scenarios seem remote, but this test of the Baidu model at the beginning of September looks like a warning that one should take seriously. In comparison, the fact that 01.AI's Yi model is so transparent may be a reason to worry for Kai-Fu Lee (just look up Chen Shaojie from DouYu, Bao Fan from China Renaissance or Guo Guangchang from Fosun). It is worth highlighting that, 10 years ago, Kai-Fu Lee led a successful campaign to unblock GitHub in China.

The Baidu testing was performed by ChinaTalk, a great source to keep track of China's policy regarding AI and Tech more generally. For hardware, SemiAnalysis and Tom's Hardware are our go-to sources, although some of the information about Chinese progress in chip manufacturing should probably be taken with a pinch of salt. Chris Miller talked about a "Chip War" in his book, and communication about the capabilities of the respective camps sometimes does look like war propaganda. For context, China has been subject to restrictions on Nvidia GPUs for a long time. Until recently, only exports of A100 and H100 GPUs to China (and to countries suspected of trading with China) were banned, along with the equipment to build them (see here). Most recently, the US extended restrictions to less powerful models like the H800 or A800, but also to gaming GPUs that can be used for machine learning, like the RTX 4090. It is not just about impeding Chinese training runs: given the size of the Chinese population, inference may be a bigger challenge according to Dylan Patel from SemiAnalysis. In retaliation, Beijing blocked an acquisition by Intel of an Israeli manufacturer which had been pending for a year.

Regulatory frameworks

One area where China is not lagging behind is AI regulation. China already has a framework up and running, and it covers the entire data value chain. For years, China has had control over data: despite the Personal Information Protection Law, which came into effect at the end of 2021 (here is a comparison with the GDPR in the EU), the government can access data under the National Intelligence Law introduced in 2017, which mandates all Chinese companies to cooperate with national intelligence efforts. This has been the primary source of concern around TikTok lately, and the reason why the app is blocked in several countries. Before that, the National Intelligence Law triggered several high-profile scandals around Huawei, as the Chinese company had contracts for infrastructure hardware in the West. There have been accusations of espionage (an executive of the company was detained in Canada for 3 years between 2018 and 2021) and multiple contracts were rescinded. There is limited evidence that the Chinese authorities actually use data from foreign users, but they have, for several years, incentivized Tech companies to leverage as much data as possible to build AI solutions with controversial use-cases internally, like the social credit system or ethnic recognition in video-surveillance technology.

With the rise of LLMs, China was a first mover, introducing new regulations last summer that apply to public-facing models. This new framework insists on social mores, ethics and morality, and puts the onus on training data (model providers should register and share training data on request). Seven agencies are in charge of overseeing compliance with the new rules. China is clearly more concerned about upholding "core socialist values" than about an existential threat to the human race. If anything, the x-risk they seem to perceive is for the Establishment. Nonetheless, the country joined the US, the UK and the EU at the AI Summit in the UK last week, and signed a declaration to address AI's "catastrophic risk to humanity", thereby showing that they will surf the global AI regulation wave to reinforce their control.

Just before the AI Summit, the US issued an executive order to announce their own regulatory framework for Artificial Intelligence. If you have followed Chinese policy, you can't help seeing similarities, even though the underlying motivations differ. The scope is narrower in the sense that it gives leeway to small models and it is sector-focused, e.g. with separate sections for biotechnologies, cybersecurity or justice. But at the same time, the scope is de facto much broader: not only because it is not limited to public-facing applications, but also because all Infrastructure-as-a-Service players meeting the thresholds (equivalent to ~50k H100 GPUs) are US-based, so the entire world is concerned. One interesting/innovative part of the executive order is that it also details how the US intend to promote innovation by attracting talent to the US (i.e. streamlining visa applications). Otherwise, just like China, the US will establish a number of agencies to oversee the AI sector, and, like China, the US legitimately focus on training data (where do they come from? are they biased?) and on the destabilization potential of synthetic digital content - obviously without the "core socialist values" spin. However, to date, there is no solution that satisfies the requirement of both countries to watermark AI-generated content.

So, when the AI Summit started, both the US and China had released their roadmaps. The EU has been working on the "world's first comprehensive AI law" for over two years, but the democratic process is hardly compatible with the current pace of the AI industry (which may be a preview of what bureaucratic organizations can expect). It certainly did not help that intense lobbying for strong controls only started at the very end of the process AND that Mistral, who sit at the opposite end of the spectrum, subsequently established themselves as the regional champion. So France (Mistral) and Germany (Aleph Alpha) now fundamentally oppose inflexible frameworks, which leads to a deadlock. This is very fresh news, so it may be too early to declare the AI Act dead. In any case, what we had drafted before today's session could be recycled into an obituary of the AI Act if the discussions collapse: it was initiated between 2017 and 2019, at the request of the European Council, and, since 2021, the European Commission have worked with experts in the field to put together a proposal that addresses the EU's usual definition of "unacceptable" risks (cognitive behavioral manipulation, social scoring, biometric identification). The paradigm obviously shifted at the end of 2022 with the release of ChatGPT, the first mainstream - and versatile - application of AI; nonetheless, the "risk of extinction" only became a key theme in the summer of 2023. If it's real, how could they have missed it for so long? Or are they just going with the flow, playing the game of internal politics?

Throwing internal politics into the mix is certainly helpful to understand the dynamic in the UK, the host of the AI Summit.

Back in April, the country was expected to be the main beneficiary of the EU's inability to come up with a framework that actually fosters innovation (including open-source). At the time, both the Tories and Labour praised the opportunities of AI, but the stance quickly changed as existential risks took center stage, which set the tone of the AI Summit. The debate involving diplomats and Tech experts in Bletchley looked healthy and balanced, but it would be naive to trust UK politicians to display such self-control as we get closer to the general election. If you have enjoyed the show of UK politics since the Brexit referendum, you're probably already waiting, popcorn in hand, for a double-decker bus to start touring the country with the message that "AI will kill us all".

AI risks

Using the worst possible measurement method, i.e. our own assessment of vibes on social media, it seems to us that the AI-doom rhetoric peaked in September 2023 and is losing a bit of momentum. There is still a lot of wind in Doomers' sails, and their credibility does not seem to be at risk so long as they don't bring up genetically modified viruses that turn your body into a bitcoin-mining machine (yes, it's a true story). From an outsider's perspective, polarization around AI risk is extremely difficult to analyze, as even the Founding Fathers of the field, the likes of LeCun, Hinton or Bengio, those who should be the most knowledgeable about capabilities and potential threats, can have contrasting views. A good question may be: are there Founding Mothers of AI? The point here is not to highlight that women are underrepresented in the field - which is true - but simply to ask female researchers what they think. Amongst experts, a gender divide around existential risks has been noticed, and this probably deserves to be explored further.

Since last summer, a number of polls have shown that, beyond the relatively small niche of AI experts, the broader population largely rejects the prospect of powerful AI. Without downplaying the significance of these polls, and despite blatant question framing in some cases, there is nothing surprising about concerns around a new technology, particularly at such an early stage. Policy makers will have the responsibility to decide whether this cognitive bias represents an absolute line in the sand or whether they should ignore it to facilitate adoption.

Today, extinction risks are theoretical as they imply models that are orders of magnitude more powerful than current tech, and AGI is very unlikely to emerge within the LLM paradigm. But there are aspects that governments and policy makers can address immediately, starting with Intellectual Property. In creative industries, generative AI has already started to disrupt established frameworks:

  • Downstream: most countries have ruled that model outputs cannot be protected1, so everything comes down to the terms of use (license) of the models, which is a commercial matter. Restrictions on the use of outputs are probably extremely difficult to enforce but, so far, there has not been any public example of a dispute on this front. It does not mean that users abide, it just means that companies do not bother going after them, unless they are a direct threat to their business. So small players may have a sword of Damocles hanging over their heads, but everyone behaves... for the moment.
  • Upstream: the situation is much more heated regarding training data, as artists - who relied on usual IP protections - argue that models trained on their work without consent allow infinite plagiarism. The most prominent case in the US (McKernan, Andersen and Ortiz vs. Stability AI, Midjourney and DeviantArt) is for the moment not going so well for the plaintiffs. Writers with similar claims against Meta and OpenAI are facing similar pushback from judges in the US, at least on the argument that model outputs violate copyrights (the claim that training itself represents infringement is dealt with separately). Japan smartly nipped this in the bud in June, by clarifying that training was "fair use" but that outputs cannot directly replicate the style of an artist. Over the summer we have seen large generative AI companies embrace this approach (DALL-E declines prompts to imitate contemporary artists). And even for training, OpenAI have engaged with media companies in order to use their content without raising copyright issues. The call for data partnerships was even extended to the public.

Copyright issues and plagiarism are very good reasons for artists to worry but, ultimately, the fear is that they will simply lose their jobs because AI is able to provide satisfactory results much faster and at scale. Today the threat is most obvious for the creative industries, but substantially all desk jobs will be exposed to AI competition in a not-too-distant future. And governments will likely need to step in to help the workers concerned to transition. To date, there do not seem to be any examples of AI transition policies, so those workers are left to organize and fight back. This is precisely what the Writers Guild of America did last summer: after almost 5 months of strike, they secured, for 3 years, the assurance that AI-generated content cannot be used as source material nor used to edit a writer's work (writers are allowed to use it as if it were their own work). For actors, the issue seems even more complex, but they have leverage for now. Analysts working in finance may not be so lucky when their bosses realize that there is no better return on investment than to "augment" half of the workforce with AI, and to let go of the other half.

Addressing potential systemic risks associated with Tech players is also on the horizon for regulators. It is not just about AI here, but really about datacenters. AWS and Microsoft are already under scrutiny in the UK, for example. And systemic risks can only increase from here as private models like OpenAI's (Microsoft Azure) or Anthropic's (AWS) start being integrated into every company's software stack, and all workflows imply multiple API calls to a remote server. If you are curious, decentralized alternatives exist and seem to be getting some traction; Petals is the most notable one. In Europe, the main prudential agencies have teamed up to issue a report on information and communication technology providers (ICTs) in the context of the Digital Operational Resilience Act (DORA). There is also an antitrust dimension to these considerations, which is highly likely to be a theme in the near future. And of course a geopolitical dimension, given all top infra players are US companies, and therefore subject to US laws. The EU's crackdown on Tech oligopolies is a not-so-subtle way to counter American soft power. But the probability of success for the EU is low: as there is no real alternative to American "compute", and given the requirements of the Executive Order, the US de facto have similar control over AI as they have over global financial transactions through the US Dollar. What is interesting is that, even in the US, the oligopolies are being challenged, as demonstrated by the recent lawsuit by the Federal Trade Commission and 17 states against Amazon, or Google's landmark antitrust trial over alleged anticompetitive deals with Apple and other companies to place its search engine.

Next time, we'll cover recent developments in robotics, Virtual/Augmented Reality, brain-computer interfaces… and more generally all the themes we have not elaborated on since June.


1 Worth reading the strange story of this inventor who claims that his models are sentient so he can protect the outputs.