Semi-Agentic Knowledge Base
2024-01-20 - Content Management, Artificial Intelligence, Web Development

Background

When we re-built the pitti.io website last summer, our objective from day one was to automate the knowledge base as much as possible by leveraging AI tools. It is still curated by humans, but the heavy lifting is done by LLMs or plain embedding models: in practice, third-party content is selected, URLs are sent to a model that extracts information and drafts entries in the database, and finally a human reviews them before validation.
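
To make that flow concrete, here is a minimal sketch of a single pass through the pipeline. The function name, model identifier, prompt and length cap are illustrative assumptions, not the app's actual code:

```python
# Minimal sketch of the curation flow: fetch a page, ask an LLM to draft
# an entry, and leave it pending for human review. Illustrative only.
import json
import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_entry(url: str) -> dict:
    """Fetch a page and ask an LLM to draft a knowledge-base entry."""
    html = requests.get(url, timeout=30).text
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",            # any model supporting JSON output
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Extract a title, a summary and tags "
                                          "from the page and answer in JSON."},
            {"role": "user", "content": html[:40_000]},  # crude length cap
        ],
    )
    entry = json.loads(response.choices[0].message.content)
    entry["source_url"] = url
    entry["status"] = "pending_review"         # a human validates it later
    return entry
```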

It has been a long journey, as we implemented this pipeline through trial and error - a euphemism for what was, in practice, f***ing around and finding out. It is not perfect and there are still a lot of moving parts, since models keep improving and new tools come out every day. But we have reached a stage where we are comfortable sharing some of the key building blocks to illustrate what we have learnt. This is the background to this Semi-Agentic Knowledge Base project.

We could not just replicate what we had done for our website: it relies on a headless CMS (Strapi) hosted on DigitalOcean, a Postgres database (also hosted on DO), a Next.js frontend hosted on Vercel, and separate Python scripts to interface between the CMS API and the LLMs (which we also run locally). Asking users to start four different apps to run basic feature-extraction experiments would not make sense: we needed something that was at once simpler (a one-stop shop) and more complex (virtually everything in Python).

Django was an obvious choice as it is Python-based, integrates the database, and gives some flexibility around the front-end (don't get too excited about that last point, it's a massive pain). A great source of inspiration came from this project, a Django app that scrapes and logs arXiv papers. Although we did not really use much of that code, it was a helpful repo to refer to each time we asked ourselves "how would someone who knows what they are doing think about this?" (this project was our first Django app). The end result is an app that helps you curate your knowledge base, using third-party Large Language Models to organize and classify documents in HTML or PDF format. It is dubbed "semi-agentic" because you can also edit, delete and create documents manually. The code is publicly available here.

To explain what we’ve done in Django, let’s take a step back and consider the primary motivations and principles:

  • Not all your eggs in the same basket: OpenAI is the go-to solution to extract information in a structured format, which is essential to feed a database. However, we wanted to build something that could work with other models, so that we do not depend on a single provider delivering services behind an API. The app can currently use two OpenAI models via the OpenAI Python library (any model supporting JSON outputs can be added), or any local model running on a llama.cpp server (a sketch of both back-ends follows this list).
  • Local inference: this is a corollary to the previous point, but it also addresses potential privacy concerns if you do not want to send documents to a third-party server. It does not mean that the app runs offline - if you scrape web pages, you need access to the internet - but you can control who "sees" what you are doing if that matters to you.
  • Build your own product: building everything from scratch is impossible, but trying to limit reliance on third-party libraries is an effective way to learn and to limit dependencies in your app. This is a double-edged sword though: there are very good solutions out there that you could hook together to build, much faster, an app that essentially does the same thing, and if you build everything yourself, your solution will inevitably look more "hacky". In our case, we tried to keep the number of external libraries below 10, and even the UI does not rely on Bootstrap or Tailwind. It works well but it is messy at times.
  • Python: we tried to stick to Python to the extent possible, but avoiding JavaScript altogether would have hurt the user experience. So there is a minimal amount of vanilla JavaScript, and we kept the scripts in each Django template. We are not sure this is good practice, but it avoids jumping between files to understand which element does what.
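
To illustrate the first two points, here is a rough sketch of the two inference back-ends. The /completion endpoint and the grammar field correspond to the stock llama.cpp HTTP server; the function names, model identifiers and defaults are assumptions rather than a copy of the app's code:

```python
# Sketch of the two back-ends: OpenAI with native JSON output, and a local
# llama.cpp server constrained by a GBNF grammar. Adapt names to your setup.
import requests
from openai import OpenAI

def ask_openai(prompt: str, model: str = "gpt-4-1106-preview") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},   # native JSON output
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def ask_llamacpp(prompt: str, grammar: str,
                 base_url: str = "http://localhost:8080") -> str:
    # The GBNF grammar constrains sampling so the output *should* be JSON.
    payload = {"prompt": prompt, "grammar": grammar, "n_predict": 512}
    response = requests.post(f"{base_url}/completion", json=payload, timeout=600)
    return response.json()["content"]
```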

Important warning: this is a demo app meant for experimenting with feature-extraction automation. It is fit to run on-device, but giving access to third parties would require material changes around authentication and, more generally, security features, even on a local network.

What we have learnt

  • Large Language Models handle noisy, unstructured data very well, even where humans would struggle. The larger the model, the better. You can pass HTML code with tags to large models and they will work things out. But that implies a lot of tokens, so we used models with a minimum context size of 16k.
  • Although generation speed is typically the aspect most people focus on when they talk about LLM performance, prompt processing speed is the actual friction point for our use case. To illustrate with a local model that processes input at 100 tokens per second: a 10k-token document (roughly 5k words) passed into the prompt takes over 1 minute and 40 seconds to process. This is a strong argument in favor of LLM APIs, irrespective of the output format issue.
  • OpenAI models support JSON outputs, either through function calling or by defining the output format in the API call; for this app, we did not use function calling. As we write these lines, several LLM API providers are rolling out similar features, but that was not the case when we started the project, so we relied on GBNF grammars and llama.cpp. GBNF grammars help, but they do not guarantee the output format, and model size seems correlated with the success rate: for example, Mixtral-Instruct almost never fails, whereas OpenHermes 2.5 (based on Mistral 7B) fails about 15%-20% of the time. To get around the issue, we implemented a feedback loop: when the format is incorrect (JSON decoding fails), the output is passed back to the model with an instruction to fix it (a minimal sketch follows this list). In the vast majority of cases the trick works, but it means two calls to the LLM. The second response usually comes back very quickly, but the time spent fixing the error is never zero, so choosing a smaller local model does not necessarily increase speed. There is an interesting trade-off here, and the quality of the content (the summaries in our case) must also be taken into consideration.
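
For reference, the feedback loop boils down to something like the sketch below. The ask_model callable stands for whichever back-end is in use; the function name, prompt wording and retry count are illustrative:

```python
# Sketch of the JSON feedback loop: if decoding fails, the faulty output is
# sent back to the model with an instruction to fix it.
import json

def extract_json(prompt: str, ask_model, max_retries: int = 1) -> dict:
    raw = ask_model(prompt)
    for attempt in range(max_retries + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise ValueError("Model did not return valid JSON") from err
            # Second call: show the model its own output and the decoding error.
            raw = ask_model(
                "The following text should be valid JSON but fails to parse "
                f"({err}). Return only the corrected JSON, nothing else:\n{raw}"
            )
```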

Current TODO

  • Next iteration should include embeddings for summaries, notes and categories to enable semantic search (see the sketch after this list).
  • Removing tags and unicode blocks from scraped text to prevent prompt injection.
  • Supporting other models via the OpenAI library (or other methods alongside OpenAI and llama.cpp) and combining them with other tools to force JSON outputs (e.g. pydantic-based approaches).
  • Supporting vision models to select the best image to use as a thumbnail given a summary (particularly relevant for PDF extraction, which sometimes yields random shapes).
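
For the first item, a minimal version of embedding-based semantic search could look like the sketch below. The embedding model, the in-memory ranking and the function names are assumptions, not the planned implementation; a local embedding model would work just as well:

```python
# Possible shape for the semantic-search TODO: embed summaries, then rank
# them against an embedded query by cosine similarity. Illustrative only.
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in response.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def search(query: str, summaries: list[str], top_k: int = 5) -> list[str]:
    vectors = embed(summaries + [query])
    query_vec = vectors[-1]
    scored = sorted(zip(summaries, vectors[:-1]),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [summary for summary, _ in scored[:top_k]]
```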