MicRou
2024-02-28 - Dataset, Data Visualization

Project Presentation

The MicRou dataset was put together for a RAG project in memory of Michel Rouger: the documents were part of his personal archives and include his own work as well as work produced by other authors during projects he ran.

This dataset includes approximately 850 documents in French (books, articles, minutes of debates) produced between 1998 and 2020. It covers justice and law, finance and economics, management, healthcare, education, sports, history and geopolitics... Overall, it represents between 1.5 and 2 million tokens depending on the tokenizer you use.

In many cases, the documents stem from a larger source that was broken down into parts that can be considered independently (e.g. different chapters of a book or different articles of a newsletter). It is nonetheless possible to recombine the entire source: within a "dossier", documents can be grouped by date and then, within each date group, ordered by index. Documents that do not come from a larger source have an index of 0 by default.
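
As an illustration, here is a minimal sketch of that recombination, assuming the dataset is loaded as a pandas dataframe with "dossier", "date", "index" and "text" fields (the actual field names in the published dataset may differ):

```python
# Minimal sketch: rebuild a full source from its parts.
# Field names ("dossier", "date", "index", "text") are assumptions.
import pandas as pd

df = pd.read_parquet("microu.parquet")  # hypothetical local export of the dataset

def recombine(df: pd.DataFrame, dossier: str, date: str) -> str:
    """Order the parts of a dossier/date group by index and join them back together."""
    parts = df[(df["dossier"] == dossier) & (df["date"] == date)].sort_values("index")
    return "\n\n".join(parts["text"])
```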

Two versions of the dataset have been posted on GitHub and HuggingFace:

  • microu: the main dataset
  • microu-chunked: the same documents broken down into chunks that fit within the context-window constraints of embedding models. Chunks are expected to stay under 500 tokens with most tokenizers, whilst the average chunk size is kept as large as possible. The chunking strategy is detailed below.

Processing steps

All documents are in markdown format. Original sources were either in Word or PDF format. PDF documents were first converted to Word and all Word documents were consolidated. Note: using Word as an intermediary step allowed us to quickly screen through the documents converted from PDF, check that the layout was fine and make adjustments where necessary. However, very large documents cause performance issues in Word so, in practice, we avoided exceeding 500 pages per Word document. Finally, the Word documents were converted to markdown using pandoc in order to allow programmatic text processing.
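
The conversion step can be scripted; here is a hedged sketch calling pandoc through subprocess (the folder names and pandoc options are illustrative, not the exact settings used for MicRou):

```python
# Illustrative batch conversion of consolidated Word documents to markdown.
# Folder names and pandoc options are assumptions, not the exact MicRou settings.
import subprocess
from pathlib import Path

out_dir = Path("markdown")
out_dir.mkdir(exist_ok=True)

for docx in Path("consolidated_docx").glob("*.docx"):
    md = out_dir / (docx.stem + ".md")
    subprocess.run(
        ["pandoc", str(docx), "-f", "docx", "-t", "markdown", "-o", str(md)],
        check=True,
    )
```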

Deduplication was done using an embedding model and assessing semantic similarity between documents. In hindsight, that was probably not an optimal strategy. We used BGE-m3, which is a multilingual model (thus handling French correctly) and has a context window of 8k tokens (thus avoiding chunking). With this model, we produced dense embeddings of all documents (BGE-m3 also allows ColBERT embeddings, which we did not use for this project).
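
For illustration, near-duplicates can be flagged by comparing dense embeddings pairwise; the sketch below uses the sentence-transformers loader for BGE-m3 and an arbitrary 0.9 cosine-similarity threshold, which are assumptions rather than the exact setup used here:

```python
# Hedged sketch: flag candidate duplicates with cosine similarity of dense embeddings.
# The library, threshold and document list are illustrative assumptions.
from sentence_transformers import SentenceTransformer

documents = ["Premier document...", "Deuxième document..."]  # replace with the real corpus

model = SentenceTransformer("BAAI/bge-m3")
emb = model.encode(documents, normalize_embeddings=True)  # unit-norm dense vectors

sims = emb @ emb.T  # cosine similarity, since vectors are normalized
duplicates = [
    (i, j, float(sims[i, j]))
    for i in range(len(documents))
    for j in range(i + 1, len(documents))
    if sims[i, j] > 0.9  # arbitrary threshold; flagged pairs are reviewed manually
]
```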

Results from BGE-m3 were sufficient to identify similar documents but suggested that dense embeddings of long documents in French would likely not be adequate for retrieval applications. We turned to Solon, which has a context window of 512 tokens, and this implied chunking.

The chunking strategy leveraged the existing sections of the documents as much as possible using a recursive function:

  • starting from the top, split the document into sections based on markdown headings (starting at level 1).
  • tokenize the resulting sections and check whether they fit within the maximum number of tokens.
  • if a section fits, save it as a chunk (this guarantees that a section is not split). If not, apply the same process to the section, this time splitting the text on headings one level below. The recursive function drills down into the sections each time the text does not fit.
  • if there are no headings in the text, the function looks for lists (ordered or unordered) and breaks down the text accordingly.
  • if there are no lists, the text is broken down into paragraphs and, if paragraphs are still too long, they are further broken down into sentences.
  • finally, starting from the most granular level of each subsection (bottom-up), chunks are re-aggregated with an algorithm that looks for the minimal number of groups that can be formed with the chunks whilst respecting the maximum number of tokens per chunk. Nothing too fancy: this is brute-force optimization.

The Python script used for chunking is included in the chunking folder of the GitHub repository.
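
For readers who do not want to open the repository, here is a much simplified sketch of the idea (it is not the actual script: the tokenizer is a placeholder, list handling is omitted, and the re-aggregation below is greedy rather than brute-force):

```python
# Simplified illustration of the recursive, heading-based chunking described above.
# NOT the actual script: the tokenizer is a placeholder, list handling is omitted,
# and re-aggregation is greedy instead of brute-force grouping.
import re

MAX_TOKENS = 500

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: swap in the embedding model's tokenizer.
    return len(text.split())

def split_on_headings(text: str, level: int) -> list[str]:
    # Split on markdown headings of the given level, keeping each heading with its section.
    return [p for p in re.split(rf"(?m)^(?={'#' * level} )", text) if p.strip()]

def split_paragraphs(text: str) -> list[str]:
    pieces = []
    for paragraph in text.split("\n\n"):
        if count_tokens(paragraph) <= MAX_TOKENS:
            pieces.append(paragraph)
        else:
            pieces.extend(re.split(r"(?<=[.!?])\s+", paragraph))  # fall back to sentences
    return pieces

def chunk(text: str, level: int = 1) -> list[str]:
    if count_tokens(text) <= MAX_TOKENS:
        return [text]  # a section that fits is never split
    if level <= 6:
        parts = split_on_headings(text, level)
        if len(parts) > 1:
            pieces = [c for part in parts for c in chunk(part, level + 1)]
        else:
            pieces = chunk(text, level + 1)  # no heading at this level: drill down
    else:
        pieces = split_paragraphs(text)  # no headings left
    return regroup(pieces)

def regroup(pieces: list[str]) -> list[str]:
    # Greedy bottom-up re-aggregation of consecutive pieces under the token limit.
    groups, current = [], ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}".strip() if current else piece
        if count_tokens(candidate) <= MAX_TOKENS:
            current = candidate
        else:
            if current:
                groups.append(current)
            current = piece
    if current:
        groups.append(current)
    return groups
```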

License

The dataset is currently under a restrictive license (CC-BY-NC-SA). We plan to convert it to an open license once we have finalized the review with the rights holders. Some documents may be excluded following the review, but we also plan to add others over time.

All articles in French produced by PITTI will be added to the MicRou dataset. If you want to contribute to a large, open and high-quality dataset in French, please reach out.

Next steps

We currently envisage building a web app that lets users query and navigate the dataset, and potentially generate summaries based on retrieved chunks.

In the meantime, TensorFlow's Embedding Projector is a good tool to explore the dataset. It does not really handle markdown and the reading experience could be improved for our specific use case, but the Embedding Projector is, in fact, what we used for the video below. Here is how to use it:

By default, dimensionality reduction is done with PCA, but you may find UMAP (bottom left) more user-friendly.
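
If you want to load your own embeddings of the chunks, the Embedding Projector expects tab-separated files: one TSV of vectors and one TSV of metadata (with a header row when there is more than one metadata column). A hedged sketch, assuming the embeddings and chunk metadata are already computed and using illustrative file and field names:

```python
# Hedged sketch: export embeddings and chunk metadata as TSV files for the
# Embedding Projector (projector.tensorflow.org, "Load" button). File names,
# field names and inputs are assumptions.
import numpy as np
import pandas as pd

embeddings = np.load("chunk_embeddings.npy")      # hypothetical (n_chunks, dim) array
meta = pd.read_parquet("microu-chunked.parquet")  # assumed to hold per-chunk metadata

np.savetxt("vectors.tsv", embeddings, delimiter="\t")
meta[["title", "dossier"]].to_csv("metadata.tsv", sep="\t", index=False)
```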
