Sector Classification Assistant
2023-12-03 - Artificial Intelligence, Web Development

This Sector Classification Assistant takes a company description as input and suggests the most likely sectors and sub-sectors according to the GICS® or NACE classification. This demo app was one of the tools built as part of a broader project exploring the many ways to use AI to classify companies by sector according to predetermined frameworks. Read the blog post here.

To find the relevant sub-sectors, this assistant performs a semantic search, ranking sectors with a K-Nearest Neighbors (KNN) search based on cosine similarity. This tool narrows down the options but should not be trusted to identify the correct answer on its own.
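The search described above can be sketched in a few lines of plain JavaScript: compute the cosine similarity between the query embedding and each precomputed sector embedding, then keep the k best matches. Function and field names here are illustrative, not taken from the app's source.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank sectors by similarity to the query vector and return the k best.
function topKSectors(queryVector, sectors, k) {
  return sectors
    .map((s) => ({ name: s.name, score: cosineSimilarity(queryVector, s.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Because the sector embeddings are precomputed, only the query vector needs to be produced by the model at runtime; the ranking itself is a cheap linear scan.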

This application was built using the Vite template, which provides a minimal setup to get React working in Vite with HMR and some ESLint rules. The app runs on the client side. This guarantees data privacy, but it means that performance can vary between users. Client-side inference requires downloading a model first. Models are downloaded from the HuggingFace hub; bge-small-en-v1.5 is downloaded by default. We kept the option to use another model, all-MiniLM-L6-v2, which may yield better results based on our tests; however, all-MiniLM-L6-v2 is not downloaded until the user clicks on the relevant radio button.
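The on-demand download behaviour can be sketched as a lazy cache: each model is fetched at most once, and only when it is first requested. This is an illustrative pattern, not the app's actual worker code; `getModel` and `loader` are hypothetical names.

```javascript
// Cache of in-flight or completed model loads, keyed by model name.
const modelCache = new Map();

// Load a model at most once; concurrent calls share one download because
// the promise itself is cached, not just the resolved model.
async function getModel(name, loader) {
  if (!modelCache.has(name)) {
    modelCache.set(name, loader(name));
  }
  return modelCache.get(name);
}
```

Under this pattern, selecting the all-MiniLM-L6-v2 radio button would call `getModel('all-MiniLM-L6-v2', …)` for the first time, which is what triggers its download.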

The lists of sectors for GICS and NACE are saved in JSON files. For each level 4 sector, the description was embedded using both bge-small-en-v1.5 and all-MiniLM-L6-v2, and the resulting vectors were saved in the JSON files. When a user opens the app, the classification dictionaries are loaded, so the embedding models, once downloaded from HuggingFace, are only ever used to embed the user input (the company description). The app uses the Transformers.js library. We implement our own cosine similarity algorithm.
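Embedding the user input with Transformers.js amounts to creating a feature-extraction pipeline and running the description through it. A minimal sketch, using the standard Transformers.js API (the app's actual worker code will differ, e.g. it wraps the pipeline in a class):

```javascript
import { pipeline } from '@xenova/transformers';

// Create a feature-extraction pipeline for the default model.
const extractor = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');

// Embed the company description; mean pooling and L2 normalization give a
// single sentence-level vector suitable for cosine similarity.
const output = await extractor(description, { pooling: 'mean', normalize: true });
const queryVector = Array.from(output.data);
```

The resulting `queryVector` is then compared against the precomputed sector vectors stored in GICS.json and NACE.json.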

The embedding models we use are in ONNX format, which allows fast inference in users' browsers. These versions of the models are provided by Xenova, also the author of Transformers.js. Please consider supporting them.

Because we wanted to make this demo app as lightweight as possible, we focused on smaller models. Larger models tend to be more precise, so if precision is what you need, you should consider using bge-large-en-v1.5 instead (see below).

The code for this app is available in this GitHub repository. It can be adapted to run other models and other classifications.

  • Running other versions of the models

    The app runs the quantized version of the models by default. This allows faster download and inference, but quantized models are less precise. If you wish to use the original versions of the models, you can do so by passing { quantized: false } as an optional argument when pipeline() is called to create a new instance in worker.js. Please refer to the Transformers.js documentation.

  • Running other models

    If you wish to run other models by modifying MyEmbeddingPipeline in worker.js, note that you also need to regenerate the classification files (GICS.json and NACE.json): the sector description embeddings stored there only exist for bge-small-en-v1.5 and all-MiniLM-L6-v2. This GitHub repository (in Python) provides tools to do this; just pass the name of the new model. It can also be used to embed sectors according to your own classifications.
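Both tweaks above come down to the arguments passed to pipeline() in worker.js. A hedged sketch, using the model identifiers and option documented by Transformers.js (the surrounding worker code is simplified):

```javascript
import { pipeline } from '@xenova/transformers';

// Full-precision ONNX weights instead of the default quantized ones:
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/bge-small-en-v1.5',
  { quantized: false },
);

// Or a different embedding model entirely. Remember that GICS.json and
// NACE.json must then also contain sector embeddings for that model.
const other = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
```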

The app is hosted on Vercel.

Next step: handling the browser cache to avoid downloading the models multiple times. This app does not currently use the browser cache, so models are downloaded again at each new session. Enabling the cache (currently disabled in worker.js) causes the app to crash. We are not yet sure whether this is an issue on our end or a real bug in the Transformers.js library; we will update this page once we have worked it out.
