ModernBERT-mlx
2025-01-24 - Artificial Intelligence

This project implements ModernBERT in MLX. ModernBERT is a refreshed version of the BERT models, with an 8192-token context length, significantly better downstream performance, and much faster processing speeds. MLX is Apple's tensor framework for Apple chips (Apple Silicon).

The repository makes it possible to train and run ModernBERT models on Apple devices for classification and retrieval tasks.

Inference

MLX has inherent limitations compared to PyTorch (the framework used for the original implementation by Answer.AI), so this implementation does not fully leverage ModernBERT's architectural innovations over BERT. Nonetheless, this MLX implementation can run all ModernBERT models without any degradation other than speed.

This implementation of the ModernBERT architecture in MLX for inference was first prepared in December and was, at the time, the first MLX implementation.

Inspired by HuggingFace transformers, specific pipelines were created for different use cases:

  • embeddings: simple embeddings of the inputs
  • sentence-similarity: returns the similarity matrix, using cosine similarity, between input sequences and reference sequences (a sketch of this computation follows the list)
  • sentence-transformers: same output as sentence-similarity, but with specific keys that guarantee compatibility with Sentence Transformers parameters
  • masked-lm: returns logits for all tokens in the input sequence. For now, filtering for the masked token and applying the softmax are handled outside the pipeline (see tests_maskedlm.py and the second sketch below)
  • text-classification: returns label probabilities for the sequence. The head supports regression, binary classification (untested for now) and multilabel classification; for multilabel, the config file must contain an id2label dictionary.
  • token-classification (untested for now): returns label probabilities for each token in the sequence, for named entity recognition tasks.
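
The similarity computation itself is straightforward. Here is a minimal sketch of a cosine-similarity matrix in MLX, assuming sentence embeddings have already been pooled into mx.array batches; the function name is illustrative, not the repository's actual API:

import mlx.core as mx

def cosine_similarity_matrix(inputs, references):
    # Normalize each embedding to unit length; a matrix product then
    # yields all pairwise cosine similarities at once.
    inputs = inputs / mx.linalg.norm(inputs, axis=-1, keepdims=True)
    references = references / mx.linalg.norm(references, axis=-1, keepdims=True)
    return inputs @ references.T

# Random stand-ins for pooled sentence embeddings:
queries = mx.random.normal((2, 768))  # 2 input sequences
corpus = mx.random.normal((5, 768))   # 5 reference sequences
print(cosine_similarity_matrix(queries, corpus).shape)  # (2, 5)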

See the README file for information on the models we tested and on how to run any ModernBERT model using the pipelines.
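
As for the masked-lm postprocessing mentioned above, the steps are simple enough to sketch. Assuming logits of shape (sequence length, vocabulary size) and the token ids produced by the tokenizer, one would locate the mask and apply a softmax; the names here are illustrative, see tests_maskedlm.py for the actual code:

import mlx.core as mx

def mask_probabilities(logits, input_ids, mask_token_id):
    # Index of the first [MASK] token in the sequence.
    mask_pos = mx.argmax(input_ids == mask_token_id).item()
    # Softmax over the vocabulary axis for that single position.
    return mx.softmax(logits[mask_pos], axis=-1)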

Training

A dedicated trainer in MLX was built for this project. While the feature is still considered experimental, finetuning the base ModernBERT model for text classification tasks is possible with our modernbert-mlx repository. Tests have yielded encouraging results, albeit much slower than the transformers trainer. Further optimizing the architecture and the trainer will be part of the next steps of this project.
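
For context, the trainer builds on the standard MLX training-loop shape. The sketch below shows that shape for a classification model that maps token ids to logits; it is a generic illustration, not the repository's trainer API:

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def loss_fn(model, input_ids, labels):
    logits = model(input_ids)
    return nn.losses.cross_entropy(logits, labels, reduction="mean")

def train_step(model, optimizer, input_ids, labels):
    # Differentiate the loss with respect to the model parameters.
    loss, grads = nn.value_and_grad(model, loss_fn)(model, input_ids, labels)
    optimizer.update(model, grads)
    # MLX is lazy: force evaluation of the updated parameters and state.
    mx.eval(model.parameters(), optimizer.state)
    return loss

optimizer = optim.AdamW(learning_rate=2e-5)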

New dataset-handling utils were also necessary for the training pipeline. Although the project leverages HuggingFace's Datasets library, an additional layer was needed to adapt the Dataset outputs to the new trainer.
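
In practice, that extra layer mostly tokenizes and converts batches. A hedged sketch, assuming the official answerdotai/ModernBERT-base tokenizer and an arbitrary text-classification dataset (imdb is used purely as an example):

import mlx.core as mx
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
dataset = load_dataset("imdb", split="train")

def batches(dataset, batch_size=8, max_length=512):
    for start in range(0, len(dataset), batch_size):
        rows = dataset[start : start + batch_size]
        enc = tokenizer(rows["text"], padding=True, truncation=True, max_length=max_length)
        # HuggingFace returns plain Python lists; the MLX trainer needs mx.array inputs.
        yield mx.array(enc["input_ids"]), mx.array(rows["label"])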

Next Steps

  • Continue work on training (the PoC for sequence classification is still very preliminary)
  • train classifiers and sentence-embedding models to check Model, ModelForSentenceTransformers, ModelForSequenceClassification and ModelForTokenClassification
  • clean the code and improve consistency across model classes for inference and training
  • write documentation
  • add other models that are relevant for these tasks (stella, bert, xlm-roberta...)

Inspiration

  • HuggingFace transformers, which was instrumental to this project
  • MLX Examples by Apple, the source of the utils used in this project (see the licence in the repository)
  • mlx-embeddings by Prince Canuma, whose project, supporting BERT and xlm-roberta, was more than helpful in getting this one started. I've worked on BERT and xlm-roberta in a fork of that project. The plan is to add the stella architecture there too.