ModernBERT-mlx
2025-01-24 - Artificial Intelligence

This project implements ModernBERT in MLX. ModernBERT is a refreshed version of the BERT models, with an 8192-token context length, significantly better downstream performance, and much faster processing speeds. MLX is Apple's tensor framework for Apple chips (Apple Silicon).

The repository makes it possible to train and run ModernBERT models on Apple devices for classification and retrieval tasks.

Inference

MLX has inherent limitations compared to PyTorch (the framework used for the original implementation by Answer.AI), so this implementation does not fully leverage ModernBERT's architectural innovations over BERT. Nonetheless, this MLX implementation can run all ModernBERT models without any degradation other than speed.

This implementation of the ModernBERT architecture in MLX for inference was first prepared in December and was, at the time, the first MLX implementation.

Inspired by HuggingFace transformers, specific pipelines were created for different use cases:

  • embeddings: simple embeddings of the inputs
  • sentence-similarity: returns the similarity matrix, using cosine similarity, between input sequences and reference sequences (a sketch of this computation follows the list)
  • sentence-transformers: same output as sentence-similarity, but with specific keys that guarantee compatibility with Sentence Transformers parameters
  • masked-lm: returns logits for all tokens in the input sequence. For now, filtering for the masked token and applying the softmax are handled outside the pipeline (see tests_maskedlm.py and the second sketch below)
  • text-classification: returns label probabilities for the sequence. The head supports regression, binary classification (untested for now) and multilabel classification; for multilabel, the config file must contain an id2label dictionary.
  • token-classification (untested for now): returns label probabilities for each token in the sequence, for named entity recognition tasks.
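
The similarity computation itself is straightforward. Here is a minimal sketch of a cosine-similarity matrix in MLX, assuming sentence embeddings have already been pooled into mx.array batches; the function name is illustrative, not the repository's actual API:

import mlx.core as mx

def cosine_similarity_matrix(inputs, references):
    # Normalize each embedding to unit length; a matrix product then
    # yields all pairwise cosine similarities at once.
    inputs = inputs / mx.linalg.norm(inputs, axis=-1, keepdims=True)
    references = references / mx.linalg.norm(references, axis=-1, keepdims=True)
    return inputs @ references.T

# Random stand-ins for pooled sentence embeddings:
queries = mx.random.normal((2, 768))  # 2 input sequences
corpus = mx.random.normal((5, 768))   # 5 reference sequences
print(cosine_similarity_matrix(queries, corpus).shape)  # (2, 5)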

See the README file for information on the models we tested and on how to run any ModernBERT model using the pipelines.
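
As for the masked-lm postprocessing mentioned above, the steps are simple enough to sketch. Assuming logits of shape (sequence length, vocabulary size) and the token ids produced by the tokenizer, one would locate the mask and apply a softmax; the names here are illustrative, see tests_maskedlm.py for the actual code:

import mlx.core as mx

def mask_probabilities(logits, input_ids, mask_token_id):
    # Index of the first [MASK] token in the sequence.
    mask_pos = mx.argmax(input_ids == mask_token_id).item()
    # Softmax over the vocabulary axis for that single position.
    return mx.softmax(logits[mask_pos], axis=-1)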

Training

A dedicated trainer in MLX was built for this project. While the feature is still considered experimental, finetuning the base ModernBERT model for text classification tasks is possible with our modernbert-mlx repository. Tests have yielded encouraging results, albeit much slower than the transformers trainer. Further optimizing the architecture and the trainer will be part of the next steps of this project.
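
For context, the trainer builds on the standard MLX training-loop shape. The sketch below shows that shape for a classification model that maps token ids to logits; it is a generic illustration, not the repository's trainer API:

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def loss_fn(model, input_ids, labels):
    logits = model(input_ids)
    return nn.losses.cross_entropy(logits, labels, reduction="mean")

def train_step(model, optimizer, input_ids, labels):
    # Differentiate the loss with respect to the model parameters.
    loss, grads = nn.value_and_grad(model, loss_fn)(model, input_ids, labels)
    optimizer.update(model, grads)
    # MLX is lazy: force evaluation of the updated parameters and state.
    mx.eval(model.parameters(), optimizer.state)
    return loss

optimizer = optim.AdamW(learning_rate=2e-5)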

New dataset-handling utils were also necessary for the training pipeline. Although the project leverages HuggingFace's Datasets library, an additional layer was needed to adapt the Dataset outputs to the new trainer.
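
In practice, that extra layer mostly tokenizes and converts batches. A hedged sketch, assuming the official answerdotai/ModernBERT-base tokenizer and an arbitrary text-classification dataset (imdb is used purely as an example):

import mlx.core as mx
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
dataset = load_dataset("imdb", split="train")

def batches(dataset, batch_size=8, max_length=512):
    for start in range(0, len(dataset), batch_size):
        rows = dataset[start : start + batch_size]
        enc = tokenizer(rows["text"], padding=True, truncation=True, max_length=max_length)
        # HuggingFace returns plain Python lists; the MLX trainer needs mx.array inputs.
        yield mx.array(enc["input_ids"]), mx.array(rows["label"])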

Next Steps

  • Continue work on training (the PoC for sequence classification is still very preliminary)
  • train classifiers and sentence-embedding models to check Model, ModelForSentenceTransformers, ModelForSequenceClassification and ModelForTokenClassification
  • clean the code and improve consistency across model classes for inference and training
  • write documentation
  • add other models that are relevant for these tasks (stella, bert, xlm-roberta...)

Inspiration

  • HuggingFace transformers, which was instrumental to this project
  • MLX Examples by Apple, the source of the utils used in this project (see the licence in the repository)
  • mlx-embeddings by Prince Canuma, whose project, supporting BERT and xlm-roberta, was more than helpful in getting this one started. I've worked on BERT and xlm-roberta in a fork of that project. The plan is to add the stella architecture there too.