Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
2021-12-20
Research, Information Processing | Computing, Artificial Intelligence
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot and Samson Tan provide a historical overview of open-vocabulary modeling and tokenization in NLP. They highlight the shift from word-based to subword-based approaches like byte-pair...