Robert Ritz is a data scientist and educator based in Ulaanbaatar.
A project to create a general Mongolian language model using the ULMFiT method. It enables transfer learning to produce state-of-the-art classification models for Mongolian.
Project Completed March 2021.
This project generates a language model suitable for fine-tuning and training a classifier. It uses the ULMFiT method proposed by Jeremy Howard (of Fast.ai) and Sebastian Ruder (paper). This method has three steps:

1. Train a general-domain language model on a large corpus.
2. Fine-tune the language model on data from the target task.
3. Fine-tune a classifier on the target task, on top of the language model's encoder.
All notebooks can be found in the GitHub repo here. Links for the data and completed models are below.
You can use the linked model and vocabulary below to fine-tune on your own data and then build a text classifier. Using this method, I was able to achieve state-of-the-art results on a news classification task (using the Eduge dataset).
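To make that workflow concrete, here is a minimal sketch of the fine-tune-then-classify pipeline, assuming fastai v2. The CSV path, column names, and hyperparameters are placeholders, and the pretrained files from the Drive folder are assumed to sit in `./models/`.

```python
from fastai.text.all import *
import pandas as pd

# Hypothetical dataset with 'text' and 'label' columns
df = pd.read_csv('my_texts.csv')

# Step 2 of ULMFiT: fine-tune the pretrained Mongolian LM on your own corpus.
# mn_20_news_lm.pth and mn_20_news_vocab.pkl are expected in ./models/.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm_learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    pretrained_fnames=['mn_20_news_lm', 'mn_20_news_vocab'])
lm_learn.fine_tune(4, 2e-3)
lm_learn.save_encoder('ft_encoder')

# Step 3: train a classifier on top of the fine-tuned encoder.
dls_clf = TextDataLoaders.from_df(
    df, text_col='text', label_col='label', text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(
    dls_clf, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clf_learn.load_encoder('ft_encoder')
clf_learn.fine_tune(4, 2e-3)
```

Passing `text_vocab=dls_lm.vocab` keeps the classifier's vocabulary aligned with the fine-tuned language model, which is what lets the saved encoder transfer cleanly.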
The benefits of ULMFiT (from my perspective):
Notebooks used the following versions:
Data is in the data folder. Google Drive link.
- `news`: Folder containing the original Mongolian Large News Corpus.
- `20_news.txt`: 20% sample from the Mongolian Large News Corpus.
- `40_news.txt`: 40% sample from the Mongolian Large News Corpus. Not used in the language model due to memory issues.
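If you want to work with the text samples directly, something like the following should get a corpus file into a DataFrame. This assumes one document per line, which is an assumption; adjust the parsing to the file's actual layout.

```python
import pandas as pd

# Assumption: 20_news.txt holds one document per line; adjust if the
# actual file layout differs.
with open('data/20_news.txt', encoding='utf-8') as f:
    docs = [line.strip() for line in f if line.strip()]
df = pd.DataFrame({'text': docs})
print(f'{len(df)} documents loaded')
```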
The result is a language model suitable for transfer learning on a variety of tasks. There are two folders containing completed models: one is SentencePiece encoded and the other uses the default spaCy encoder.

Pretrained language model (Google Drive Folder). The folder contains two model folders; choose the model that corresponds to the encoder you want to use (SentencePiece or spaCy). spaCy is the default encoder, so use that if you aren't sure.
- `mn_20_news_lm.pth`: Language model used to create your learner in the Fast.ai library for fine-tuning.
- `mn_20_news_vocab.pkl`: Vocabulary for the pre-trained language model. Imported alongside the model above when creating your learner.
- `spm.model`: SentencePiece model. Can be used to set up the tokenizer for your dataloader (see the sketch after this list). This appears to be optional; I was able to fine-tune without the SentencePiece tokenizer when I tried.
- `spm.vocab`: Vocabulary for the SentencePiece tokenizer.
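If you want to stay with the SentencePiece encoding end to end, a sketch like this should work under fastai v2. The `spm.model` path and the corpus CSV are placeholders; the pretrained `.pth`/`.pkl` files are again assumed to be in `./models/`.

```python
from fastai.text.all import *
import pandas as pd

df = pd.read_csv('my_texts.csv')  # hypothetical corpus with a 'text' column

# Tokenize with the provided SentencePiece model instead of the spaCy default.
sp_tok = SentencePieceTokenizer(sp_model='spm.model')
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True, tok_tfm=sp_tok)

# Load the SentencePiece-encoded pretrained weights from ./models/.
learn = language_model_learner(
    dls_lm, AWD_LSTM,
    pretrained_fnames=['mn_20_news_lm', 'mn_20_news_vocab'])
```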
Eduge classifier. Can be used for classifying news stories with the Fast.ai library (see the sketch below).

- `eduge_classifier.pkl`: Pickled model.

Since I trained the language model on a large news corpus rather than a more "general" dataset like a Wikipedia corpus, I didn't see an increase over the 93.5% accuracy on my test case, Eduge classification. It's also possible that this is the upper end of what's achievable without other augmentations or a different architecture.
The SentencePiece tokenizer (spt) is very fast, noticeably faster than the default tokenizer. It also seems that you can train a language model with the SentencePiece tokenizer and then fine-tune with the default tokenizer, although it would make more sense to use the same tokenizer throughout.
I’ll be using this pre-trained model for testing other classification tasks soon.