I did a few tests today using the latest AI models, GPT-4o and Sonnet 3.5. I collected a set of 105 Mongolian language questions and 136 social science questions from the Mongolian high school exit exam (the ЭЕШ exam) and asked each model to answer them.
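
To give a sense of the setup, here is a minimal sketch of how one could run this kind of benchmark against both APIs. The question file format, the helper names, and the exact-letter scoring are assumptions for illustration, not my actual pipeline; the real exam also includes items that need manual grading.

```python
import json
from openai import OpenAI
from anthropic import Anthropic

# Hypothetical input format: a JSON list where each item has the full
# question text (with its multiple-choice options) and the answer letter.
with open("questions.json", encoding="utf-8") as f:
    questions = json.load(f)

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "Answer the following exam question with only the letter of the correct choice.\n\n{q}"

def ask_gpt4o(question: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(q=question)}],
    )
    return resp.choices[0].message.content.strip()

def ask_sonnet(question: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=10,
        messages=[{"role": "user", "content": PROMPT.format(q=question)}],
    )
    return resp.content[0].text.strip()

# Score by exact match on the answer letter (an assumption for this sketch).
for name, ask in [("GPT-4o", ask_gpt4o), ("Claude 3.5 Sonnet", ask_sonnet)]:
    correct = sum(ask(item["question"]) == item["answer"] for item in questions)
    print(f"{name}: {correct}/{len(questions)}")
```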

The last time I tested AI models was in March 2024, when I tested GPT-4 Turbo and Claude Opus. I've added those results for comparison.

Here are the results:

AI models are now ahead of average human performance on basic question-answering tasks in Mongolian. I believe these models will surpass even high-performing humans in the very near future.

I’m amazed at the progress made in just a few months. When I first started ErdemAI, I thought it would take years for the big AI companies to support Mongolian in a meaningful way. Less than six months later, we have two very high quality models to choose from.

Personally, I prefer the Anthropic models. They feel more human and write more naturally. I’m in the process of moving ErdemAI’s systems from GPT-4o to Sonnet 3.5 today, and I don’t expect to switch back to OpenAI in the near future (but who knows).

A side note, but an important one: the ЭЕШ exam has some issues. The social science section is heavily weighted toward memorizing facts. The Mongolian language section relies on the test-taker’s ability to reason about extremely complex and dense Mongolian grammar, unlike anything you would encounter in everyday life. So while these results are impressive, it’s hard to connect them to real-world performance.

Hopefully more real-life uses of LLMs will be forthcoming in Mongolia.