LlaMa meets Cheburashka: impact of cultural background for LLM quiz reasoning

16 December 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Quiz games are a type of intellectual competition well suited to testing the reasoning and problem-solving skills of LLMs. Indeed, a good quiz puzzle requires not only factual knowledge but also the ability to analyze the clues given in the question, generate hypotheses, and choose the best one using logical reasoning and subtle hints. Modern LLMs have recently made significant progress on general reasoning tasks, which makes this kind of evaluation especially interesting. In this paper, we address a major limitation of current LLM assessment: models are usually evaluated in English, or on multilingual benchmarks that are translated from English originals and therefore reflect English-centric culture. In contrast, we test the ability of a modern LLM to handle questions from real human quiz games played in a non-English-speaking society. Namely, we apply LlaMa3-405B to quiz tasks created by the Russian-speaking "What?Where?When?" intellectual gaming community. First, we show that although the LLM demonstrates strong reasoning and linguistic proficiency in Russian, its performance drops significantly because of poor knowledge of culture-specific facts. Second, we show the importance of the choice of reasoning strategy for answering medium-difficulty questions, for which the model "possesses" the necessary knowledge but cannot give the correct answer immediately. Evaluating several single- and multi-agent approaches, we obtain a 6% improvement in overall accuracy compared to the baseline step-by-step reasoning.

Keywords

LLM
AI
ML
Reasoning
