Abstract
Quiz games are a type of intellectual competition well suited for testing LLMs' reasoning and problem-solving skills. Indeed, a good quiz puzzle requires not only factual knowledge, but also the ability to analyze the clues given in the question, generate hypotheses, and choose the best one using logical reasoning and subtle hints. Recently, modern LLMs have made significant progress in general reasoning tasks, making this kind of evaluation especially interesting. In this paper, we address a major limitation of current LLM assessment: models are usually evaluated on English-language data or on multi-lingual benchmarks that reflect English-centric culture, obtained by translation from English originals. In contrast, we test the ability of a modern LLM to handle questions from real human quiz games created by a non-English-speaking community. Namely, we apply LlaMa3-405B to the quiz tasks created by the Russian-speaking "What?Where?When?" intellectual gaming community. First, we show that although the LLM demonstrates strong reasoning and linguistic proficiency in Russian, its performance diminishes significantly because of poor knowledge of culture-specific facts. Second, we show the importance of the choice of reasoning strategy for answering medium-difficulty questions, for which the model "possesses" the necessary knowledge but cannot produce the correct answer immediately. Evaluating several single- and multi-agent approaches, we obtain a 6\% improvement in overall accuracy compared to the baseline step-by-step reasoning.