Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Grant id: 

Large, pre-trained neural language models (LMs) that can effectively exploit enormous amounts of unlabeled textual data have recently transformed the whole field of Natural Language Processing (NLP).

They are trained on sequences of tokens sampled from textual data, and during the training process they are exposed to terabytes of data. In this project, we focus on autoregressive language models. During pre-training, these models are tasked with predicting the next token x_i given the previous tokens x_0, x_1, ..., x_{i-1}.
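As a minimal illustration of this objective (the function and the token ids below are ours, not from any real tokenizer or training pipeline), a token sequence can be unrolled into (context, target) pairs, each asking the model to predict x_i from x_0, ..., x_{i-1}:

```python
def next_token_examples(tokens):
    """Unroll a token sequence into next-token prediction examples.

    Each pair (context, target) corresponds to predicting token x_i
    given the preceding tokens x_0, ..., x_{i-1}.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# e.g. the (illustrative) sequence [5, 7, 2] yields two training examples:
# predict 7 from [5], and predict 2 from [5, 7].
```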

It was observed that, thanks to the variety of textual data seen during training (books, news articles, scientific papers, etc.), these models can perform a variety of NLP tasks when primed with only a handful of samples - no training in the classical sense (updating model weights) is required. For example, assume that we have access to a set of sentence pairs {s_i, t_i}, with s_i being English sentences and t_i their French translations. When prompted with the sequence "s_1 in French means t_1 \n s_2 in French means t_2 \n ... s_k in French means ", the models are capable of producing (in an autoregressive manner) the correct French translation of the English sentence s_k. It was shown that other classical textual tasks, such as Summarization (The summary of {} is {}) or Question Answering (Question: {} Answer: {}), can also be solved with the correct prompt. Surprisingly, it has been reported that with the correct prompt, the results are competitive with those of models trained in a supervised manner on labeled data.
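The prompt format described above can be sketched as a simple string-building step (a minimal sketch; the function name and the tiny word pairs are our own illustrative choices):

```python
def few_shot_prompt(pairs, query):
    """Build an in-context translation prompt from (source, target) pairs.

    The model is expected to continue the final, incomplete line with
    the translation of `query`.
    """
    shots = "\n".join(f"{s} in French means {t}" for s, t in pairs)
    return f"{shots}\n{query} in French means "


# Illustrative demonstrations; a real experiment would use full sentences.
prompt = few_shot_prompt([("cat", "chat"), ("dog", "chien")], "house")
```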

It was shown that, given the correct prompt, LMs can also perform basic numerical reasoning. When prompted with a sequence of simple additions, they are able to predict the correct value: e.g., given "How much is 1+2? Answer: 3 \n How much is 4+5? Answer: 9 ... How much is 6+7? Answer: ", the model is able to predict the correct value of 13.
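The arithmetic prompts used in this setting follow the same few-shot template; a minimal sketch of their construction (function name and example values are ours):

```python
def arithmetic_prompt(examples, query):
    """Build a few-shot arithmetic prompt from (expression, result) pairs.

    The model is expected to continue the final, incomplete line with
    the value of `query`.
    """
    shots = "\n".join(f"How much is {q}? Answer: {a}" for q, a in examples)
    return f"{shots}\nHow much is {query}? Answer: "


prompt = arithmetic_prompt([("1+2", 3), ("4+5", 9)], "6+7")
```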

Our project is inspired by the recently published paper "Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity" by Lu et al. (2022). In this paper, the authors show that the order in which the samples appear in the prompt can make the difference between a random guess and near state-of-the-art performance. In their experiments, they focus on classification tasks, such as sentiment classification or textual entailment.

The question we ask is whether this phenomenon also applies to mathematical expressions, i.e., whether arithmetic operations in the space of language model prompts retain basic properties such as commutativity. One would expect a system capable of numerical reasoning to behave the same when prompted with “1+2+3” as with “2+3+1”. In addition, we plan to conduct a detailed analysis of the failed cases to determine their cause.
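One way to probe commutativity is to enumerate all operand orderings of an addition and query the model with each variant; a sketch of the enumeration step (the actual experiment would then compare the model's predictions across variants):

```python
from itertools import permutations


def operand_permutations(expr):
    """Return all distinct operand orderings of an addition expression.

    e.g. "1+2+3" yields "1+2+3", "1+3+2", "2+1+3", "2+3+1", "3+1+2",
    "3+2+1". For a commutative-by-prompt model, all variants should
    elicit the same predicted value.
    """
    operands = expr.split("+")
    return sorted({"+".join(p) for p in permutations(operands)})
```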



Mateusz Krubiński (2023): Basic Arithmetic Properties in the Space of Language Model Prompts. In: The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS'23, New Orleans, USA.