MULTI-VIEW RANKING: TASKING TRANSFORMERS TO GENERATE AND VALIDATE SOLUTIONS TO MATH WORD PROBLEMS

Rifumo Mzimba (1619542)
School of Computer Science & Applied Mathematics, University of the Witwatersrand

Supervised by Prof. Richard Klein and Prof. Benjamin Rosman

October 29, 2023

A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, South Africa, in fulfillment of the requirements for the degree of Master of Science.

Abstract

The recent developments and success of the Transformer model have resulted in the creation of massive language models that have led to significant improvements in the comprehension of natural language. When fine-tuned for downstream natural language processing tasks with limited data, they achieve state-of-the-art performance. However, these robust models lack the ability to reason mathematically. It has been demonstrated that, when fine-tuned on the small-scale Math Word Problem (MWP) benchmark datasets, these models are not able to generalize. Therefore, to overcome this limitation, this study proposes to augment the generative objective used in the MWP task with complementary objectives that can assist the model in reasoning more deeply about the MWP task. Specifically, we propose a multi-view generation objective that allows the model to understand the generative task as an abstract syntax tree traversal, beyond the sequential generation task. In addition, we propose a complementary verification objective to enable the model to develop heuristics that can distinguish between correct and incorrect solutions. These two goals comprise our multi-view ranking (MVR) framework, in which the model is tasked to generate the prefix, infix, and postfix traversals for a given MWP, and then use the verification task to rank the generated expressions. Our experiments show that the verification objective is more effective at choosing the best expression than the widely used beam search. We further show that when our two objectives are used in conjunction, they can effectively guide our model to learn robust heuristics for the MWP task. In particular, we achieve an absolute percentage improvement of 9.7% over our baseline and 5.3% over the state-of-the-art models on the SVAMP dataset. Our source code can be found at https://github.com/ProxJ/msc-final.

Declaration

I, Rifumo Mzimba, hereby declare the contents of this dissertation to be my own work. It is submitted for the degree of Master of Science in Computer Science at the University of the Witwatersrand. This work has not been submitted to any other university or for any other degree.

Signature                Date

Acknowledgements

Completing this dissertation has been one of the most challenging and rewarding experiences of my academic career. To everyone who helped me along the way, I want to express my sincere gratitude and appreciation.

To begin, I'd like to express my gratitude to my advisors, Profs. Richard Klein and Benjamin Rosman, for their expertise, patience, and direction throughout my research. Their suggestions and criticisms helped me shape and improve my ideas, and their unflinching encouragement kept me going.

I would also like to thank my colleagues and friends who were there for me when I really needed their encouragement. A particular shout out to my professional colleagues who took on extra responsibilities to allow me to focus on writing this dissertation.
I owe a great debt of gratitude to my family for their unending love, encouragement, and patience throughout this process. I couldn't have gotten through this period without their unwavering love, encouragement, and support.

I would also like to extend my gratitude to the rest of the academic staff, who probed interesting questions and made interesting suggestions, the MSL team for providing computational resources, and the sponsors who funded my studies.

I would like to acknowledge the difficulties I faced during the research process. With ambitious plans for radical advances, many ideas were tried but proved fruitless. This experience has given me a newfound respect for academic research and taught me the value of tenacity, persistence, and asking for assistance when one is stuck.

As a final note, I'd like to express my appreciation to the academic community at large for the work they've done in the field. I have been able to shape my research and hone my arguments thanks to the plethora of literature and research I have had at my disposal.

In conclusion, I am immensely grateful to everyone who helped me along the way. Your support, encouragement, and insights have been invaluable, and I am truly humbled by your generosity and kindness. To God be the glory for giving me the fortitude, patience, and perseverance to finish this study.

Contents

Preface
  Abstract
  Declaration
  Acknowledgements
  Table of Contents
  List of Figures
  List of Tables
1 Introduction
2 Objective
  2.1 Research Objective
  2.2 Research Hypothesis
  2.3 Research Questions
    2.3.1 Research Questions for the Multi-View Generation Task
    2.3.2 Research Questions for the Verification Task
3 Background and Related Work
  3.1 Introduction
  3.2 Transformers
    3.2.1 Background (Recurrent Neural Networks)
    3.2.2 Architecture Overview
    3.2.3 Input Embedding
    3.2.4 Positional Encoding
    3.2.5 Attention Mechanism
    3.2.6 Position-wise Feedforward Neural Network (FNN)
    3.2.7 LayerNorm and Residual Connections
    3.2.8 Final layer
  3.3 Metric Learning
    3.3.1 Distance Metric
    3.3.2 Losses
    3.3.3 Miners
  3.4 Pooling
    3.4.1 Single Token pooling
    3.4.2 Max-pooling
    3.4.3 Average Pooling
    3.4.4 Attentive Pooling
    3.4.5 Mixed Pooling
  3.5 Natural Language Processing
    3.5.1 Tokenization
    3.5.2 Sequential Generation Task
  3.6 Learning Paradigms
    3.6.1 Transfer Learning
    3.6.2 Multi-task Learning
    3.6.3 Multi-view Learning
    3.6.4 Multi-Model Learning
  3.7 Math Word Problems
    3.7.1 Mathematics Laws
    3.7.2 Math Notations
  3.8 Related Work
    3.8.1 Encoder Architectures
    3.8.2 Decoder Architectures
  3.9 Conclusion
4 Data Representation and Multi-view Learning
  4.1 Introduction
  4.2 Related Work
    4.2.1 Mathematics Expression Notation
    4.2.2 Conditional Generation
    4.2.3 Quantity Masking
    4.2.4 Chain of Thought
    4.2.5 Generation Consistency
  4.3 Methodology
    4.3.1 Problem Statement
    4.3.2 Model Architecture
    4.3.3 Quantity Masking
    4.3.4 Solution Notation
    4.3.5 Generation Consistency
    4.3.6 Training
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Baseline
    4.4.3 Transformers
    4.4.4 Training Details
    4.4.5 Evaluation Metrics
    4.4.6 Overall Results
    4.4.7 Expression Notation
    4.4.8 Multi-view Notation
  4.5 Conclusion and Future Work
    4.5.1 Future Direction
5 Multi-tasking Multi-view Decoders
  5.1 Introduction
  5.2 Related Work
    5.2.1 Multi-tasking Models
    5.2.2 Multi-Decoder Models
    5.2.3 Verifiers
  5.3 Methodology
    5.3.1 Multi-view Generation
    5.3.2 Verification
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Baseline
    5.4.3 Experimental Setup
    5.4.4 Overall Results
    5.4.5 Ablation
    5.4.6 Multi-view and Verification
    5.4.7 Qualitative Analysis
  5.5 Conclusion
    5.5.1 Future Work
6 Conclusion
References

List of Figures
1.1 Many-to-many mapping between questions and solutions.
1.2 Impact of minor changes on questions.
3.1 Transformer architecture.
3.2 Transformer - multi-head attention architecture
3.3 An example of expression trees
4.1 An example of the proposed masking technique.
4.2 Examples of different quantity masking techniques.
4.3 The proposed token-level view-consistency strategy.
4.4 The proposed multi-view BART model.
4.5 Divergence of context representations during training
5.1 The proposed multi-view ranking (MVR) model.
5.2 Examples of expression tree templates

List of Tables
4.1 Notations and quantity masks used to solve math word problems.
4.2 Dataset statistics.
4.3 Overall results of the multi-view model compared to baseline models.
4.4 Performance of each view in the multi-view setup.
4.5 Impact of using a different number of views and view combinations.
5.1 Overall results of our Multi-view Ranker (MVR).
5.2 Ablation results for different model setups.
5.3 Model performance under different learning objectives.
5.4 Qualitative analysis of the models
Chapter 1
Introduction

Solving mathematical word problems is one of the fundamental math concepts taught in elementary school. Math Word Problems (MWPs) are mathematical questions that are primarily expressed in everyday language rather than using mathematical symbols. They frequently contain a scenario and a query related to the mathematical relationship between the quantities or objects in the scenario. Solving them requires cognition, language comprehension, working memory, and attention [Lin 2021; Verschaffel et al. 2020; Kula 2007].

MWPs are an interesting benchmark for testing the reasoning ability of language models, similar to how humans use them [Scheibling-Sève et al. 2020]. In particular, MWPs can prove useful in testing pre-trained language models that have demonstrated language comprehension similar to that of humans [Scheibling-Sève et al. 2020; Cobbe et al. 2021; Wei et al. 2022; He et al. 2020b].

The resolution of MWPs is a significant step towards the development of general intelligence [Fletcher 1985; Wei et al. 2022]. Furthermore, these models can be incorporated into virtual assistants, search engines, and Computer Algebra Systems to make these systems more robust and to help students develop problem-solving skills [Scheibling-Sève et al. 2020; Morton and Qu 2013; Jeschke and Richter 2007; Kapralov et al. 2020; Zhang et al. 2020a].

The ability of language models to solve symbolic math problems has been shown by numerous researchers, demonstrating their aptitude for mathematics [Lample and Charton 2019; Saxton et al. 2019; Noorbakhsh et al. 2021]. Symbolic math problems involve manipulating mathematical expressions using symbolic notation to find a solution. These are often used in advanced mathematics such as calculus, algebra, and differential equations. While MWPs require simpler arithmetic operations, they remain an open problem [Patel et al. 2021; Zhang et al. 2020a]. This indicates that the challenge MWPs present is textual comprehension rather than purely mathematical [Morton and Qu 2013].

Most NLP tasks require semantic comprehension; however, MWPs also need mathematical logic to arrive at the final solution. The mapping between the question and solution spaces is many-to-many and unrestricted. Questions that have different semantic meanings can have the same solution. Due to the laws of distribution and associativity in mathematical expressions, a single question can have multiple correct solutions. Figure 1.1 shows examples with significant semantic and structural differences that nevertheless have the same solution. The model must learn a feature space robust enough to be invariant to these semantic distinctions. It is also possible that a minor change in the problem statement could result in a completely different solution. For example, in Figure 1.2 (A), we can see that changing only one word, replacing "green" in the question with "red", leads to a different solution. However, substituting the word "red" with any color other than "green" will lead to the same solution. Similarly, a minor change in the solution expression may or may not yield a different answer. For example, swapping operands is permissible during addition and multiplication; however, it results in errors during subtraction and division [Sundaram et al. 2022].

Figure 1.1: Many-to-many mapping between questions and solutions. Questions have different contexts and semantics but the same ground truth solution. Questions extracted from SVAMP [Patel et al. 2021].

Figure 1.2: A minor change in the question can lead to a completely different solution. Questions extracted from SVAMP [Patel et al. 2021].

Given the challenges and utility of MWPs, efforts to automatically solve them date back to the 1960s [Bobrow 1964; Charniak 1969; Feigenbaum and Feldman 1963]. Manually crafted rules were developed to match the MWPs with the corresponding mathematical expression that resolves to the answer [Bakman 2007; Mukherjee and Garain 2008; Yuhui et al. 2010]. For these strategies to be generalizable, the crafted rules must be exhaustive enough to cover every possible MWP in an unrestricted language. This is not feasible in the real world, limiting the number of applications for such a system [Li et al. 2022a; Sundaram et al. 2022].

Subsequent research adopted statistical and machine learning techniques to automatically learn the rules relating an MWP to its solution [Kwiatkowski et al. 2013; Goldwasser and Roth 2014]. In this setup, an MWP dataset is used to teach a model the relationships between the query values and the background information of the math problems. The model generates the relations as mathematical equations that answer the query [Zhang et al. 2020a]. Although this configuration is an improvement over the initial strategy, the models can only solve a small number of problems that are highly similar to the training data.

The limitation of the statistical methods has led to the development of deep learning methods to solve MWPs [Wang et al. 2017]. Deep learning is capable of solving several mathematical reasoning challenges [Lample and Charton 2019; Saxton et al. 2019; Henighan et al. 2020; Charton et al. 2021]. However, the datasets used are relatively large in comparison to the datasets in the MWP domain. The cost of quality annotated MWP data is greater than that of synthetically generated symbolic mathematics [Upadhyay et al. 2016; Lample and Charton 2019; Saxton et al. 2019; Hendrycks et al. 2021; Patel et al. 2021].

To deal with limited data that have complex relations to be learned, transfer learning has been successfully applied in natural language processing (NLP) to achieve state-of-the-art (SOTA) performance on small datasets [Dai and Le 2015; Devlin et al. 2018; Lewis et al. 2019; Malte and Ratadiya 2019].
This learning paradigm has been shown to outperform models that are only trained directly on the MWP datasets [Sundaram et al. 2022; Lan et al. 2022]. While using these models, or parts thereof, increases performance, the models learn shallow heuristics that are sensitive to minor changes in the math problems [Patel et al. 2021; Shen et al. 2021].

A big leap forward in the application of transfer learning in NLP came with the Transformer architecture [Vaswani et al. 2017]. This allowed researchers to pre-train language models on the large datasets available. These models can further be fine-tuned on smaller datasets that exhibit similar properties to the pre-training objective. Particularly in the MWP domain, pre-trained language models (PLMs) have been shown to outperform models that are only trained directly on the MWP datasets [Sundaram et al. 2022; Lan et al. 2022; Patel et al. 2021]. Although PLMs elicit better performance, these models learn shallow heuristics that are sensitive to small changes in math problems [Patel et al. 2021].

Several strategies have been proposed to make deep learning models learn MWPs more effectively. In particular, the sequential encoding and decoding in pre-trained Transformers struggle to capture the structural information that MWPs exhibit. Graph encoders [Li et al. 2020; Zhang et al. 2020c], along with hierarchical tree decoders [Xie and Sun 2019] and bottom-up Directed Acyclic Graphs (DAGs) [Cao et al. 2021; Jie et al. 2022; Li et al. 2022a], introduce architectural modifications to incorporate the structural information originating from MWPs. The changes made to the encoder/decoder architecture of the PLMs enforce that the model learns the mathematical relationships between the quantities in a structural manner. This is something that the traditional sequential architecture does not explicitly require. The architectural modifications, however, are slower to train and too specialized for mathematical problems. The goal of this research is to see if the sequence-to-sequence Transformer can learn deeper heuristics and structural information by using learning objectives rather than specialized architectural changes.

Specifically, we propose a multi-view generation objective to learn structural information and a verification objective to learn deeper heuristics. Our multi-view generation objective is to learn the prefix, infix, and postfix notations of the ground-truth expression tree of a given MWP. We show that a model trained to generate multiple views of the abstract syntax tree (AST) demonstrates deeper heuristics than a model that is trained to generate only one view. Furthermore, the ability to generate multiple views that evaluate to the same solution shows that the model has some understanding of the structural properties of mathematical expressions. The multi-view objective has been proposed in the literature [Zhang et al. 2022a; Shen and Jin 2020]. These proposals use two views and different architectures for the decoders. In contrast to these, we use three views and sequential pre-trained decoders. Our verification objective, inspired by Shen et al. [2021], trains our model to mark MWP solutions as correct or incorrect. We show that a model that can generate and validate expressions has a deeper understanding of the MWP task and does not rely on shallow heuristics that generate tokens that could potentially yield the correct solutions without fully understanding the MWP.
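To make the three views concrete, the sketch below (our own illustration, not the dissertation's released code) builds the expression tree for ( number0 + number1 ) * number2 and emits its prefix, infix, and postfix traversals; the Node class and masked quantity names are assumptions used only for this example.

```python
# Minimal sketch (not the dissertation's implementation): one expression tree, three views.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def prefix(node):
    if node is None:
        return []
    return [node.value] + prefix(node.left) + prefix(node.right)

def infix(node):
    if node is None:
        return []
    if node.left is None and node.right is None:
        return [node.value]
    # Parenthesize every sub-expression to keep the traversal unambiguous.
    return ["("] + infix(node.left) + [node.value] + infix(node.right) + [")"]

def postfix(node):
    if node is None:
        return []
    return postfix(node.left) + postfix(node.right) + [node.value]

# ( number0 + number1 ) * number2
tree = Node("*", Node("+", Node("number0"), Node("number1")), Node("number2"))
print(" ".join(prefix(tree)))   # * + number0 number1 number2
print(" ".join(infix(tree)))    # ( ( number0 + number1 ) * number2 )
print(" ".join(postfix(tree)))  # number0 number1 + number2 *
```

All three strings describe the same tree and evaluate to the same value, which is exactly the structural consistency the multi-view objective asks the model to learn.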
With our multi-view generation objective and a verification objective, we propose a new multi-view ranking (MVR) framework, where we not only treat these two objectives as auxiliary tasks but further consider their output to choose the final solution. In particular, given an MWP, our multi-view generation task generates the prefix, infix, and postfix views of the AST solution, and we use the correctness score from the verification task to choose the best view.

We conducted several experiments throughout to show the effectiveness of our strategies. We implement a baseline BART model which achieves an accuracy of 40.6% on the SVAMP [Patel et al. 2021] dataset, while the SOTA accuracy is at 45.0%. Our implementation of the multi-view objective improves the baseline's generation accuracy by 2.3%, while the verification task improves it by 6.0%. Furthermore, we run experiments that show that using these two objectives together to generate the final solution results in an overall accuracy of 50.3%, an accuracy improvement of 9.7% over our baseline and 5.3% over the current SOTA results for SVAMP.

In summary, the contributions of this work are as follows:

• We propose a multi-view generation objective, where our model has to generate the prefix, infix, and postfix notations of the solution to MWPs. This objective exposes the model's misunderstanding of the MWP, which would otherwise be concealed by a single-view generation objective.

• We also propose refinements to the verification objective proposed by Shen et al. [2021], rephrasing it as a metric learning task rather than a classification task. This objective determines whether the model can distinguish between correct and incorrect solutions by embedding the correct solutions closer to the MWP than the incorrect solutions.

• We also propose a multi-view ranking framework that uses the proposed multi-view generation task to generate multiple candidate solutions and the verification task to choose the best candidate solution. This framework enables the model to produce a variety of potential solutions, which the verification task evaluates before choosing the best one.

• We conduct experiments that show that our proposed model is capable of correctly answering a greater number of MWPs than previous models that were considered to be SOTA.

Having outlined our main contributions, we give a brief overview of the structure of this work. Chapter 2 is dedicated to an in-depth discussion of the research objectives. This chapter unfolds as follows: Section 2.1 offers a comprehensive exploration of the research objectives, elucidating the core goals and intentions of this study. Following this, Section 2.2 formulates and presents the research hypothesis, laying the groundwork for the central thesis that this research endeavors to test and validate. Subsequently, in Section 2.3, we systematically outline the research questions that serve as the investigative pillars of this work. These questions provide a strategic framework for our inquiry, guiding us in our quest to unravel the complexities and nuances of our chosen subject matter.

Chapter 3 lays the theoretical foundations of our work. This chapter starts with a brief introduction in Section 3.1. We then discuss the Transformer architecture used for all of our experiments in Section 3.2. Sections 3.3 and 3.4 discuss metric learning and the pooling mechanism, respectively. These concepts are useful for our verification objective.
Thereafter, we detail NLP concepts and learning paradigms applied in our research in Sections 3.5 and 3.6, respectively. Section 3.7 briefly discusses MWPs, expression notations, and expression properties. Next, Section 3.8 relates the Transformer architecture to other architectures proposed in the literature. The background information is then summarized in Section 3.9.

We present our proposed multi-view and verification objectives in Chapters 4 and 5, respectively. In particular, Chapters 4 and 5 are structured to be as independent and self-contained as possible. These two chapters follow a similar structure: they start with an introduction and motivation section (Sections 4.1 and 5.1), followed by their respective related work (Sections 4.2 and 5.2); the methodology of each chapter then follows (Sections 4.3 and 5.3); the experiments and results are then discussed (Sections 4.4 and 5.4); and a conclusion section summarizes the chapter and discusses future work (Sections 4.5 and 5.5).

In particular, Chapter 4 aims to explore how data representation impacts the performance of Transformers in solving MWPs. It focuses on quantity masking and the expression notation used in the MWP domain. We propose two major changes: a random indicator mask and generating three notations/views instead of one. We find that the multi-view notation setup requires a verifier to realize its full potential. Chapter 5 sets out to investigate the verification objective and to integrate it with the multi-view objective, which results in our multi-view ranker (MVR) model.

Having presented ablation studies on the data representations, and our proposed model along with its benchmark performance, we conclude our work in Chapter 6. Furthermore, we outline the direction for future work based on ours.

Chapter 2
Objective

2.1 Research Objective

The primary aim of this research is to enrich our comprehension of the strategies employed to bolster the performance of pre-trained Transformers in the realm of logical tasks, with a specific focus on solving Math Word Problems (MWPs). The overarching objective is to augment the model's problem-solving abilities by introducing auxiliary learning tasks, thereby enhancing the heuristics acquired, while refraining from the necessity of devising specialized neural architectures. In pursuit of this objective, this study introduces two essential learning tasks: the multi-view generation task and the verification task.

The multi-view generation task is designed to enable the model to comprehend the abstract syntax tree (AST) underlying MWPs. This objective requires the model to perceive the solution not through a single sequential lens, but rather through three distinct views, which mirror the essential components of an AST. The model's ability to generate these three views not only indicates its understanding of the structural relationships within the problem but also its capacity to tackle MWPs with a more structural and semantic approach. Simultaneously, the verification task is introduced to help the model discern correct solutions from incorrect ones. This auxiliary task equips the model with the ability to identify multiple correct solutions to MWPs, which are rooted in the underlying mathematical principles governing the expressions.
2.2 Research Hypothesis

Our research is underpinned by the following hypothesis: the performance of the pre-trained BART model in solving MWPs can be significantly improved by integrating a multi-view generation and a verification objective. Moreover, these auxiliary tasks can be instrumental in generating the final solution, where the verification task serves as a mechanism for selecting the most suitable solution from the multiple generated views.

2.3 Research Questions

2.3.1 Research Questions for the Multi-View Generation Task

The core research questions addressed in Chapter 4 revolve around the multi-view generation task; these include:

• Which of the quantity masking techniques available in the existing literature proves most effective when applied to the pre-trained BART model?

• To what extent can the BART model autonomously generate the prefix, infix, and postfix views, both individually and collectively, and how does its performance compare in these scenarios?

• Is it more advantageous for the BART model to generate all three views in unison, or is it more beneficial to condition the model to generate a specific view based on a given prompt?

• Does training the BART model to generate consistent state vectors for terms in mathematical expressions that correspond to the same nodes in the AST result in improved performance?

2.3.2 Research Questions for the Verification Task

In Chapter 5, the research focus shifts to the verification task, and the following questions are explored:

• How does the verification objective perform as an auxiliary task when compared to the baseline BART model and the multi-view objective? Additionally, what is the combined performance of these two objectives?

• Can the verification task be harnessed to re-rank expressions generated through beam search by the BART model based on their correctness, thereby enhancing overall accuracy? How does this strategy compare to a voting mechanism that selects the most consistent solution from the beams?

• Does the treatment of the verification objective under a metric learning paradigm yield superior results compared to a classification paradigm in terms of performance and accuracy?

By addressing these research questions, this dissertation aims to advance the field's understanding of leveraging auxiliary tasks to improve the performance of pre-trained Transformers on logical tasks, with a specific focus on Math Word Problems.

Chapter 3
Background and Related Work

3.1 Introduction

In this chapter, we introduce the theoretical foundation for this thesis. We start by discussing the Transformer architecture as an advancement over the traditional sequence-to-sequence architecture in Section 3.2. This forms the basis of our model. Section 3.3 discusses metric learning and how it can be used to rank and learn better representations. This research uses metric learning for verification and to enforce consistency in the multi-view generation objective. Before we can pass the representations to the verifier, we need to pool from the model representations. Strategies that can be used for this are discussed in Section 3.4. Section 3.5 discusses key concepts of Natural Language Processing (NLP), including tokenization, greedy search, and beam search. Section 3.6 briefly discusses learning paradigms and how they can improve the model's ability to learn the task at hand. These paradigms include transfer learning, multi-task learning, multi-view learning, and multi-model learning.
Section 3.8 gives an overview of the various architectures found in the literature that solve MWPs. We discuss these architectures and learning objectives in relation to the Transformer. In Section 3.9 we briefly summarize the theoretical background presented in this chapter.

3.2 Transformers

The Transformer is an attention-based neural architecture designed to address sequence-to-sequence modeling tasks while also dealing with long-range dependencies. It is the current state-of-the-art technique in the fields of Natural Language Processing and Computer Vision [Lin et al. 2022; Khan et al. 2022; Dosovitskiy et al. 2020; Carion et al. 2020].

This section starts, in Section 3.2.1, with a discussion of the recurrent structures that were used to process sequential data before the introduction of the Transformer architecture. After that, we provide an architectural summary of the Transformer in Section 3.2.2. A detailed architectural breakdown of the critical parts related to our work is given in the subsections that follow. In particular, Section 3.2.3 and Section 3.2.4 discuss the input and positional embeddings, respectively. Thereafter, Section 3.2.5 discusses the attention mechanism used in Transformers.

3.2.1 Background (Recurrent Neural Networks)

Before the introduction of the Transformer, most state-of-the-art NLP solutions were based on gated recurrent neural networks (RNNs), such as Long Short-Term Memory networks (LSTMs) [Gers et al. 2000; Wen et al. 2015; Chen et al. 2016; Wang and Jiang 2015] and gated recurrent units (GRUs) [Cho et al. 2014; Chung et al. 2014; Kumar et al. 2016; Zhang et al. 2018]. Vanilla RNNs for sequence-to-sequence modeling encode the entire input into one context vector that the decoder uses to generate the desired output. This restricts the amount of information the decoder can use and, as a result, the performance of the RNNs. An attention mechanism was introduced to allow the decoder to create a context vector by choosing which encoder states to attend to at every time step [Wang et al. 2016b; Liu et al. 2017; Irie et al. 2016]. This improved the accuracy of RNNs, but not their training speed.

Transformers were introduced to improve the speed of sequential data modeling. They replace all the recurrent structures in RNNs with attention mechanisms. Vaswani et al. [2017] showed that, provided with enough training data, attention mechanisms alone can match the performance of RNNs with attention. This eliminated the need to sequentially process data during training and allowed researchers to efficiently use parallel computing architectures. As such, larger models can be trained on the large available datasets at a lower time complexity.

Sequential Processing

Feedforward Neural Networks (FNNs) are the first and simplest type of artificial neural network devised [Haykin 1998; Zell 1994; Schmidhuber 2015], where information flows in one direction only. However, natural language presents the challenge of modeling sequential data of uneven lengths. To resolve this, recurrent neural networks (RNNs) were introduced. These architectures use a feedback loop to enable data to be recirculated into the input before being sent again for additional processing and final output. The feedback loop allows RNNs to maintain a context vector that contains information on the data seen before the current step.
At step n, the model combines the context state representing the sequence up to step n − 1 with the input at step n to create a new context state that represents the sequence up to the current step n.

Challenges with RNNs

Theoretically, the information from each step can propagate arbitrarily far down the sequence if at every point the state continues to encode contextual information about the step. In practice, this mechanism is flawed: the vanishing gradient problem leaves the model's state at the end of a long sentence without precise, learnable information about the preceding sequence. The influence of the earlier steps on the later steps decreases as the sequence length increases. This causes vanilla RNNs to suffer from short-term memory.

Gated RNNs

A common way to deal with gradient instabilities is through gates. The gates control how the context state is updated at each step. Each step can either add to, delete from, or leave the context vector unchanged. This reduces the repeated multiplications caused by the feedback loop. As a result, gated RNNs can model longer-term dependencies. Two commonly used gated RNNs are the Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997; Gers et al. 2000] and the Gated Recurrent Unit (GRU) [Cho et al. 2014; Chung et al. 2014].

Unresolved challenges of gated RNNs

Gated RNNs achieved state-of-the-art results before the introduction of Transformers [Salehinejad et al. 2017]. However, they have several drawbacks. The computation of each step depends on the result of the computation of the previous step. This makes training RNNs inefficient, as they cannot be parallelized on modern deep-learning hardware. The other challenge with RNNs is that future input information cannot be reached from the current state. To resolve this, two RNNs are chained in opposite directions to form a bi-directional RNN [Bahdanau et al. 2014]. However, these RNNs are independent, which restricts the contextual information encoded for each step. The Transformer architecture addresses these challenges by replacing the recurrences in these networks with attention.

Attention

The attention mechanism allows a model to create a context vector by drawing from all states at any point along the sequence. The attention layer has access to all previous states and can weigh them using a learned relevance metric to provide pertinent information about distant tokens. Attention has been used with Recurrent Neural Networks (RNNs) and has been shown to improve performance [Wang et al. 2016b; Luong et al. 2015]. The development of the Transformer architecture showed that attention mechanisms could function effectively on their own, eliminating the need for sequential recurrent data processing to produce the performance seen in RNNs with attention. Transformers employ an attention mechanism without an RNN, processing all tokens simultaneously and allocating attention weights among them in layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel at each layer, which leads to improved training speed.

Encoder-Decoder Model

The RNN and the Transformer output the hidden state corresponding to each input token. This means that the output length is equal to the input length. In most sequence-to-sequence problems, the desired output length often differs from the input length.
To handle this, the model can use a prompt token or end-of-string token to denote that the input sequence ends and that the model should start generating the corresponding target sequence. During training, the loss is computed for the tokens after the prompt. Similarly, during inference, the outputs before the prompt are disregarded. This architecture utilizes the same network to encode the input and generate the output sequence. This approach is adopted by models such as GPT [Radford et al. 2018, 2019].

Alternatively, two networks can be used to encode and generate the target sequence. This resembles the encoder-decoder architecture. Firstly, the encoder learns the embedding of each token in the context of the whole sequence. Thereafter, the decoder takes this sequence of embeddings, or a summary thereof, to generate the desired target sequence of tokens. The decoded sequence is conditioned on all the input tokens and the previously generated tokens. This variant is widely used for sequence-to-sequence problems [Mohamed et al. 2021; Tan et al. 2020].

3.2.2 Architecture Overview

The Transformer follows an encoder-decoder architecture that consists of embeddings, multi-headed attention, and FNNs. This is depicted in Figure 3.1. Given a sequence of tokens, the Transformer creates embedding vectors for each token and its position using an input embedding layer and a positional embedding layer. The embeddings are then passed through a series of encoder and decoder blocks on the encoder and decoder side, respectively. In contrast to RNNs, all inputs are fed into the network simultaneously rather than sequentially.

Figure 3.1: Transformer architecture. Adapted from https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html.

Each encoder and decoder block contains attention components that allow each processed token to be aware of relevant information from other tokens in the sequence. The decoder output is passed through a fully connected layer and a softmax function that converts it to a probability for each token in the vocabulary. Teacher forcing is used to train the decoder [Raffel et al. 2020], allowing training to occur in parallel. Since the decoder will not have access to future tokens during inference, an attention mask is used to hide these tokens so that the model does not attend to them during training.

3.2.3 Input Embedding

The first layer in the Transformer is the input, or word embedding, layer. First proposed in Bengio et al. [2000], the input embedding layer takes in a tokenized sequence of numbers over a fixed-size vocabulary Σ and maps these to vectors of continuous numbers. These vectors represent the learned meaning of each token, so tokens that are closer in the embedding space are expected to have similar meanings. The context, semantics, and syntactic properties, as well as the linear relationships between tokens, can be captured by vectors more efficiently than by the discrete entities in the vocabulary [Collobert and Weston 2008; Mikolov et al. 2013].

More formally, given a vocabulary Σ and a sequence of discrete tokens X = {x1, x2, ..., xN}, where xi ∈ Σ, the input embedding (IE) layer learns a mapping of each token to a d-dimensional real-valued vector, IE : N → R^d. The dimension of the embedding d, which represents the width of the Transformer, is a hyper-parameter that is normally chosen to be much smaller than the vocabulary size (d ≪ |Σ|) [Collobert and Weston 2008; Stahlberg 2020]. For instance, the Transformer vocabulary for language modeling typically has up to 50,000 tokens, although the embedding dimension is within the range of 512–1024.
A higher-dimensional embedding can capture fine-grained relationships between words but requires more data to learn. The study carried out by Wies et al. [2021] shows that the ratio between the dimension and the vocabulary size, together with the depth of the Transformer, can have an impact on performance. Furthermore, the dimension d should be divisible by the number of attention heads. Press and Wolf [2016] showed that there is a performance gain when the input embedding layer is shared between the decoder and the encoder where the input and output vocabularies are the same. As a result, this has been adopted in the design of the Transformer [Vaswani et al. 2017; Lewis et al. 2019; Raffel et al. 2020].

3.2.4 Positional Encoding

The input embedding gives the embedding of each token but does not account for the order or position of each token. However, the meaning can change depending on the order of the tokens. For example, "a mosquito bit John" and "John bit a mosquito" have the same tokens, yet they mean different things due to the order of the tokens. RNNs can implicitly learn positional information from their sequential processing. Differing from RNNs, Transformers process data in parallel using an attention mechanism that does not have a notion of order. As such, the Transformer explicitly adds positional information to the input embeddings through this module.

Positional encodings PE map the position of each token to a d-dimensional real-valued vector, PE : R → R^d. This is then summed with the input embedding, E = IE + PE, to produce embeddings with both positional and contextual information. Alternatively, these two vectors can be concatenated; however, this increases the memory footprint and slows down training, as the optimization process needs to adjust more parameters due to the extra dimensions.

The original implementation of the Transformer model used sine and cosine functions of different frequencies to add positional information, given by the following equation:

$$PE_{(pos,\,i)} = \begin{cases} \sin\left(pos/10000^{2i/d}\right) & \text{if } i = 2k \\ \cos\left(pos/10000^{2i/d}\right) & \text{if } i = 2k+1 \end{cases}$$

where pos denotes the position of the token in the input sequence, i is the index of each dimension in the output vector, and k is the offset. This positional encoding can capture absolute and relative positional information. Furthermore, it can scale to unseen sequence lengths. Vaswani et al. [2017] also experimented with learned positional embeddings [Gehring et al. 2017a] and found that the results are on par with the proposed sinusoidal encoding. Subsequent work [Devlin et al. 2018; Radford et al. 2018; Lewis et al. 2019] adopted the learned positional embedding. Wang and Chen [2020] provide an analysis of positional embeddings under different training objectives. The positional encoder is not shared between the encoder and the decoder, in contrast to the input embedding layer. The fully encoded input tokens E are then passed to the multi-headed self-attention layer.
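As a small illustration of the encoding defined above, the following sketch (our own, written with NumPy; the sequence length and dimension are arbitrary assumptions) builds the sinusoidal matrix and adds it to a stand-in input embedding, mirroring E = IE + PE.

```python
# Minimal sketch of sinusoidal positional encoding for a (seq_len, d) embedding matrix,
# following the equation above: sin for even dimensions, cos for odd dimensions.
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d)[None, :]                            # (1, d)
    angles = positions / np.power(10000.0, (2.0 * dims) / d)  # pos / 10000^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimension indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimension indices
    return pe

seq_len, d = 8, 16
token_embeddings = np.random.randn(seq_len, d)              # stand-in for the learned IE output
encoder_input = token_embeddings + positional_encoding(seq_len, d)  # E = IE + PE
print(encoder_input.shape)                                  # (8, 16)
```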
3.2.5 Attention Mechanism

The attention mechanism is a central component of the Transformer architecture, and its operation varies between the encoder, decoder, and the encoder-decoder model. The Transformer treats the input representations E as a set of key-value pairs. For each position j in E, the attention mechanism queries its relationship to all representations in E. This is known as the self-attention mechanism, since the queries and key-value pairs come from the same sequence. Self-attention is used in the encoder and decoder to learn the relationships between the inputs. Furthermore, the decoder has a cross-attention mechanism, where queries come from its inputs and key-value pairs come from the encoder.

More formally, the attention mechanism maps a query vector Q and a set of key-value vector pairs (K, V) to an output vector A. The output is generated by computing a weighted sum of the values. The weight assigned to each value is determined by a compatibility function of the query with the corresponding key. These weights are called attention scores [Vaswani et al. 2017]. Since Q, K, and V are d-dimensional, they can be organized into matrices Q ∈ R^{M×d}, K ∈ R^{N×d}, and V ∈ R^{N×d}, where M is the query sequence length, and M = N for self-attention.

Scaled Dot-Product Attention

The Transformer adopts a scaled version of the dot-product attention [Luong et al. 2015] to compute the attention scores. This is depicted in Figure 3.2b and can be denoted by the following equation:

$$A = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

The similarity between Q and K is computed as their dot product scaled by 1/√d. This is then passed through a softmax function to produce the attention scores. The dot product of the attention scores and V gives the output matrix A ∈ R^{M×d}. The scaling prevents the softmax function from giving values close to 1 for highly correlated vectors and values close to 0 for uncorrelated vectors, making gradients more reasonable for back-propagation. The softmax function normalizes over the columns of the scaled dot product of Q and K so that the weights for each query vector sum to one.

Multi-Head Attention

Instead of performing a single attention function with Q, V, and K, the Transformer performs the scaled dot-product attention h times using separate, identical attention layers. A single layer is called a head, and h is the number of heads in multi-head attention (MHA), represented in Figure 3.2a. For each head, learned linear projections are used to transform Q, K, and V to a d/h-dimensional space; these are then passed through the scaled dot-product attention in parallel, yielding h attention heads with d/h dimensions. These are concatenated and projected again to a d-dimensional matrix. The total computational complexity is comparable to that of single-head attention with full dimensionality because of the reduced dimension of each head. Formally, MHA can be defined as

$$\mathrm{MHA} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{A}$$

where W^A ∈ R^{d×d} and head_i = Attention(QW^Q_i, KW^K_i, VW^V_i), with W^Q_i, W^K_i, W^V_i ∈ R^{d×d/h} for i ∈ [1, h]. With MHA, the model can look at information from different representation subspaces at the same time. Voita et al. [2019] and Clark et al. [2019] provide more information on the different semantics captured by each head.
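The following sketch (our own NumPy illustration; the shapes and the additive-mask convention are assumptions rather than the dissertation's code) implements scaled dot-product attention as defined above, with an optional mask of the kind discussed in the next subsection.

```python
# Minimal sketch of scaled dot-product attention; an optional additive mask
# (0 for allowed positions, -inf for blocked positions) is applied before the softmax.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (M, N) query-key compatibility
    if mask is not None:
        scores = scores + mask                                 # blocked entries become -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # (M, d) weighted sum of values

M = N = 4
d = 8
Q, K, V = (np.random.randn(M, d) for _ in range(3))

# Causal mask used by the decoder's self-attention: position t may only attend to positions <= t.
causal_mask = np.triu(np.full((M, N), -np.inf), k=1)
out = scaled_dot_product_attention(Q, K, V, causal_mask)
print(out.shape)                                               # (4, 8)
```

Multi-head attention would apply this same function h times to d/h-dimensional projections of Q, K, and V and concatenate the results, as described above.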
Masked Multi-Head Attention

The encoder and decoder have different requirements for attention due to their distinct roles in the Transformer model. While the encoder operates on an input sequence, the decoder generates an output sequence. To address this, the model uses masked multi-head attention. In the encoder, there is no need for masking, since all positions in the input sequence can be considered during the self-attention process. However, in the decoder, the self-attention mechanism must prevent information flow from future tokens to the current token, to maintain causality. This is achieved by applying an attention mask. When generating token ŷt at time step t, the model should only attend to the tokens preceding it, i.e., ŷ0, ŷ1, ..., ŷt−1. To enforce this constraint, the attention mask sets all future attentions (i.e., those for t+1, t+2, ..., T) to negative infinity in the input of the softmax. As a result, these future attentions receive zero attention scores and do not influence the current token's generation.

Beyond temporal masking, attention masks are also useful for cases where sequences in a batch have varying lengths. To handle this, special padding tokens are often added to sequences, and the model should be prevented from attending to these padding tokens. The attention mask serves this purpose as well.

Self-Attention

The Transformer employs self-attention [Lin et al. 2017b; Cheng et al. 2016; Parikh et al. 2016] in the encoder and a masked version thereof in the decoder. The representation of each position in the sequence is computed by relating it to every other unmasked position in the sequence. The Transformer depends entirely on the attention mechanism, as opposed to RNNs and Convolutional Neural Networks (CNNs) [Gehring et al. 2017b; Kalchbrenner et al. 2016]. Self-attention is free from the assumption made by RNNs and CNNs that the embedding of one token depends more on its neighboring tokens [Dou 2022].

Cross-Attention

The second MHA layer in the decoder applies cross-attention, where the queries are derived from the decoder and the key-value pairs from the encoder's final outputs. This module is the only connection between the encoder and the decoder and serves to mix the two embedding sequences. As such, the decoder has access to its previously generated tokens through self-attention and to the input tokens via this module.

Figure 3.2: Transformer multi-head attention architecture: (a) multi-head attention, (b) scaled dot-product attention. Adapted from Vaswani et al. [2017].

In summary, self-attention is used in the encoder to model relationships within the input sequence, while masked self-attention is applied in the decoder to ensure causal generation of output tokens. Cross-attention, on the other hand, bridges the gap between the encoder and the decoder, enabling the decoder to incorporate information from both its own generated tokens and the encoder's output sequence. This distinction in how attention is used between the encoder, decoder, and encoder-decoder model is a key feature of the Transformer architecture.

3.2.6 Position-wise Feedforward Neural Network (FNN)

At the end of each encoder/decoder layer, the Transformer inserts a fully connected feedforward network that is applied to each position separately and identically. It consists of two linear transformations with a non-linear activation in between, usually a ReLU [Nair and Hinton 2010] or GeLU [Hendrycks and Gimpel 2016]. This incorporates non-linearity into the model explicitly and strengthens the model capacity [Kim 2022].
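A minimal sketch of this position-wise block is shown below (our own NumPy illustration; the inner dimension d_ff = 4d is a common choice and an assumption, as the text does not specify it here).

```python
# Minimal sketch of the position-wise feedforward block: two linear maps with a
# non-linearity in between, applied independently to every position in the sequence.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def position_wise_ffn(X, W1, b1, W2, b2):
    # X: (seq_len, d); the same weights are shared across all positions.
    return relu(X @ W1 + b1) @ W2 + b2

seq_len, d, d_ff = 8, 16, 64                      # d_ff = 4 * d is an assumed, common choice
X = np.random.randn(seq_len, d)
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)  # (8, 16)
```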
3.2.7 LayerNorm and Residual Connections

To improve the stability of training, the "Add & Norm" module applies a residual connection [He et al. 2016] and layer normalization [Ba et al. 2016] to the output of the MHA module and the feedforward module. Residual connections add the input of a module to its output, improving gradient propagation. This addresses the problem of deterioration of the training accuracy as more layers are added to the neural architecture [He et al. 2016]. Layer normalization adjusts the mean and standard deviation of the output of a layer, which can accelerate training. The output of the "Add & Norm" module can thus be defined as:

$$\mathrm{LayerNorm}(Z_R) = \frac{Z_R - \mathbb{E}[Z_R]}{\sqrt{\mathrm{Var}[Z_R] + \epsilon}} \cdot \gamma + \beta$$

where γ, β are learnable parameters, Z_R = Z + Z_O is the residual output with Z being the input to each sublayer (MHA or feedforward) and Z_O the output from these sublayers. A small value ϵ is added in the denominator for numerical stability. The mean E[Z_R] and variance Var[Z_R] are respectively given by:

$$\mathbb{E}[Z_R] = \frac{1}{N}\sum_{i=1}^{N} Z_R^{(i)}, \qquad \mathrm{Var}[Z_R] = \frac{1}{N}\sum_{i=1}^{N}\left(Z_R^{(i)} - \mathbb{E}[Z_R]\right)^2.$$

3.2.8 Final layer

The output of the final decoder layer passes through a final layer, which acts as a classifier/output embedding layer. This maps the final hidden states of the decoder from d dimensions back to the vocabulary size. The classifier output is then fed into a softmax layer, which produces probability scores between 0 and 1. The index of the highest probability score is taken as the predicted token. Following the work of Press and Wolf [2016], the output embedding layer is tied to the input embedding layer, which reduces the number of parameters and also improves performance.
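As an illustration of the tied output layer described above, the sketch below (ours; the shapes and the greedy argmax step are assumptions) reuses the input embedding matrix as the output projection.

```python
# Minimal sketch of the final layer with tied input/output embeddings: the decoder
# states are projected onto the vocabulary by the transposed embedding matrix.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab_size, d, seq_len = 100, 16, 5
embedding = np.random.randn(vocab_size, d)       # shared by the input lookup and the output projection
decoder_states = np.random.randn(seq_len, d)     # final hidden states of the decoder

logits = decoder_states @ embedding.T            # (seq_len, vocab_size)
probs = softmax(logits)                          # probability over the vocabulary per position
predicted_tokens = probs.argmax(axis=-1)         # greedy choice of the most likely token
print(predicted_tokens.shape)                    # (5,)
```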
• Symmetry: The distance from x_i to x_j is always the same as the distance from x_j to x_i, i.e., D(x_i, x_j) = D(x_j, x_i).

• Triangular inequality: The distance between two points cannot be greater than the sum of the distances through any third point, D(x_i, x_j) ≤ D(x_i, x_h) + D(x_h, x_j), ∀x_i, x_j, x_h ∈ X.

One other property of distance metrics not mentioned above is the identity of indiscernibles, which states that if D(x_i, x_j) = 0, then x_i = x_j. This property does not need to hold for metric learning; as such, we consider pseudo-distances [Collatz 2014]. Furthermore, for the purpose of training our models, it is usually sufficient to calculate some transformation of the distance metric that still gives us a similarity measure between x_i and x_j.

Euclidean Distance

The Euclidean distance is one of the most common distance measures. Also known as the L2 metric, it can be described as the length of a line segment between two points in Euclidean space. Using the Pythagorean theorem [Weisstein 2006], the square of the Euclidean distance can be calculated from the Cartesian coordinates of the points:

$$D(x_i, x_j) = \lVert x_i - x_j \rVert^2$$

The square of the Euclidean distance is often used since it avoids computing the square root, making it more efficient from a computational perspective. Although it is not strictly a distance, it retains the most useful properties of a distance metric from a learning perspective. The Euclidean distance is not scale invariant, which can result in skewed distances; as such, it is beneficial to standardize/normalize the data before use. Moreover, as the dimensionality d increases, the Euclidean distance becomes less meaningful. It also fails to capture non-linearity in the data since the distance is isotropic, i.e., the same in every direction.

The Mahalanobis distance is a more general, non-isotropic case of the Euclidean distance, given by D(x_i, x_j) = ||Wx_i − Wx_j||^2, where W is a linear transformation matrix found by decomposing the inverse of the positive-definite covariance matrix M = W^TW of a given probability distribution. For the Euclidean distance, W is the identity matrix. This distance is prone to overfitting due to its dependence on the mean, and the linear transformation W is not always able to capture the inherent non-linearity of the data.

Cosine Distance

The cosine distance uses cosine similarity to determine the distance between data points. Cosine similarity is defined as the cosine of the angle between two vectors in an inner product space, which can be computed as the dot product of the two vectors normalized by their magnitudes:

$$\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$$

Cosine similarity ranges between −1 and 1, where 1 means the angle between x_i and x_j is 0, i.e., they are similar. To ensure that the resulting measure is non-negative, we can define the cosine distance as:

$$D(x_i, x_j) = 1 - \mathrm{sim}(x_i, x_j)$$

The cosine distance measures the dissimilarity between x_i and x_j. Since the cosine similarity is computed in Euclidean space, we can derive the Euclidean distance equivalent to the cosine distance. When x_i and x_j are normalized to unit vectors such that ||x_i|| = ||x_j|| = 1, we have:

$$\lVert x_i - x_j \rVert^2 = \lVert x_i \rVert^2 + \lVert x_j \rVert^2 - 2\, x_i \cdot x_j = 2\,(1 - \mathrm{sim}(x_i, x_j))$$

As we can see, the cosine distance is given by \frac{\lVert x_i - x_j \rVert^2}{2}. While the cosine distance is not a metric itself, it is (up to a constant factor) the square of the Euclidean distance, and the square root thereof produces a distance metric [Schubert 2021]. This suffices for metric learning.
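The relationship between the squared Euclidean distance and the cosine distance on unit vectors can be checked numerically; the snippet below is a small illustrative sketch rather than code from this project.

```python
import torch
import torch.nn.functional as F

def squared_euclidean(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return ((a - b) ** 2).sum(dim=-1)

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

x, y = torch.randn(4, 16), torch.randn(4, 16)
xn, yn = F.normalize(x, dim=-1), F.normalize(y, dim=-1)   # unit vectors
# On unit vectors, ||x - y||^2 == 2 * (1 - cos_sim(x, y))
assert torch.allclose(squared_euclidean(xn, yn), 2 * cosine_distance(xn, yn), atol=1e-5)
```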
From a computational perspective, the dot product is faster because it saves one vector subtraction and does not suffer from "catastrophic cancellation" for small distances [Schubert and Gertz 2018; Lang and Schubert 2020]. As such, it is more efficient to compute the Euclidean distance as a dot product. The unnormalized Euclidean distance is commonly used on dense, continuous variables where dimensional magnitudes matter, whereas the normalized Euclidean distance is commonly used on sparse, discrete domains such as text, since it is more robust to variations in occurrence counts between semantically related terms.

Similar to the Euclidean distance, the cosine distance suffers from the curse of dimensionality, though to a lesser extent. As the dimensionality increases, the number of data points required for good performance of any machine learning algorithm increases exponentially. Chen et al. [2020b] found that projecting the embedding vectors F_θ(·) to a lower dimension before applying the distance measures produces better results. In addition, mining methods and loss functions are designed to be more robust to this effect.

3.3.2 Losses

The loss function is an objective function that our model learns to minimize or maximize. This function produces a scalar value that indicates how well the model fits the data; therefore, improvements in this number correspond to a better model [Goodfellow et al. 2016; Reed and Marks II 1999]. The metric learning objective predicts relative distances between inputs, which differs from the objective functions of classification and regression problems, whose goal is to predict a label, a value, or a set of values given an input.

Standard regression and classification algorithms can be categorized as point-wise loss functions, where a single data point is considered at a time by the loss function. Metric learning, on the other hand, considers a pair or a list of data points at a time. That is, the metric learning objective trains the model F_θ to produce similar representations by minimizing the embedding distance when the data points have the same label; otherwise, the model should produce distant representations, i.e., maximize the embedding distance.

Pairwise Loss

The pairwise loss is one of the first metric learning losses proposed [Chopra et al. 2005; Hadsell et al. 2006]. It follows a simple and intuitive objective that directly minimizes the pairwise distance between two samples, x_1 and x_2, with the following formulation:

$$\mathcal{L} = \mathbb{I}_{y_1 = y_2}\, D(x_1, x_2) + \mathbb{I}_{y_1 \neq y_2}\, \max\left(0,\; m - D(x_1, x_2)\right)$$

where m > 0 is a hyper-parameter margin defining the lower bound on the distance between samples of different labels. When the representations produced for a negative pair are sufficiently distant, no effort is wasted in enlarging that distance, so further training can focus on more difficult pairs.

Triplet Loss

The pairwise loss uses two samples; however, more samples can be used. For instance, Wang et al. [2014], Schroff et al. [2015], Cui et al. [2016] and Hermans et al. [2017] use triplets instead of pairs and report improvements in performance. Triplet-based losses consist of an anchor x, a positive sample x^+, and a negative sample x^−. The triplet loss learns a distance metric by which the anchor is further from the negative sample than it is from the positive sample by a margin, D(x, x^−) > D(x, x^+) + m. The loss can be formulated as:

$$\mathcal{L} = \max\left(0,\; m + D(x, x^+) - D(x, x^-)\right)$$

where m is the margin.
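For illustration, the pairwise (contrastive) and triplet objectives above can be sketched over precomputed distances as follows; the function names and margin values are assumptions made for the example.

```python
import torch

def pairwise_loss(d: torch.Tensor, same_label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    # d: distances for a batch of pairs; same_label: boolean tensor of the same shape
    pos = same_label.float() * d                                   # pull positives together
    neg = (~same_label).float() * torch.clamp(margin - d, min=0.0) # push negatives beyond the margin
    return (pos + neg).mean()

def triplet_loss(d_ap: torch.Tensor, d_an: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    # d_ap: anchor-positive distances; d_an: anchor-negative distances
    return torch.clamp(margin + d_ap - d_an, min=0.0).mean()
```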
The margin prevents a trivial solution to the loss function, which is found when both distances are equal to zero, D(x, x^+) = D(x, x^−) = 0. Since the network is updated with both similar and dissimilar samples, the triplet loss requires fewer samples for convergence. However, the performance of the triplet loss is highly dependent on the sampling process of x^+ and x^−. If the positive is already closer to the anchor than the negative, the model will learn almost nothing since the requirement on the relative distance of the embeddings is already satisfied; however, if the sampled positive is far away from the anchor, or a negative is very close to it, the model improves much more during each training step [Kaya and Bilge 2019].

The triplet loss only compares a sample to one negative sample during an update, ignoring the other negative samples. The embedding vector for a sample is only guaranteed to be far from the selected negative sample, not from all other negatives. Thus, an example can be distinguishable from a few negative samples while remaining close to many other negatives. After looping over enough randomly sampled triplets, the final distance metric should be balanced; however, individual updates may be unstable and converge slowly [Sohn 2016; Weng 2021].

N-pair Loss

Sohn [2016] proposed to extend the triplet loss to include a comparison with multiple negative samples. The proposed N-pair loss uses one anchor x, one positive sample x^+, and N − 1 negative samples {x_i^−}_{i=1}^{N−1}. If N = 2, the N-pair loss reduces to the triplet loss. Mathematically, this can be denoted as follows:

$$\mathcal{L} = \log\left(1 + \sum_{i=1}^{N-1} \exp\left(D(x, x_i^-) - D(x, x^+)\right)\right) = -\log \frac{\exp(D(x, x^+))}{\exp(D(x, x^+)) + \sum_{i=1}^{N-1} \exp(D(x, x_i^-))}$$

The form of the N-pair loss is similar to the softmax loss. Although the softmax loss may assist the model in converging to the optima, it does not explicitly keep positive pairs close and negative pairs far apart in the way that metric losses do; it only learns relatively indistinct features. As a result, several methods combine metric learning with the softmax loss. Several proposals [Goldberger et al. 2004; He et al. 2020a; Oord et al. 2018; Gutmann and Hyvärinen 2010 2012] follow the same structured loss function with different distance measures D. Increasing the number of negative samples N has been found to improve the contrastive power of the model [Khosla et al. 2020; He et al. 2020a; Chen et al. 2020b; Tian et al. 2020; Henaff 2020].

Supervised Contrastive Learning

A more generalized form of the N-pair loss was proposed by Khosla et al. [2020], which caters to multiple positive samples in addition to multiple negatives. The proposed SupCon loss aims to address the weaknesses of the N-pair loss and the softmax function. The heavy dependence on hard mining of the N-pair loss is minimized by using many positives and many negatives for each anchor. Moreover, while the softmax loss has been shown to outperform the N-pair loss for classification tasks [Horiguchi et al. 2019], it is prone to producing poor margins. The SupCon loss is designed so that it can effectively use the ground truth targets to embed samples with the same label closer than samples with different labels. The mathematical formulation is given by

$$\mathcal{L} = \sum_{i=1}^{2N} \frac{-1}{2|N_i| - 1} \sum_{j \in N_i,\, j \neq i} \log \frac{\exp(D(x_i, x_j^+))}{\sum_{k \in I,\, k \neq i} \exp(D(x_i, x_k^-))}$$

where N_i = {j ∈ I : ỹ_i = ỹ_j} is the set of sample indices sharing the label of x_i and |N_i| is its cardinality. Khosla et al.
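The following is a minimal sketch of a SupCon-style loss, assuming dot-product similarity on L2-normalized embeddings with a temperature; it treats every other same-label sample in the batch as a positive and is intended only to illustrate the structure of the loss, not to reproduce any particular implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.1):
    """Supervised-contrastive sketch: average log-probability of positives per anchor."""
    z = F.normalize(embeddings, dim=-1)                         # (B, d) unit vectors
    sim = z @ z.T / temperature                                 # pairwise similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))             # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log-softmax over all other samples
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts       # per-anchor loss
    return loss[pos_mask.sum(dim=1) > 0].mean()                 # anchors without positives are skipped

# usage sketch: supcon_loss(torch.randn(8, 32), torch.randint(0, 3, (8,)))
```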
[2020] found that including more positive samples in the set N_i leads to better results. Furthermore, SupCon was found to be less sensitive to hyper-parameter changes and noise [Khosla et al. 2020].

Multi-Similarity Loss

The drawbacks of conventional loss functions such as the triplet loss and the contrastive loss, which sometimes lack discriminative power when applied to large datasets, are addressed by the multi-similarity (MS) loss [Wang et al. 2019b]. Traditional loss functions, which are limited in their ability to capture complex similarity relationships in high-dimensional feature spaces, only consider pairwise similarity relationships between samples. The idea behind the MS loss is to measure similarity in more than one direction, as opposed to just one or two, as is the case with conventional loss functions. The loss function then encourages the model to learn a feature representation that maximizes the similarity between positive examples and minimizes the similarity between negative examples:

$$\mathcal{L} = \frac{1}{|X^+ \cup X^-|}\left[\frac{1}{\alpha}\log\Big(1 + \sum_{(x, x^+) \in X^+} \exp\big(-\alpha\,(D(x, x^+) - m)\big)\Big) + \frac{1}{\beta}\log\Big(1 + \sum_{(x, x^-) \in X^-} \exp\big(\beta\,(D(x, x^-) - m)\big)\Big)\right]$$

where α and β are hyper-parameters and m is a margin.

3.3.3 Miners

Deep metric learning relies on the ability to learn meaningful representations of data, where the similarity or dissimilarity between data points is crucial. Unlike conventional classification or regression tasks, in which random sampling of data points often suffices, deep metric learning requires a more nuanced approach to sample selection. The reason lies in the fundamental objective of the model: instead of predicting labels, it aims to understand the relationships between data points, learning to push similar items closer and dissimilar ones apart [Bucher et al. 2016; Kaya and Bilge 2019; Wang et al. 2019b]. The effectiveness of a deep metric learning network depends not only on the quality of its mathematical models and architectures but also on the discriminative power of the samples it encounters during training. To facilitate the network's learning and enhance the quality of the learned representations, it is imperative to provide it with training instances that challenge its ability to distinguish between data points.

In this context, sample selection plays a pivotal role as a preprocessing step before applying the deep metric learning model. Selecting the right training examples to feed the network is akin to setting the stage for the model's success [Schroff et al. 2015]. To train the model effectively to learn a meaningful distance metric, it is essential to sample pairs of positive and negative data points from the datasets. This process of sample selection and pair creation is often referred to as "mining". The importance of mining in the context of deep metric learning has garnered considerable attention in the research community, with various techniques and strategies proposed to optimize the learning process [Wang et al. 2018a 2014; Wu et al. 2017; Hermans et al. 2017; Schroff et al. 2015; Sikaroudi et al. 2020].

Anchor, Positive and Negative Samples

Let X be the set of examples and Y be the set of corresponding labels, denoted as {(x_i, y_i)}. Within this context, we define the sets of pairs as follows. The positive pair set, denoted X^+, consists of pairs whose labels match, that is, X^+ = {(x, x^+) ∈ X × X | y = y^+}. The negative pair set, denoted X^−, comprises pairs with differing labels, which can be represented as X^− = {(x, x^−) ∈ X × X | y ≠ y^−}.
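As a rough illustration, the MS loss for a single anchor can be sketched over its positive and negative cosine similarities, following the similarity-based form of Wang et al. [2019b]; the averaging over anchors and the accompanying mining step are omitted here, and the parameter values are illustrative assumptions only.

```python
import torch

def ms_loss_single_anchor(sim_pos: torch.Tensor, sim_neg: torch.Tensor,
                          alpha: float = 2.0, beta: float = 50.0, thresh: float = 0.5):
    """Multi-similarity-style loss for one anchor, given its positive and negative
    cosine similarities (a sketch of the similarity-based formulation)."""
    pos_term = torch.log1p(torch.exp(-alpha * (sim_pos - thresh)).sum()) / alpha
    neg_term = torch.log1p(torch.exp(beta * (sim_neg - thresh)).sum()) / beta
    return pos_term + neg_term
```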
In this context, the samples x, x^+, and x^− are typically referred to as the anchor, positive, and negative samples. The primary objective is to train the model F_θ such that, for every anchor, the positive sample embeddings, as measured by some distance function D, are closer to the anchor than the negative sample embeddings by a margin m. Formally, this can be expressed as:

$$D(x, x^+) + m < D(x, x^-), \quad \forall x \in X,\; (x, x^+) \in X^+,\; (x, x^-) \in X^- \tag{3.1}$$

Easy, Semi-hard and Hard Samples

Given an anchor, positive, and negative sample combination (x, x^+, x^−), the inequality in Equation (3.1) may or may not be violated. When the inequality is not violated, the positive and negative samples (x^+, x^−) are considered easy. Since the model already gets them right, easy samples do not improve the network when trained on. Semi-hard samples occur when the inequality in Equation (3.1) is violated only because of the margin m, that is, D(x, x^+) < D(x, x^−) < D(x, x^+) + m. In other words, the negative sample is within a margin's distance of the positive. When the negative sample is closer to the anchor than the positive sample, (x^+, x^−) is considered a hard sample. This is denoted as D(x, x^−) < D(x, x^+). Semi-hard and hard samples represent samples that the model has not yet mastered. Therefore, we need to mine these samples so that we can train the model on them.

Offline Mining

Mining for hard samples can be done offline, before the batch is created. The whole training set is passed through the model in order to obtain the model embeddings. Subsequently, the distances are computed, and the anchor, positive, and negative sets (x, x^+, x^−) are dropped if they are easy. Different implementations keep only the hardest, hard, or semi-hard samples, or a combination of these. The remaining valid sets are divided into K batches. In this way, offline mining creates batches that contain the most informative samples.

This technique computes the embeddings of the full training set to generate batch samples, which incurs memory and computation overhead. Moreover, the set of valid samples can change after each gradient update. This means that the offline-mined sets need to be updated frequently, after several steps or at the end of every epoch, to avoid overfitting, which renders offline mining inefficient. Offline mining can be adopted when the datasets are small or when the hardness of the samples can be computed independently of the model embeddings, where these drawbacks are less dire [Prétet et al. 2020].

Online Mining

Alternatively, mining can be done online, within each batch. First introduced by Schroff et al. [2015], online mining aims to reduce the computation and memory burden of offline mining by only considering the positive and negative samples in the sampled batch. Furthermore, this method uses the up-to-date weights of the model to compute the embeddings. For each class label, n samples are selected for every batch. These provide the anchors and positive samples; every other sample in the batch with a different label is considered a negative sample. This will generate easy, semi-hard, and hard samples. All valid triplets or only the hardest samples are used in the loss function to update the weights, where the latter often performs best [Schroff et al. 2015; Hermans et al. 2017]. The online option is the preferable choice for larger datasets, since it can be computationally prohibitive to compute and sort the pair-wise distances between all samples [Schroff et al. 2015; Musgrave et al. 2020].
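A small illustrative helper that classifies triplets into easy, semi-hard, and hard according to Equation (3.1) might look as follows; the function and variable names are assumptions made for the example.

```python
import torch

def triplet_hardness(d_ap: torch.Tensor, d_an: torch.Tensor, margin: float) -> list[str]:
    """Label each (anchor, positive, negative) triplet from its two distances."""
    labels = []
    for ap, an in zip(d_ap.tolist(), d_an.tolist()):
        if an < ap:                      # negative closer to the anchor than the positive
            labels.append("hard")
        elif an < ap + margin:           # Equation (3.1) violated only through the margin
            labels.append("semi-hard")
        else:                            # constraint satisfied: contributes no gradient
            labels.append("easy")
    return labels
```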
Online mining has been widely adopted in the literature [Wang et al. 2019c; Liu et al. 2019a; Lee et al. 2018; Chen et al. 2020a]. However, online mining requires large batch sizes to include enough distinct negative data to push the model to acquire representations that distinguish the various examples [Oh Song et al. 2017; Kaya and Bilge 2019].

Online Batch Hard Mining

In online batch hard mining, we seek the hardest positive and hardest negative for each anchor. A positive sample x^+_h and a negative sample x^−_h are said to be the hardest if the following holds:

$$x^+_h = \underset{(x, x^+) \in X_B^+}{\arg\max}\; D(x, x^+), \qquad x^-_h = \underset{(x, x^-) \in X_B^-}{\arg\min}\; D(x, x^-)$$

where X^+_B ⊆ X^+ and X^−_B ⊆ X^− are the subsets of the positive and negative sets within the current batch. Mining for the hardest samples is computationally taxing, although relatively less so than doing it offline. Furthermore, the hardest sample in online mining is relative to the batch, which makes it a moderately hard sample overall and less prone to causing the model to overfit. Hard sampling provides the biggest gradient updates, which can speed up convergence. However, mining very hard samples early during training can cause the model to get stuck in local minima and, consequently, lead to model collapse [Hermans et al. 2017; Harwood et al. 2017; Wu et al. 2017]. Semi-hard sampling avoids overfitting on outliers and improves robustness; however, only a few semi-hard samples can be generated, requiring large batch sizes. A curriculum in which harder samples are gradually introduced was found to perform better than semi-hard mining [Harwood et al. 2017].

Online Batch All Mining

Rather than mining for the hardest samples, all valid combinations can be considered for the loss computation. This strategy is known as batch-all mining. For B samples in a batch X_B with n samples per label, taking all positive pairs X^+_B = {(x, x^+) ∈ X^+ | (x, x^+) ∈ X_B} and negative pairs X^−_B = {(x, x^−) ∈ X^− | (x, x^−) ∈ X_B}, we have a total of B(B − n)(n − 1) triplets that contribute to the loss.

The batch-all approach results in less computational overhead than batch-hard online mining. However, it was found to perform consistently worse [Ding et al. 2015; Wang et al. 2016a]. Hermans et al. [2017] found that the inclusion of easy triplets was the major cause, and that only considering combinations that yield a non-zero loss improves performance. Khosla et al. [2020] and Wang et al. [2019b] propose loss functions that intrinsically perform hard mining by down-weighing the contribution of easy samples.

Multi-Similarity Miner

Wang et al. [2019b] proposed a mining technique in which negatives are selected if they are closer to an anchor than its most difficult positive, and positives are selected if they are further away from an anchor than its most difficult negative. More formally, for positive samples X^+_B ⊆ X^+ and negative samples X^−_B ⊆ X^− in a batch, and a margin m, (x, x^+) and (x, x^−) are mined for the loss if the following holds:

$$D(x, x^-) > \min_{(x, x^+) \in X_B^+} D(x, x^+) - m, \qquad D(x, x^+) < \max_{(x, x^-) \in X_B^-} D(x, x^-) + m$$

3.4 Pooling

Processing sequential data yields the same number of embedding vectors as there are tokens in the input sequence. That is, the embeddings are at the token level, with a variable length.
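For illustration, batch-hard mining can be sketched as follows: for each anchor in the batch, the farthest same-label sample and the closest different-label sample are selected. The helper below is a sketch that assumes each label occurs at least twice in the batch; the names are chosen for the example.

```python
import torch

def batch_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor):
    """Per anchor: hardest positive (farthest same-label) and hardest negative (closest other-label)."""
    dist = torch.cdist(embeddings, embeddings)                  # (B, B) pairwise distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)

    pos_mask = same & ~eye
    neg_mask = ~same
    hardest_pos = (dist * pos_mask).max(dim=1).values           # farthest positive per anchor
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values
    return hardest_pos, hardest_neg                             # feed these into a triplet loss
```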
It is often desirable to distill the information encoded by the individual token embedding vectors into one fixed-length vector that represents the context of the sequence, that is, a sequence embedding. The context vector can then be used for classification and metric learning tasks such as sentiment analysis, natural language inference, and text similarity [Farouk 2019; Kowsari et al. 2019; Williams et al. 2017; Reimers and Gurevych 2019].

Pooling is a procedure that can be used to produce such a representative context vector. Several pooling methods can be used, such as simply taking the average of all the token embeddings or taking the maximum value in each dimension across all tokens. Reimers and Gurevych [2019] found that the significance of the pooling procedure depends on the objective function and the datasets used. As a result, it is worth experimenting with different pooling strategies. In the following subsections, we discuss several pooling methods that can be used to create a context vector: single token pooling (Section 3.4.1), max-pooling (Section 3.4.2), average pooling (Section 3.4.3), attentive pooling (Section 3.4.4), and mixed pooling (Section 3.4.5).

3.4.1 Single Token Pooling

Early sequential processing techniques, such as RNNs, produce a context vector at the end of a sequence that stores the information of all the tokens seen. As such, for a sequence wrapped in two special tokens, a beginning-of-string token bos at the start and an end-of-string token eos at the end, the context vector in RNNs is the hidden state at the eos token. Transformers, on the other hand, do not have this leverage. Devlin et al. [2018] train their proposed Transformer architectures to produce a context vector through the bos token. The bos token, also called the CLS token, is more consistent than the eos token since it occurs at the same position every time. It performs well for the classification objective under which it was trained; however, it produces poor results for metric learning tasks [Reimers and Gurevych 2019]. Furthermore, pooling the bos token of models that were not pre-trained to produce a context vector through this token can lead to poor performance.

3.4.2 Max-pooling

Another common strategy to produce a compressed representation is to pool the maximum value over time for every dimension [Kalchbrenner et al. 2014; Joulin et al. 2016]. Formally known as max-pooling [LeCun et al. 1998], this method extracts the most prominent features across all dimensions. Furthermore, max-pooling is invariant to sentence length, as opposed to average pooling and attentive pooling [Suárez-Paniagua and Segura-Bedmar 2018]. Max-pooling is particularly well suited to separating features that are very sparse [Boureau et al. 2010a,b]. With very long sequences, however, the information from most token embeddings will be dropped.

To preserve more information, the more general k-max-pooling method can be employed [Kalchbrenner et al. 2014]. For a sequence of N d-dimensional token embeddings E ∈ R^{N×d} and k ≤ N, the k-max-pooling operation pools the k most active features in E, producing k vectors E^k_max, each in R^d. The hyper-parameter k can be set statically or dynamically as a function of N to allow for smooth extraction of features [Kalchbrenner et al. 2014; Shu et al. 2018]. To produce one context vector, Zhao et al. [2018] sum E^k_max and scale the result by a factor s.
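The single-token (CLS), max, and average pooling strategies discussed above can be sketched in a few lines; the function signature and masking convention below are assumptions made for the example.

```python
import torch

def pool_sequence(token_embs: torch.Tensor, mask: torch.Tensor, strategy: str = "mean"):
    """Pool (B, T, d) token embeddings into (B, d) context vectors.
    mask: (B, T) with 1 for real tokens and 0 for padding."""
    if strategy == "cls":                                    # first (bos/CLS) token
        return token_embs[:, 0]
    if strategy == "max":                                    # per-dimension max over real tokens
        masked = token_embs.masked_fill(mask[..., None] == 0, float("-inf"))
        return masked.max(dim=1).values
    if strategy == "mean":                                   # average over real tokens only
        summed = (token_embs * mask[..., None]).sum(dim=1)
        return summed / mask.sum(dim=1, keepdim=True).clamp(min=1)
    raise ValueError(strategy)
```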
If s = k, the context becomes the mean of E^k_max, and if k = N, the scaled k-max-pooling is equivalent to average pooling.

3.4.3 Average Pooling

Average pooling [LeCun et al. 1998 1989; Reimers and Gurevych 2019] computes the context vector by taking the element-wise arithmetic mean of the token embeddings. Unlike max-pooling, average pooling down-weighs the most activated features by combining them with the non-maximal features. This makes the average pooling procedure better suited to preserving background information.

3.4.4 Attentive Pooling

Another option for producing the context vector is to use attention [Luong et al. 2015; Bahdanau et al. 2014] to obtain a weighted sum of the inputs instead of a plain average. Not all tokens in the sequence contribute equally to the meaning of the sentence. The attentive pooling mechanism learns to weigh the relevant tokens more than the irrelevant ones to produce a context embedding, and the relevance weights are learned alongside the model. Given a sequence of N embeddings E = {e_1, e_2, ..., e_N}, the context vector is produced by the following:

$$E_{seq} = \sum_{i=1}^{N} \alpha_i e_i, \qquad \alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_{j=1}^{N} \exp(u_j^\top u_s)}, \qquad u_i = \tanh(W_s e_i + b_s)$$

where E_seq is the sentence embedding, α_i is the attention weight corresponding to the i-th token embedding, and W_s, b_s, and u_s are trainable parameters. The sentence-level context vector u_s can be randomly initialized and jointly learned during the training process [Yang et al. 2016; Wu et al. 2020a]. Alternatively, u_s can be set to the unit matrix [Lin et al. 2017b; Chen et al. 2018]. The attention mechanism can also be extended to multi-head attention [Chen et al. 2018; Wu et al. 2020c].

Unlike the mean and max pools, the attentive pool contains trainable parameters. This allows the model to choose which tokens it finds most useful for the sequence context, at the expense of additional parameters and possibly training time. Furthermore, under low-data settings, the attentive pool can lead to over-fitting [Li et al. 2018].

3.4.5 Mixed Pooling

It is also possible to use more than one pooling technique. For instance, Yu et al. [2014], Lee et al. [2016], Zhao et al. [2018] and Zhou et al. [2021] saw performance gains when they combined average pooling and max-pooling, which combines the advantages of each strategy. The max pool or average pool can also be used as the query vector for an attentive pooling procedure [Maini et al. 2020; Liu et al. 2016]. Sahu and Anand [2018] and Suárez-Paniagua and Segura-Bedmar [2018] concatenate the results of multiple pooling strategies. Concatenating these methods, however, introduces more parameters that take more time to learn or tune and increases computational overhead. Gholamalinezhad and Khosravi [2020] and Zafar et al. [2022] review several other pooling techniques used in the literature.

3.5 Natural Language Processing

The challenge of solving Math Word Problems (MWPs) falls within the domain of Natural Language Processing (NLP). More specifically, MWPs are presented as textual data and fed through an NLP model, which is expected to generate a mathematical expression that solves the problem. NLP models process text in its token form, which is produced through a process known as tokenization. The objective of solving MWPs can be presented to the model as a sequence generation task, where the model needs to generate the tokens corresponding to the correct solution, or as a ranking task, where the model generates scores for several hypotheses from which the final solution is chosen.
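A minimal sketch of an attentive pooling module corresponding to the equations above is given below; the parameter names mirror W_s, b_s, and u_s, while the module name and masking convention are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Weighted-sum pooling with a learned sentence-level query u_s."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)     # implements W_s and b_s
        self.u_s = nn.Parameter(torch.randn(d_model))

    def forward(self, token_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # token_embs: (B, T, d); mask: (B, T), 1 for real tokens
        u = torch.tanh(self.proj(token_embs))                   # u_i = tanh(W_s e_i + b_s)
        scores = u @ self.u_s                                   # (B, T) relevance logits
        scores = scores.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)                    # attention weights alpha_i
        return (alpha.unsqueeze(-1) * token_embs).sum(dim=1)    # (B, d) context vector
```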
In either case, the model first encodes each of the tokens into an embedding space and then uses these embeddings to generate the desired outcome. The point-wise ranking task can be viewed as a metric learning task: with the input MWP as the anchor, the model learns to score positive solutions higher than negative solutions [McFee and Lanckriet 2010; Cakir et al. 2019].

The remainder of this section is organized as follows. In Section 3.5.1, we delve into the tokenization process, encompassing word-level, character-level, subword-level, and byte-level tokenizers. Subsequently, in Section 3.5.2, we explore two generative decoding techniques, namely greedy search and beam search. We then discuss the use of label smoothing and teacher forcing, which can effectively improve the generation objective during training.

3.5.1 Tokenization

Tokenization is an integral part of virtually every NLP task and is arguably the first linguistically motivated preprocessing step, because it identifies the basic units on which all other processing is based [Michelbacher 2013]. Tokenization breaks raw text into discrete pieces representing words, phrases, symbols, or other meaningful elements called tokens [Gupta and Malhotra 2015; Verma et al. 2014]. The tokens can be treated as atomic components of the text and, in turn, form the building blocks of NLP [Lyon 2022]. Having tokenized our dataset, we can create our vocabulary, which is the set of unique tokens used in the text corpus. Text can be tokenized using a variety of methods, including word-level, character-level, and subword tokenization. Inadequate tokenization of any portion of the text can result in errors later in the NLP pipeline.

Word-level Tokenization

A relatively simple and straightforward approach to splitting text into words or tokens is to use a delimiter, such as white space or punctuation. More advanced rules can be incorporated to handle hyphens, suffixes, and prefixes [Honnibal and Montani 2017; Speer 2019; Koehn et al. 2007]. Word tokenization collectively refers to approaches that split text at the word level.

An issue with word tokens is that taking every unique word in a large text corpus can lead to a very large vocabulary. For instance, Dai et al. [2019] use a word-level tokenizer that generates a vocabulary four times larger than that of single-language pre-trained Transformers using more granular tokenization techniques. A large vocabulary necessitates an enormous input and output embedding layer, which increases the model's memory and time complexity. Dealing with words that are "out of vocabulary" (OOV) is another significant challenge associated with word tokens. Most of the time, this problem is handled by mapping all unknown words to a single 'unknown' token. This presents a number of difficulties, one of which is that the meaning of many of the OOV tokens is lost.

Character-level Tokenization

Rather than splitting text into words, it can be split into characters; this is exactly what character-level tokenization does. This resolves the above-mentioned issues with the word-level tokenization technique. The vocabulary is relatively small: it can have a size of 26 for a text-only, case-insensitive English corpus, or 256 to incorporate all the ASCII codes (https://www.asciitable.com/).
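For illustration, a naive word-level tokenizer with an 'unknown' token for OOV words can be sketched as follows; the regular expression, special tokens, and vocabulary cap are assumptions made for the example.

```python
import re
from collections import Counter

def word_tokenize(text: str) -> list[str]:
    """Naive word-level tokenizer: split on whitespace and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus: list[str], max_size: int = 10_000) -> dict[str, int]:
    counts = Counter(tok for doc in corpus for tok in word_tokenize(doc))
    vocab = {"<pad>": 0, "<unk>": 1}                      # OOV words map to <unk>
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    return [vocab.get(tok, vocab["<unk>"]) for tok in word_tokenize(text)]
```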
Character-level tokenization can also handle OOV words coherently by preserving the information of the word: it breaks the OOV word down into characters and represents the word in terms of these characters. This method is particularly useful for languages where individual characters carry important information, such as Chinese or Japanese. This strategy has been adopted by Xue et al. [2022] to design a token-free Transformer that can be used in unilingual and multilingual settings. A character-level tokenizer also requires no training to learn a vocabulary.

Several drawbacks arise with this tokenization technique, one being the length of the token sequences it produces. Because each word is now split into its individual characters, the length of the tokenized sequence can easily surpass that of the original text. In addition, the model has to spend more capacity to reach a higher-level representation compared to other tokenization methods, since the individual tokens on their own do not carry any semantic meaning.

Subword Tokenization

The primary goal of subword-based tokenization techniques is to solve the large vocabulary size and the significant number of OOV tokens in word-based tokenization, as well as the challenges presented by character-based tokenization, which results in extremely long sequences and individual tokens with little significance. To do this, subword-based tokenization algorithms do not split the frequently used words, but break the rare words into smaller, meaningful subwords. Some of the popular subword tokenization algorithms are WordPiece [Devlin et al. 2018], Byte-Pair Encoding (BPE) [Sennrich et al. 2015], Unigram, and SentencePiece [Kudo and Richardson 2018]. Our research adopts the BPE tokenizer, in particular the byte-level BPE tokenizer used in our pre-trained model [Lewis et al. 2019; Radford et al. 2019]. This is crucial because a pre-trained model can only work successfully if it is fed input that has been tokenized using the same rules as were used to tokenize its training data.

Byte-level BPE Tokenization

The byte-level BPE (BBPE) tokenizer is a variant of the BPE tokenizer [Gage 1994; Sennrich et al. 2015], which was initially developed as a text compression algorithm and has been widely adopted for pre-trained language models [Liu et al. 2019c; Radford et al. 2018 2019; Lewis et al. 2019; Lample and Conneau 2019]. While standard BPE uses characters as the base tokens, BBPE uses bytes, such as those of a Unicode encoding, as the fundamental units of tokenization. This allows the tokenization to operate at a finer granularity, similar to character tokenization. The tokenization process can either remain at the byte level or use Byte-Pair Encoding to merge bytes into larger units, depending on the specific requirements of the task.

In practice, the byte-level BPE tokenization process begins by creating a base vocabulary of bytes. Using these byte-level base tokens, the BBPE algorithm identifies the most frequent "byte pairs" and replaces them with a new, combined token that represents both of the original tokens. This process continues until a predetermined number of merge operations has been completed or the desired vocabulary size has been achieved. The number of merge operations and the size of the target vocabulary are configurable hyper-parameters [Wang et al. 2020; Radford et al. 2019]. The benefit of BPE is that it can capture both common and uncommon words in a corpus.
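The merge procedure can be illustrated with a toy character-level BPE trainer; a true byte-level BPE would start from the 256 byte values rather than characters, and production tokenizers include many further details that are omitted in this sketch.

```python
from collections import Counter

def learn_bpe_merges(words: list[str], num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)            # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]          # most frequent adjacent pair
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():             # replace the pair with a merged symbol
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges

# e.g. learn_bpe_merges(["low", "lower", "lowest", "new", "newer"], num_merges=5)
```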
BPE ensures that the most common words are represented as single tokens in the vocabulary, while rare and OOV words are broken down into smaller subword units that are shared with other words, allowing them to be more easily represented by the model. This makes BPE especially useful for languages with complex morphology, where words can take on a variety of inflections and conjugations. Moreover, the length of tokenized texts is shorter than with character-level tokenization, and the vocabulary size is smaller than with word-level tokenization [Mielke et al. 2021; Webster and Kit 1992].

3.5.2 Sequential Generation Task

The sequential generation task in NLP involves generating text one word or token at a time, in sequential order. The output generated at each time step is used as input for generating the next output. The model learns to predict the probability distribution of the next word given the previously generated words and some input context. Greedy search and beam search are two commonly used algorithms for generating sequences in NLP. Sequential generation is widely used in various NLP tasks such as language translation, text summarization, and dialogue systems [Lin et al. 2022; Kalyan et al. 2021; Allahyari et al. 2017].

Greedy Search

Greedy search is a simple algorithm for generating text sequences. It generates the next word in a sequence by selecting the word with the highest probability at each time step. The initial seed for greedy search can be a single word, phrase, or sentence. At each time step, the model generates a probability distribution over the vocabulary of possible subsequent words based on the context of the preceding words, selects the most probable word from this distribution, and appends it to the generated sequence. This process continues until a stopping criterion is met, such as a maximum sequence length or an end-of-sequence token. The resulting sequence is then returned as the final output.

One of the primary benefits of greedy search is its simplicity and efficiency. It is a straightforward algorithm that requires minimal computational resources, making it fast and easy to implement. There are, however, some disadvantages. Because it only considers the word with the highest probability at each time step, it can lead to suboptimal results, especially when the model makes an error early in the generation process, which can cause a cascade of errors [Meister et al. 2020a; Freitag and Al-Onaizan 2017; Yoo et al. 2020].

Beam Search

Beam search is an alternative to greedy search that keeps track of the top k most probable sequences at each time step, where k is a predefined beam size. The algorithm is comparable to greedy search, except that it chooses the top k words from the generated probability distribution based on their probabilities. The algorithm then generates k new candidate sequences by appending one of these words to each of the k existing sequences. The new candidate sequences are ranked according to their probabilities, and the top k sequences are retained for the subsequent time step. This process continues until a stopping criterion is met, such as a maximum sequence length or an end-of-sequence token. The candidate sequence with the highest probability is then returned as the final text.
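Greedy decoding can be sketched as follows; the model interface assumed here (returning next-token logits given the encoder inputs and the partial decoder sequence) and the argument names are illustrative assumptions, not the interface of any particular library.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, bos_id: int, eos_id: int, max_len: int = 64):
    """Greedy decoding sketch for an encoder-decoder model."""
    decoder_ids = torch.full((input_ids.size(0), 1), bos_id,
                             dtype=torch.long, device=input_ids.device)
    for _ in range(max_len):
        logits = model(input_ids, decoder_ids)                        # (B, t, V) next-token logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # highest-probability token
        decoder_ids = torch.cat([decoder_ids, next_token], dim=1)
        if (next_token == eos_id).all():                              # stop once every sequence ends
            break
    return decoder_ids
```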
Beam search has the advantage of exploring multiple options at each time step, which can result in higher output quality than greedy search. Beam search can handle situations where the most likely word at each time step does not lead to the best overall sequence: by keeping track of multiple candidate sequences, it covers a wider range of possible outcomes. Beam search is computationally more expensive than greedy search due to the need to keep track of multiple candidate sequences at each time step. It can also produce repetitive or redundant output, as it prefers highly probable paths. Various techniques have been proposed to modify the beam search algorithm to address this issue, such as length normalization, which adjusts the probabilities based on the length of the sequence, and diverse beam search, which encourages diversity in the output by penalizing redundant sequences [Meister et al. 2020a; Freitag and Al-Onaizan 2017; Yoo et al. 2020; Meister et al. 2020b].

Label Smoothing

Label smoothing is a regularization technique used to prevent models from becoming overconfident and to improve their generalizability. Label smoothing replaces hard 0/1 targets in the training data with soft targets that distribute probability mass to labels with comparable attributes. A smoothed distribution that gives more weight to similar labels is used in place of presenting a label as a one-hot vector with a single 1 and all other values 0. This enables the model to identify representations that a