MULTI-VIEW RANKING: TASKING TRANSFORMERS TO GENERATE AND VALIDATE SOLUTIONS TO MATH WORD PROBLEMS

Rifumo Mzimba (1619542)
School of Computer Science & Applied Mathematics, University of the Witwatersrand

Supervised by Prof. Richard Klein and Prof. Benjamin Rosman

October 29, 2023

A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, South Africa, in fulfillment of the requirements for the degree of Master of Science.

Abstract

The recent developments and success of the Transformer model have resulted in the creation of massive language models that have led to significant improvements in the comprehension of natural language. When fine-tuned for downstream natural language processing tasks with limited data, they achieve state-of-the-art performance. However, these robust models lack the ability to reason mathematically. It has been demonstrated that, when fine-tuned on the small-scale Math Word Problem (MWP) benchmark datasets, these models are not able to generalize. Therefore, to overcome this limitation, this study proposes to augment the generative objective used in the MWP task with complementary objectives that can assist the model in reasoning more deeply about the MWP task. Specifically, we propose a multi-view generation objective that allows the model to understand the generative task as an abstract syntax tree traversal, beyond the sequential generation task. In addition, we propose a complementary verification objective to enable the model to develop heuristics that can distinguish between correct and incorrect solutions. These two goals comprise our multi-view ranking (MVR) framework, in which the model is tasked to generate the prefix, infix, and postfix traversals for a given MWP, and then use the verification task to rank the generated expressions. Our experiments show that the verification objective is more effective at choosing the best expression than the widely used beam search. We further show that when our two objectives are used in conjunction, they can effectively guide our model to learn robust heuristics for the MWP task. In particular, we achieve an absolute percentage improvement of 9.7% over our baseline and 5.3% over the state-of-the-art models on the SVAMP dataset. Our source code can be found at https://github.com/ProxJ/msc-final.

Declaration

I, Rifumo Mzimba, hereby declare the contents of this dissertation to be my own work. It is submitted for the degree of Master of Science in Computer Science at the University of the Witwatersrand. This work has not been submitted to any other university or for any other degree.

Signature                Date

Acknowledgements

Completing this dissertation has been one of the most challenging and rewarding experiences of my academic career. To everyone who helped me along the way, I want to express my sincere gratitude and appreciation.

To begin, I'd like to express my gratitude to my advisors, Profs. Richard Klein and Benjamin Rosman, for their expertise, patience, and direction throughout my research. Their suggestions and criticisms helped me shape and improve my ideas, and their unflinching encouragement kept me going.

I would also like to thank my colleagues and friends who were there for me when I really needed their encouragement. A particular shout out to my professional colleagues who took on extra responsibilities to allow me to focus on writing this dissertation.
I owe a great debt of gratitude to my family for their unending love, encouragement, and patience throughout this process. I couldn't have gotten through this period without their unwavering love, encouragement, and support.

I would also like to extend my gratitude to the rest of the academic staff, who probed interesting questions and made interesting suggestions, the MSL team for providing computational resources, and the sponsors who funded my studies.

I would like to acknowledge the difficulties I faced during the research process. With ambitious plans for radical advances, many ideas were tried but proved fruitless. This experience has given me a newfound respect for academic research and taught me the value of tenacity, persistence, and asking for assistance when one is stuck.

As a final note, I'd like to express my appreciation to the academic community at large for the work they've done in the field. I have been able to shape my research and hone my arguments thanks to the plethora of literature and research I have had at my disposal.

In conclusion, I am immensely grateful to everyone who helped me along the way. Your support, encouragement, and insights have been invaluable, and I am truly humbled by your generosity and kindness. To God be the glory for giving me the fortitude, patience, and perseverance to finish this study.

Contents

Preface
  Abstract
  Declaration
  Acknowledgements
  Table of Contents
  List of Figures
  List of Tables
1 Introduction
2 Objective
  2.1 Research Objective
  2.2 Research Hypothesis
  2.3 Research Questions
    2.3.1 Research Questions for the Multi-View Generation Task
    2.3.2 Research Questions for the Verification Task
3 Background and Related Work
  3.1 Introduction
  3.2 Transformers
    3.2.1 Background (Recurrent Neural Networks)
    3.2.2 Architecture Overview
    3.2.3 Input Embedding
    3.2.4 Positional Encoding
    3.2.5 Attention Mechanism
    3.2.6 Position-wise Feedforward Neural Network (FNN)
    3.2.7 LayerNorm and Residual Connections
    3.2.8 Final layer
  3.3 Metric Learning
    3.3.1 Distance Metric
    3.3.2 Losses
    3.3.3 Miners
  3.4 Pooling
    3.4.1 Single Token pooling
    3.4.2 Max-pooling
    3.4.3 Average Pooling
    3.4.4 Attentive Pooling
    3.4.5 Mixed Pooling
  3.5 Natural Language Processing
    3.5.1 Tokenization
    3.5.2 Sequential Generation Task
  3.6 Learning Paradigms
    3.6.1 Transfer Learning
    3.6.2 Multi-task Learning
    3.6.3 Multi-view Learning
    3.6.4 Multi-Model Learning
  3.7 Math Word Problems
    3.7.1 Mathematics Laws
    3.7.2 Math Notations
  3.8 Related Work
    3.8.1 Encoder Architectures
    3.8.2 Decoder Architectures
  3.9 Conclusion
4 Data Representation and Multi-view Learning
  4.1 Introduction
  4.2 Related Work
    4.2.1 Mathematics Expression Notation
    4.2.2 Conditional Generation
    4.2.3 Quantity Masking
    4.2.4 Chain of Thought
    4.2.5 Generation Consistency
  4.3 Methodology
    4.3.1 Problem Statement
    4.3.2 Model Architecture
    4.3.3 Quantity Masking
    4.3.4 Solution Notation
    4.3.5 Generation Consistency
    4.3.6 Training
  4.4 Experiments
    4.4.1 Datasets
    4.4.2 Baseline
    4.4.3 Transformers
    4.4.4 Training Details
    4.4.5 Evaluation Metrics
    4.4.6 Overall Results
    4.4.7 Expression Notation
    4.4.8 Multi-view Notation
  4.5 Conclusion and Future Work
    4.5.1 Future Direction
5 Multi-tasking Multi-view Decoders
  5.1 Introduction
  5.2 Related Work
    5.2.1 Multi-tasking Models
    5.2.2 Multi-Decoder Models
    5.2.3 Verifiers
  5.3 Methodology
    5.3.1 Multi-view Generation
    5.3.2 Verification
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Baseline
    5.4.3 Experimental Setup
    5.4.4 Overall Results
    5.4.5 Ablation
    5.4.6 Multi-view and Verification
    5.4.7 Qualitative Analysis
  5.5 Conclusion
    5.5.1 Future Work
6 Conclusion
References

List of Figures
1.1 Many-to-many mapping between questions and solutions.
1.2 Impact of minor changes on questions.
3.1 Transformer architecture.
3.2 Transformer - multi-head attention architecture
3.3 An example of expression trees
4.1 An example of the proposed masking technique.
4.2 Examples of different quantity masking techniques.
4.3 The proposed token-level view-consistency strategy.
4.4 The proposed multi-view BART model.
4.5 Divergence of context representations during training
5.1 The proposed multi-view ranking (MVR) model.
5.2 Examples of expression tree templates

List of Tables
4.1 Notations and quantity masks used to solve math word problems.
4.2 Dataset statistics.
4.3 Overall results of the multi-view model compared to baseline models.
4.4 Performance of each view in the multi-view setup.
4.5 Impact of using a different number of views and view combinations.
5.1 Overall results of our Multi-view Ranker (MVR).
5.2 Ablation results for different model setups.
5.3 Model performance under different learning objectives.
5.4 Qualitative analysis of the models
Chapter 1
Introduction

Solving mathematical word problems is one of the fundamental math concepts taught in elementary school. Math Word Problems (MWPs) are mathematical questions that are primarily expressed in everyday language rather than using mathematical symbols. They frequently contain a scenario and a query related to the mathematical relationship between the quantities or objects in the scenario. Solving them requires cognition, language comprehension, working memory, and attention [Lin 2021; Verschaffel et al. 2020; Kula 2007].

MWPs are an interesting benchmark for testing the reasoning ability of language models, similar to how humans use them [Scheibling-Sève et al. 2020]. In particular, MWPs can prove useful in testing pre-trained language models that have demonstrated language comprehension similar to that of humans [Scheibling-Sève et al. 2020; Cobbe et al. 2021; Wei et al. 2022; He et al. 2020b].

The resolution of MWPs is a significant step towards the development of general intelligence [Fletcher 1985; Wei et al. 2022]. Furthermore, these models can be incorporated into virtual assistants, search engines, and Computer Algebra Systems to make these systems more robust and to help students develop problem-solving skills [Scheibling-Sève et al. 2020; Morton and Qu 2013; Jeschke and Richter 2007; Kapralov et al. 2020; Zhang et al. 2020a].

The ability of language models to solve symbolic math problems has been shown by numerous researchers, demonstrating their aptitude for mathematics [Lample and Charton 2019; Saxton et al. 2019; Noorbakhsh et al. 2021]. Symbolic math problems involve manipulating mathematical expressions using symbolic notation to find a solution. These are often used in advanced mathematics such as calculus, algebra, and differential equations. While MWPs require simpler arithmetic operations, they remain an open problem [Patel et al. 2021; Zhang et al. 2020a]. This indicates that the challenge MWPs present is textual comprehension rather than purely mathematical [Morton and Qu 2013].

Most NLP tasks require semantic comprehension; however, MWPs also need mathematical logic to arrive at the final solution. The mapping between the question and solution spaces is many-to-many and unrestricted. Questions that have different semantic meanings can have the same solution. Due to the laws of distribution and associativity in mathematical expressions, a single question can have multiple correct solutions. Figure 1.1 shows examples with significant semantic and structural differences that nevertheless have the same solution. The model must learn a feature space robust enough to be invariant to these semantic distinctions. It is also possible that a minor change in the problem statement could result in a completely different solution. For example, in Figure 1.2 (A), we can see that changing only one word, replacing "green" in the question with "red", leads to a different solution. However, substituting the word "red" with any color other than "green" will lead to the same solution. Similarly, a minor change in the solution expression may or may not yield a different answer. For example, swapping operands is permissible during addition and multiplication; however, it results in errors during subtraction and division [Sundaram et al. 2022].

Figure 1.1: Many-to-many mapping between questions and solutions. Questions have different contexts and semantics but the same ground truth solution. Questions extracted from SVAMP [Patel et al. 2021].

Figure 1.2: A minor change in the question can lead to a completely different solution. Questions extracted from SVAMP [Patel et al. 2021].

Given the challenges and utility of MWPs, efforts to automatically solve them date back to the 1960s [Bobrow 1964; Charniak 1969; Feigenbaum and Feldman 1963]. Manually crafted rules were developed to match the MWPs with the corresponding mathematical expression that resolves to the answer [Bakman 2007; Mukherjee and Garain 2008; Yuhui et al. 2010]. For these strategies to be generalizable, the crafted rules must be exhaustive enough to cover every possible MWP in an unrestricted language. This is not feasible in the real world, limiting the number of applications for such a system [Li et al. 2022a; Sundaram et al. 2022].

Subsequent research adopted statistical and machine learning techniques to automatically learn the rules relating an MWP to its solution [Kwiatkowski et al. 2013; Goldwasser and Roth 2014]. In this setup, an MWP dataset is used to teach a model the relationships between the query values and the background information of the math problems. The model generates the relations as mathematical equations that answer the query [Zhang et al. 2020a]. Although this configuration is an improvement over the initial strategy, the models can only solve a small number of problems that are highly similar to the training data.

The limitation of the statistical methods has led to the development of deep learning methods to solve MWPs [Wang et al. 2017]. Deep learning is capable of solving several mathematical reasoning challenges [Lample and Charton 2019; Saxton et al. 2019; Henighan et al. 2020; Charton et al. 2021]. However, the datasets used are relatively large in comparison to the datasets in the MWP domain. The cost of quality annotated MWP data is greater than that of synthetically generated symbolic mathematics [Upadhyay et al. 2016; Lample and Charton 2019; Saxton et al. 2019; Hendrycks et al. 2021; Patel et al. 2021].

To deal with limited data that have complex relations to be learned, transfer learning has been successfully applied in natural language processing (NLP) to achieve state-of-the-art (SOTA) performance on small datasets [Dai and Le 2015; Devlin et al. 2018; Lewis et al. 2019; Malte and Ratadiya 2019].
This learning paradigm has been shown to outperform models that are only trained directly on the MWP datasets [Sundaram et al. 2022; Lan et al. 2022]. While using these models, or parts thereof, increases performance, the models learn shallow heuristics that are sensitive to minor changes in the math problems [Patel et al. 2021; Shen et al. 2021].

A big leap forward in the application of transfer learning in NLP came with the Transformer architecture [Vaswani et al. 2017]. This allowed researchers to pre-train language models on the large datasets available. These models can further be fine-tuned on smaller datasets that exhibit similar properties to the pre-training objective. Particularly in the MWP domain, pre-trained language models (PLMs) have been shown to outperform models that are only trained directly on the MWP datasets [Sundaram et al. 2022; Lan et al. 2022; Patel et al. 2021]. Although PLMs elicit better performance, these models learn shallow heuristics that are sensitive to small changes in math problems [Patel et al. 2021].

Several strategies have been proposed to make deep learning models learn MWPs more effectively. In particular, the sequential encoding and decoding in pre-trained Transformers struggle to capture the structural information that MWPs exhibit. Graph encoders [Li et al. 2020; Zhang et al. 2020c], along with hierarchical tree decoders [Xie and Sun 2019] and bottom-up Directed Acyclic Graphs (DAGs) [Cao et al. 2021; Jie et al. 2022; Li et al. 2022a], introduce architectural modifications to incorporate the structural information originating from MWPs. The changes made to the encoder/decoder architecture of the PLMs enforce that the model learns the mathematical relationships between the quantities in a structural manner. This is something that the traditional sequential architecture does not explicitly require. The architectural modifications, however, are slower to train and too specialized for mathematical problems. The goal of this research is to see if the sequence-to-sequence Transformer can learn deeper heuristics and structural information by using learning objectives rather than specialized architectural changes.

Specifically, we propose a multi-view generation objective to learn structural information and a verification objective to learn deeper heuristics. Our multi-view generation objective is to learn the prefix, infix, and postfix notations of the ground-truth expression tree of a given MWP. We show that a model trained to generate multiple views of the abstract syntax tree (AST) demonstrates deeper heuristics than a model that is trained to generate only one view. Furthermore, the ability to generate multiple views that evaluate to the same solution shows that the model has some understanding of the structural properties of mathematical expressions. The multi-view objective has been proposed in the literature [Zhang et al. 2022a; Shen and Jin 2020]. These proposals use two views and different architectures for the decoders. In contrast to these, we use three views and sequential pre-trained decoders. Our verification objective, inspired by Shen et al. [2021], trains our model to mark MWP solutions as correct or incorrect. We show that a model that can generate and validate expressions has a deeper understanding of the MWP task and does not rely on shallow heuristics that generate tokens that could potentially yield the correct solutions without fully understanding the MWP.
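To make the three views concrete, the sketch below (our own illustration, not the dissertation's released code) builds the expression tree for ( number0 + number1 ) * number2 and emits its prefix, infix, and postfix traversals; the Node class and masked quantity names are assumptions used only for this example.

```python
# Minimal sketch (not the dissertation's implementation): one expression tree, three views.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def prefix(node):
    if node is None:
        return []
    return [node.value] + prefix(node.left) + prefix(node.right)

def infix(node):
    if node is None:
        return []
    if node.left is None and node.right is None:
        return [node.value]
    # Parenthesize every sub-expression to keep the traversal unambiguous.
    return ["("] + infix(node.left) + [node.value] + infix(node.right) + [")"]

def postfix(node):
    if node is None:
        return []
    return postfix(node.left) + postfix(node.right) + [node.value]

# ( number0 + number1 ) * number2
tree = Node("*", Node("+", Node("number0"), Node("number1")), Node("number2"))
print(" ".join(prefix(tree)))   # * + number0 number1 number2
print(" ".join(infix(tree)))    # ( ( number0 + number1 ) * number2 )
print(" ".join(postfix(tree)))  # number0 number1 + number2 *
```

All three strings describe the same tree and evaluate to the same value, which is exactly the structural consistency the multi-view objective asks the model to learn.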
With our multi-view generation objective and a verification objective, we propose a new multi-view ranking (MVR) framework, where we not only treat these two objectives as auxiliary tasks but further consider their output to choose the final solution. In particular, given an MWP, our multi-view generation task generates the prefix, infix, and postfix views of the AST solution, and we use the correctness score from the verification task to choose the best view.

We conducted several experiments throughout to show the effectiveness of our strategies. We implement a baseline BART model which achieves an accuracy of 40.6% on the SVAMP [Patel et al. 2021] dataset, while the SOTA accuracy is at 45.0%. Our implementation of the multi-view objective improves the baseline's generation accuracy by 2.3%, while the verification task improves it by 6.0%. Furthermore, we run experiments that show that using these two objectives together to generate the final solution results in an overall accuracy of 50.3%, an accuracy improvement of 9.7% over our baseline and 5.3% over the current SOTA results for SVAMP.

In summary, the contributions of this work are as follows:

• We propose a multi-view generation objective, where our model has to generate the prefix, infix, and postfix notations of the solution to MWPs. This objective exposes the model's misunderstanding of the MWP, which would otherwise be concealed by a single-view generation objective.

• We also propose refinements to the verification objective proposed by Shen et al. [2021], rephrasing it as a metric learning task rather than a classification task. This objective determines whether the model can distinguish between correct and incorrect solutions by embedding the correct solutions closer to the MWP than the incorrect solutions.

• We also propose a multi-view ranking framework that uses the proposed multi-view generation task to generate multiple candidate solutions and the verification task to choose the best candidate solution. This framework enables the model to produce a variety of potential solutions, which the verification task evaluates before choosing the best one.

• We conduct experiments that show that our proposed model is capable of correctly answering a greater number of MWPs than previous models that were considered to be SOTA.

Having outlined our main contributions, we give a brief overview of the structure of this work. Chapter 2 is dedicated to an in-depth discussion of the research objectives. This chapter unfolds as follows: Section 2.1 offers a comprehensive exploration of the research objectives, elucidating the core goals and intentions of this study. Following this, Section 2.2 formulates and presents the research hypothesis, laying the groundwork for the central thesis that this research endeavors to test and validate. Subsequently, in Section 2.3, we systematically outline the research questions that serve as the investigative pillars of this work. These questions provide a strategic framework for our inquiry, guiding us in our quest to unravel the complexities and nuances of our chosen subject matter.

Chapter 3 lays the theoretical foundations of our work. This chapter starts with a brief introduction in Section 3.1. We then discuss the Transformer architecture used for all of our experiments in Section 3.2. Sections 3.3 and 3.4 discuss metric learning and the pooling mechanism, respectively. These concepts are useful for our verification objective.
Thereafter, we detail NLP concepts and learning paradigms applied in our research in Sections 3.5 and 3.6, respectively. Section 3.7 briefly discusses MWPs, expression notations, and expression properties. Next, Section 3.8 relates the Transformer architecture to other architectures proposed in the literature. The background information is then summarized in Section 3.9.

We present our proposed multi-view and verification objectives in Chapters 4 and 5, respectively. In particular, Chapters 4 and 5 are structured to be as independent and self-contained as possible. These two chapters follow a similar structure: they start with an introduction and motivation section (Sections 4.1 and 5.1), followed by their respective related work (Sections 4.2 and 5.2); the methodology of each chapter then follows (Sections 4.3 and 5.3); the experiments and results are then discussed (Sections 4.4 and 5.4); and a conclusion section summarizes the chapter and discusses future work (Sections 4.5 and 5.5).

In particular, Chapter 4 aims to explore how data representation impacts the performance of Transformers in solving MWPs. It focuses on quantity masking and the expression notation used in the MWP domain. We propose two major changes: a random indicator mask and generating three notations/views instead of one. We find that the multi-view notation setup requires a verifier to realize its full potential. Chapter 5 sets out to investigate the verification objective and to integrate it with the multi-view objective, which results in our multi-view ranker (MVR) model.

Having presented ablation studies on the data representations, and our proposed model along with its benchmark performance, we conclude our work in Chapter 6. Furthermore, we outline the direction for future work based on ours.

Chapter 2
Objective

2.1 Research Objective

The primary aim of this research is to enrich our comprehension of the strategies employed to bolster the performance of pre-trained Transformers in the realm of logical tasks, with a specific focus on solving Math Word Problems (MWPs). The overarching objective is to augment the model's problem-solving abilities by introducing auxiliary learning tasks, thereby enhancing the heuristics acquired, while refraining from the necessity of devising specialized neural architectures. In pursuit of this objective, this study introduces two essential learning tasks: the multi-view generation task and the verification task.

The multi-view generation task is designed to enable the model to comprehend the abstract syntax tree (AST) underlying MWPs. This objective requires the model to perceive the solution not through a single sequential lens, but rather through three distinct views, which mirror the essential components of an AST. The model's ability to generate these three views not only indicates its understanding of the structural relationships within the problem but also its capacity to tackle MWPs with a more structural and semantic approach. Simultaneously, the verification task is introduced to help the model discern correct solutions from incorrect ones. This auxiliary task equips the model with the ability to identify multiple correct solutions to MWPs, which are rooted in the underlying mathematical principles governing the expressions.
2.2 Research Hypothesis

Our research is underpinned by the following hypothesis: the performance of the pre-trained BART model in solving MWPs can be significantly improved by integrating a multi-view generation and a verification objective. Moreover, these auxiliary tasks can be instrumental in generating the final solution, where the verification task serves as a mechanism for selecting the most suitable solution from the multiple generated views.

2.3 Research Questions

2.3.1 Research Questions for the Multi-View Generation Task

The core research questions addressed in Chapter 4 revolve around the multi-view generation task; these include:

• Which of the quantity masking techniques available in the existing literature proves most effective when applied to the pre-trained BART model?

• To what extent can the BART model autonomously generate the prefix, infix, and postfix views, both individually and collectively, and how does its performance compare in these scenarios?

• Is it more advantageous for the BART model to generate all three views in unison, or is it more beneficial to condition the model to generate a specific view based on a given prompt?

• Does training the BART model to generate consistent state vectors for terms in mathematical expressions that correspond to the same nodes in the AST result in improved performance?

2.3.2 Research Questions for the Verification Task

In Chapter 5, the research focus shifts to the verification task, and the following questions are explored:

• How does the verification objective perform as an auxiliary task when compared to the baseline BART model and the multi-view objective? Additionally, what is the combined performance of these two objectives?

• Can the verification task be harnessed to re-rank expressions generated through beam search by the BART model based on their correctness, thereby enhancing overall accuracy? How does this strategy compare to a voting mechanism that selects the most consistent solution from the beams?

• Does the treatment of the verification objective under a metric learning paradigm yield superior results compared to a classification paradigm in terms of performance and accuracy?

By addressing these research questions, this dissertation aims to advance the field's understanding of leveraging auxiliary tasks to improve the performance of pre-trained Transformers on logical tasks, with a specific focus on Math Word Problems.

Chapter 3
Background and Related Work

3.1 Introduction

In this chapter, we introduce the theoretical foundation for this thesis. We start by discussing the Transformer architecture as an advancement over the traditional sequence-to-sequence architecture in Section 3.2. This forms the basis of our model. Section 3.3 discusses metric learning and how it can be used to rank and learn better representations. This research uses metric learning for verification and to enforce consistency in the multi-view generation objective. Before we can pass the representations to the verifier, we need to pool from the model representations. Strategies that can be used for this are discussed in Section 3.4. Section 3.5 discusses key concepts of Natural Language Processing (NLP), including tokenization, greedy search, and beam search. Section 3.6 briefly discusses learning paradigms and how they can improve the model's ability to learn the task at hand. These paradigms include transfer learning, multi-task learning, multi-view learning, and multi-model learning.
Section 3.8 gives an overview of the various architectures found in the literature that solve MWPs. We discuss these architectures and learning objectives in relation to the Transformer. In Section 3.9 we briefly summarize the theoretical background presented in this chapter.

3.2 Transformers

The Transformer is an attention-based neural architecture designed to address sequence-to-sequence modeling tasks while also dealing with long-range dependencies. It is the current state-of-the-art technique in the fields of Natural Language Processing and Computer Vision [Lin et al. 2022; Khan et al. 2022; Dosovitskiy et al. 2020; Carion et al. 2020].

This section starts, in Section 3.2.1, with a discussion of the recurrent structures that were used to process sequential data before the introduction of the Transformer architecture. After that, we provide an architectural summary of the Transformer in Section 3.2.2. A detailed architectural breakdown of the critical parts related to our work is given in the subsections that follow. In particular, Section 3.2.3 and Section 3.2.4 discuss the input and positional embeddings, respectively. Thereafter, Section 3.2.5 discusses the attention mechanism used in Transformers.

3.2.1 Background (Recurrent Neural Networks)

Before the introduction of the Transformer, most state-of-the-art NLP solutions were based on gated recurrent neural networks (RNNs), such as Long Short-Term Memory networks (LSTMs) [Gers et al. 2000; Wen et al. 2015; Chen et al. 2016; Wang and Jiang 2015] and gated recurrent units (GRUs) [Cho et al. 2014; Chung et al. 2014; Kumar et al. 2016; Zhang et al. 2018]. Vanilla RNNs for sequence-to-sequence modeling encode the entire input into one context vector that the decoder uses to generate the desired output. This restricts the amount of information the decoder can use and, as a result, the performance of the RNNs. An attention mechanism was introduced to allow the decoder to create a context vector by choosing which encoder states to attend to at every time step [Wang et al. 2016b; Liu et al. 2017; Irie et al. 2016]. This improved the accuracy of RNNs, but not their training speed.

Transformers were introduced to improve the speed of sequential data modeling. They replace all the recurrent structures in RNNs with attention mechanisms. Vaswani et al. [2017] showed that, provided with enough training data, attention mechanisms alone can match the performance of RNNs with attention. This eliminated the need to sequentially process data during training and allowed researchers to efficiently use parallel computing architectures. As such, larger models can be trained on the large available datasets at a lower time complexity.

Sequential Processing

Feedforward Neural Networks (FNNs) are the first and simplest type of artificial neural network devised [Haykin 1998; Zell 1994; Schmidhuber 2015], where information flows in one direction only. However, natural language presents the challenge of modeling sequential data of uneven lengths. To resolve this, recurrent neural networks (RNNs) were introduced. These architectures use a feedback loop to enable data to be recirculated into the input before being sent again for additional processing and final output. The feedback loop allows RNNs to maintain a context vector that contains information on the data seen before the current step.
At step n, the model combines the context state representing the sequence up to step n − 1 with the input at step n to create a new context state that represents the sequence up to the current step n.

Challenges with RNNs

Theoretically, the information from each step can propagate arbitrarily far down the sequence if at every point the state continues to encode contextual information about the step. In practice, this mechanism is flawed: the vanishing gradient problem leaves the model's state at the end of a long sentence without precise, learnable information about the preceding sequence. The influence of the earlier steps on the later steps decreases as the sequence length increases. This causes vanilla RNNs to suffer from short-term memory.

Gated RNNs

A common way to deal with gradient instabilities is through gates. The gates control how the context state is updated at each step. Each step can either add to, delete from, or leave the context vector unchanged. This reduces the repeated multiplications caused by the feedback loop. As a result, gated RNNs can model longer-term dependencies. Two commonly used gated RNNs are the Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997; Gers et al. 2000] and the Gated Recurrent Unit (GRU) [Cho et al. 2014; Chung et al. 2014].

Unresolved challenges of gated RNNs

Gated RNNs achieved state-of-the-art results before the introduction of Transformers [Salehinejad et al. 2017]. However, they have several drawbacks. The computation of each step depends on the result of the computation of the previous step. This makes training RNNs inefficient, as they cannot be parallelized on modern deep-learning hardware. The other challenge with RNNs is that future input information cannot be reached from the current state. To resolve this, two RNNs are chained in opposite directions to form a bi-directional RNN [Bahdanau et al. 2014]. However, these RNNs are independent, which restricts the contextual information encoded for each step. The Transformer architecture addresses these challenges by replacing the recurrences in these networks with attention.

Attention

The attention mechanism allows a model to create a context vector by drawing from all states at any point along the sequence. The attention layer has access to all previous states and can weigh them using a learned relevance metric to provide pertinent information about distant tokens. Attention has been used with Recurrent Neural Networks (RNNs) and has been shown to improve performance [Wang et al. 2016b; Luong et al. 2015]. The development of the Transformer architecture showed that attention mechanisms could function effectively on their own, eliminating the need for sequential recurrent data processing to produce the performance seen in RNNs with attention. Transformers employ an attention mechanism without an RNN, processing all tokens simultaneously and allocating attention weights among them in layers. Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel at each layer, which leads to improved training speed.

Encoder-Decoder Model

The RNN and the Transformer output the hidden state corresponding to each input token. This means that the output length is equal to the input length. In most sequence-to-sequence problems, the desired output length often differs from the input length.
To handle this, the model can use a prompt token or end-of-string token to denote that the input sequence ends and that the model should start generating the corresponding target sequence. During training, the loss is computed for the tokens after the prompt. Similarly, during inference, the outputs before the prompt are disregarded. This architecture utilizes the same network to encode the input and generate the output sequence. This approach is adopted by models such as GPT [Radford et al. 2018, 2019].

Alternatively, two networks can be used to encode and generate the target sequence. This resembles the encoder-decoder architecture. Firstly, the encoder learns the embedding of each token in the context of the whole sequence. Thereafter, the decoder takes this sequence of embeddings, or a summary thereof, to generate the desired target sequence of tokens. The decoded sequence is conditioned on all the input tokens and the previously generated tokens. This variant is widely used for sequence-to-sequence problems [Mohamed et al. 2021; Tan et al. 2020].

3.2.2 Architecture Overview

The Transformer follows an encoder-decoder architecture that consists of embeddings, multi-headed attention, and FNNs. This is depicted in Figure 3.1. Given a sequence of tokens, the Transformer creates embedding vectors for each token and its position using an input embedding layer and a positional embedding layer. The embeddings are then passed through a series of encoder and decoder blocks on the encoder and decoder side, respectively. In contrast to RNNs, all inputs are fed into the network simultaneously rather than sequentially.

Figure 3.1: Transformer architecture. Adapted from https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html.

Each encoder and decoder block contains attention components that allow each processed token to be aware of relevant information from other tokens in the sequence. The decoder output is passed through a fully connected layer and a softmax function that converts it to a probability for each token in the vocabulary. Teacher forcing is used to train the decoder [Raffel et al. 2020], allowing training to occur in parallel. Since the decoder will not have access to future tokens during inference, an attention mask is used to hide these tokens so that the model does not attend to them during training.

3.2.3 Input Embedding

The first layer in the Transformer is the input, or word embedding, layer. First proposed in Bengio et al. [2000], the input embedding layer takes in a tokenized sequence of numbers over a fixed-size vocabulary Σ and maps these to vectors of continuous numbers. These vectors represent the learned meaning of each token, so tokens that are closer in the embedding space are expected to have similar meanings. The context, semantics, and syntactic properties, as well as the linear relationships between tokens, can be captured by vectors more efficiently than by the discrete entities in the vocabulary [Collobert and Weston 2008; Mikolov et al. 2013].

More formally, given a vocabulary Σ and a sequence of discrete tokens X = {x1, x2, ..., xN}, where xi ∈ Σ, the input embedding (IE) layer learns a mapping of each token to a d-dimensional real-valued vector, IE : N → R^d. The dimension of the embedding d, which represents the width of the Transformer, is a hyper-parameter that is normally chosen to be much smaller than the vocabulary size (d ≪ |Σ|) [Collobert and Weston 2008; Stahlberg 2020]. For instance, the Transformer vocabulary for language modeling typically has up to 50,000 tokens, although the embedding dimension is within the range of 512–1024.
A higher-dimensional embedding can capture fine-grained relationships between words but requires more data to learn. The study carried out by Wies et al. [2021] shows that the ratio between the dimension and the vocabulary size, together with the depth of the Transformer, can have an impact on performance. Furthermore, the dimension d should be divisible by the number of attention heads. Press and Wolf [2016] showed that there is a performance gain when the input embedding layer is shared between the decoder and the encoder where the input and output vocabularies are the same. As a result, this has been adopted in the design of the Transformer [Vaswani et al. 2017; Lewis et al. 2019; Raffel et al. 2020].

3.2.4 Positional Encoding

The input embedding gives the embedding of each token but does not account for the order or position of each token. However, the meaning can change depending on the order of the tokens. For example, "a mosquito bit John" and "John bit a mosquito" have the same tokens, yet they mean different things due to the order of the tokens. RNNs can implicitly learn positional information from their sequential processing. Differing from RNNs, Transformers process data in parallel using an attention mechanism that does not have a notion of order. As such, the Transformer explicitly adds positional information to the input embeddings through this module.

Positional encodings PE map the position of each token to a d-dimensional real-valued vector, PE : R → R^d. This is then summed with the input embedding, E = IE + PE, to produce embeddings with both positional and contextual information. Alternatively, these two vectors can be concatenated; however, this increases the memory footprint and slows down training, as the optimization process needs to adjust more parameters due to the extra dimensions.

The original implementation of the Transformer model used sine and cosine functions of different frequencies to add positional information, given by the following equation:

$$PE_{(pos,\,i)} = \begin{cases} \sin\left(pos/10000^{2i/d}\right) & \text{if } i = 2k \\ \cos\left(pos/10000^{2i/d}\right) & \text{if } i = 2k+1 \end{cases}$$

where pos denotes the position of the token in the input sequence, i is the index of each dimension in the output vector, and k is the offset. This positional encoding can capture absolute and relative positional information. Furthermore, it can scale to unseen sequence lengths. Vaswani et al. [2017] also experimented with learned positional embeddings [Gehring et al. 2017a] and found that the results are on par with the proposed sinusoidal encoding. Subsequent work [Devlin et al. 2018; Radford et al. 2018; Lewis et al. 2019] adopted the learned positional embedding. Wang and Chen [2020] provide an analysis of positional embeddings under different training objectives. The positional encoder is not shared between the encoder and the decoder, in contrast to the input embedding layer. The fully encoded input tokens E are then passed to the multi-headed self-attention layer.
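As a small illustration of the encoding defined above, the following sketch (our own, written with NumPy; the sequence length and dimension are arbitrary assumptions) builds the sinusoidal matrix and adds it to a stand-in input embedding, mirroring E = IE + PE.

```python
# Minimal sketch of sinusoidal positional encoding for a (seq_len, d) embedding matrix,
# following the equation above: sin for even dimensions, cos for odd dimensions.
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d)[None, :]                            # (1, d)
    angles = positions / np.power(10000.0, (2.0 * dims) / d)  # pos / 10000^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimension indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimension indices
    return pe

seq_len, d = 8, 16
token_embeddings = np.random.randn(seq_len, d)              # stand-in for the learned IE output
encoder_input = token_embeddings + positional_encoding(seq_len, d)  # E = IE + PE
print(encoder_input.shape)                                  # (8, 16)
```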
3.2.5 Attention Mechanism

The attention mechanism is a central component of the Transformer architecture, and its operation varies between the encoder, decoder, and the encoder-decoder model. The Transformer treats the input representations E as a set of key-value pairs. For each position j in E, the attention mechanism queries its relationship to all representations in E. This is known as the self-attention mechanism, since the queries and key-value pairs come from the same sequence. Self-attention is used in the encoder and decoder to learn the relationships between the inputs. Furthermore, the decoder has a cross-attention mechanism, where queries come from its inputs and key-value pairs come from the encoder.

More formally, the attention mechanism maps a query vector Q and a set of key-value vector pairs (K, V) to an output vector A. The output is generated by computing a weighted sum of the values. The weight assigned to each value is determined by a compatibility function of the query with the corresponding key. These weights are called attention scores [Vaswani et al. 2017]. Since Q, K, and V are d-dimensional, they can be organized into matrices Q ∈ R^{M×d}, K ∈ R^{N×d}, and V ∈ R^{N×d}, where M is the query sequence length, and M = N for self-attention.

Scaled Dot-Product Attention

The Transformer adopts a scaled version of the dot-product attention [Luong et al. 2015] to compute the attention scores. This is depicted in Figure 3.2b and can be denoted by the following equation:

$$A = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

The similarity between Q and K is computed as their dot product scaled by 1/√d. This is then passed through a softmax function to produce the attention scores. The dot product of the attention scores and V gives the output matrix A ∈ R^{M×d}. The scaling prevents the softmax function from giving values close to 1 for highly correlated vectors and values close to 0 for uncorrelated vectors, making gradients more reasonable for back-propagation. The softmax function normalizes over the columns of the scaled dot product of Q and K so that the weights for each query vector sum to one.

Multi-Head Attention

Instead of performing a single attention function with Q, V, and K, the Transformer performs the scaled dot-product attention h times using separate, identical attention layers. A single layer is called a head, and h is the number of heads in multi-head attention (MHA), represented in Figure 3.2a. For each head, learned linear projections are used to transform Q, K, and V to a d/h-dimensional space; these are then passed through the scaled dot-product attention in parallel, yielding h attention heads with d/h dimensions. These are concatenated and projected again to a d-dimensional matrix. The total computational complexity is comparable to that of single-head attention with full dimensionality because of the reduced dimension of each head. Formally, MHA can be defined as

$$\mathrm{MHA} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{A}$$

where W^A ∈ R^{d×d} and head_i = Attention(QW^Q_i, KW^K_i, VW^V_i), with W^Q_i, W^K_i, W^V_i ∈ R^{d×d/h} for i ∈ [1, h]. With MHA, the model can look at information from different representation subspaces at the same time. Voita et al. [2019] and Clark et al. [2019] provide more information on the different semantics captured by each head.
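The following sketch (our own NumPy illustration; the shapes and the additive-mask convention are assumptions rather than the dissertation's code) implements scaled dot-product attention as defined above, with an optional mask of the kind discussed in the next subsection.

```python
# Minimal sketch of scaled dot-product attention; an optional additive mask
# (0 for allowed positions, -inf for blocked positions) is applied before the softmax.
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (M, N) query-key compatibility
    if mask is not None:
        scores = scores + mask                                 # blocked entries become -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # (M, d) weighted sum of values

M = N = 4
d = 8
Q, K, V = (np.random.randn(M, d) for _ in range(3))

# Causal mask used by the decoder's self-attention: position t may only attend to positions <= t.
causal_mask = np.triu(np.full((M, N), -np.inf), k=1)
out = scaled_dot_product_attention(Q, K, V, causal_mask)
print(out.shape)                                               # (4, 8)
```

Multi-head attention would apply this same function h times to d/h-dimensional projections of Q, K, and V and concatenate the results, as described above.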
Masked Multi-Head Attention

The encoder and decoder have different requirements for attention due to their distinct roles in the Transformer model. While the encoder operates on an input sequence, the decoder generates an output sequence. To address this, the model uses masked multi-head attention. In the encoder, there is no need for masking, since all positions in the input sequence can be considered during the self-attention process. However, in the decoder, the self-attention mechanism must prevent information flow from future tokens to the current token, to maintain causality. This is achieved by applying an attention mask. When generating token ŷt at time step t, the model should only attend to the tokens preceding it, i.e., ŷ0, ŷ1, ..., ŷt−1. To enforce this constraint, the attention mask sets all future attentions (i.e., those for t+1, t+2, ..., T) to negative infinity in the input of the softmax. As a result, these future attentions receive zero attention scores and do not influence the current token's generation.

Beyond temporal masking, attention masks are also useful for cases where sequences in a batch have varying lengths. To handle this, special padding tokens are often added to sequences, and the model should be prevented from attending to these padding tokens. The attention mask serves this purpose as well.

Self-Attention

The Transformer employs self-attention [Lin et al. 2017b; Cheng et al. 2016; Parikh et al. 2016] in the encoder and a masked version thereof in the decoder. The representation of each position in the sequence is computed by relating it to every other unmasked position in the sequence. The Transformer depends entirely on the attention mechanism, as opposed to RNNs and Convolutional Neural Networks (CNNs) [Gehring et al. 2017b; Kalchbrenner et al. 2016]. Self-attention is free from the assumption made by RNNs and CNNs that the embedding of one token depends more on its neighboring tokens [Dou 2022].

Cross-Attention

The second MHA layer in the decoder applies cross-attention, where the queries are derived from the decoder and the key-value pairs from the encoder's final outputs. This module is the only connection between the encoder and the decoder and serves to mix the two embedding sequences. As such, the decoder has access to its previously generated tokens through self-attention and to the input tokens via this module.

Figure 3.2: Transformer multi-head attention architecture: (a) multi-head attention, (b) scaled dot-product attention. Adapted from Vaswani et al. [2017].

In summary, self-attention is used in the encoder to model relationships within the input sequence, while masked self-attention is applied in the decoder to ensure causal generation of output tokens. Cross-attention, on the other hand, bridges the gap between the encoder and the decoder, enabling the decoder to incorporate information from both its own generated tokens and the encoder's output sequence. This distinction in how attention is used between the encoder, decoder, and encoder-decoder model is a key feature of the Transformer architecture.

3.2.6 Position-wise Feedforward Neural Network (FNN)

At the end of each encoder/decoder layer, the Transformer inserts a fully connected feedforward network that is applied to each position separately and identically. It consists of two linear transformations with a non-linear activation in between, usually a ReLU [Nair and Hinton 2010] or GeLU [Hendrycks and Gimpel 2016]. This incorporates non-linearity into the model explicitly and strengthens the model capacity [Kim 2022].
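A minimal sketch of this position-wise block is shown below (our own NumPy illustration; the inner dimension d_ff = 4d is a common choice and an assumption, as the text does not specify it here).

```python
# Minimal sketch of the position-wise feedforward block: two linear maps with a
# non-linearity in between, applied independently to every position in the sequence.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def position_wise_ffn(X, W1, b1, W2, b2):
    # X: (seq_len, d); the same weights are shared across all positions.
    return relu(X @ W1 + b1) @ W2 + b2

seq_len, d, d_ff = 8, 16, 64                      # d_ff = 4 * d is an assumed, common choice
X = np.random.randn(seq_len, d)
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)  # (8, 16)
```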
3.2.7 LayerNorm and Residual Connections

To improve the stability of training, the "Add & Norm" module applies a residual connection [He et al. 2016] and layer normalization [Ba et al. 2016] to the output of the MHA module and the feedforward module. Residual connections add the input of a module to its output, improving gradient propagation. This addresses the problem of deterioration of the training accuracy as more layers are added to the neural architecture [He et al. 2016]. Layer normalization adjusts the mean and standard deviation of the output of a layer, which can accelerate training. The output of the "Add & Norm" module can thus be defined as:

$$\mathrm{LayerNorm}(Z_R) = \frac{Z_R - \mathbb{E}[Z_R]}{\sqrt{\mathrm{Var}[Z_R] + \epsilon}} \cdot \gamma + \beta$$

where γ, β are learnable parameters, Z_R = Z + Z_O is the residual output with Z being the input to each sublayer (MHA or feedforward) and Z_O the output from these sublayers. A small value ϵ is added in the denominator for numerical stability. The mean E[Z_R] and variance Var[Z_R] are respectively given by:

$$\mathbb{E}[Z_R] = \frac{1}{N}\sum_{i=1}^{N} Z_R^{(i)}, \qquad \mathrm{Var}[Z_R] = \frac{1}{N}\sum_{i=1}^{N}\left(Z_R^{(i)} - \mathbb{E}[Z_R]\right)^2.$$

3.2.8 Final layer

The output of the final decoder layer passes through a final layer, which acts as a classifier/output embedding layer. This maps the final hidden states of the decoder from d dimensions back to the vocabulary size. The classifier output is then fed into a softmax layer, which produces probability scores between 0 and 1. The index of the highest probability score is taken as the predicted token. Following the work of Press and Wolf [2016], the output embedding layer is tied to the input embedding layer, which reduces the number of parameters and also improves performance.
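As an illustration of the tied output layer described above, the sketch below (ours; the shapes and the greedy argmax step are assumptions) reuses the input embedding matrix as the output projection.

```python
# Minimal sketch of the final layer with tied input/output embeddings: the decoder
# states are projected onto the vocabulary by the transposed embedding matrix.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab_size, d, seq_len = 100, 16, 5
embedding = np.random.randn(vocab_size, d)       # shared by the input lookup and the output projection
decoder_states = np.random.randn(seq_len, d)     # final hidden states of the decoder

logits = decoder_states @ embedding.T            # (seq_len, vocab_size)
probs = softmax(logits)                          # probability over the vocabulary per position
predicted_tokens = probs.argmax(axis=-1)         # greedy choice of the most likely token
print(predicted_tokens.shape)                    # (5,)
```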
• Symmetry: The distance from x_i to x_j is always the same as the distance from x_j to x_i, i.e., D(x_i, x_j) = D(x_j, x_i).

• Triangular inequality: The distance between two points cannot be greater than the sum of the distances through any third point, D(x_i, x_j) ≤ D(x_i, x_h) + D(x_h, x_j), ∀x_i, x_j, x_h ∈ X.

One other property of distance metrics not mentioned above is the identity of indiscernibles, which states that if D(x_i, x_j) = 0, then x_i = x_j. This property does not need to hold for metric learning; as such, we consider pseudo-distances [Collatz 2014]. Furthermore, for the purpose of training our models, it is usually sufficient to calculate some transformation of the distance metric that still gives us a similarity measure between x_i and x_j.

Euclidean Distance

The Euclidean distance is one of the most common distance measures. Also known as the L2 metric, it can be described as the length of a line segment between two points in Euclidean space. Using the Pythagorean theorem [Weisstein 2006], the square of the Euclidean distance can be calculated from the Cartesian coordinates of the points:

$$D(x_i, x_j) = \lVert x_i - x_j \rVert^2$$

The square of the Euclidean distance is often used since it avoids computing the square root, making it more efficient from a computational perspective. Although it is not strictly a distance, it retains the most useful properties of a distance metric from a learning perspective. The Euclidean distance is not scale invariant, which can result in skewed distances; as such, it is beneficial to standardize/normalize the data before use. Moreover, as the dimensionality d increases, the Euclidean distance becomes less meaningful. It also fails to capture non-linearity in the data since the distance is isotropic, i.e., the same in every direction.

The Mahalanobis distance is a more general, non-isotropic case of the Euclidean distance, given by D(x_i, x_j) = ||Wx_i − Wx_j||^2, where W is a linear transformation matrix found by decomposing the inverse of the positive-definite covariance matrix M = W^TW of a given probability distribution. For the Euclidean distance, W is the identity matrix. This distance is prone to overfitting due to its dependence on the mean, and the linear transformation W is not always able to capture the inherent non-linearity of the data.

Cosine Distance

The cosine distance uses cosine similarity to determine the distance between data points. Cosine similarity is defined as the cosine of the angle between two vectors in an inner product space, which can be computed as the dot product of the two vectors normalized by their magnitudes:

$$\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert\, \lVert x_j \rVert}$$

Cosine similarity ranges between −1 and 1, where 1 means the angle between x_i and x_j is 0, i.e., they are similar. To ensure that the resulting measure is non-negative, we can define the cosine distance as:

$$D(x_i, x_j) = 1 - \mathrm{sim}(x_i, x_j)$$

The cosine distance measures the dissimilarity between x_i and x_j. Since the cosine similarity is computed in Euclidean space, we can derive the Euclidean distance equivalent to the cosine distance. When x_i and x_j are normalized to unit vectors such that ||x_i|| = ||x_j|| = 1, we have:

$$\lVert x_i - x_j \rVert^2 = \lVert x_i \rVert^2 + \lVert x_j \rVert^2 - 2\, x_i \cdot x_j = 2\,(1 - \mathrm{sim}(x_i, x_j))$$

As we can see, the cosine distance is given by \frac{\lVert x_i - x_j \rVert^2}{2}. While the cosine distance is not a metric itself, it is (up to a constant factor) the square of the Euclidean distance, and the square root thereof produces a distance metric [Schubert 2021]. This suffices for metric learning.
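The relationship between the squared Euclidean distance and the cosine distance on unit vectors can be checked numerically; the snippet below is a small illustrative sketch rather than code from this project.

```python
import torch
import torch.nn.functional as F

def squared_euclidean(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return ((a - b) ** 2).sum(dim=-1)

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

x, y = torch.randn(4, 16), torch.randn(4, 16)
xn, yn = F.normalize(x, dim=-1), F.normalize(y, dim=-1)   # unit vectors
# On unit vectors, ||x - y||^2 == 2 * (1 - cos_sim(x, y))
assert torch.allclose(squared_euclidean(xn, yn), 2 * cosine_distance(xn, yn), atol=1e-5)
```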
From a computational perspective, the dot product is faster because it saves one vector subtraction and does not suffer from "catastrophic cancellation" for small distances [Schubert and Gertz 2018; Lang and Schubert 2020]. As such, it is more efficient to compute the Euclidean distance as a dot product. The unnormalized Euclidean distance is commonly used on dense, continuous variables where dimensional magnitudes matter, whereas the normalized Euclidean distance is commonly used on sparse, discrete domains such as text, since it is more robust to variations in occurrence counts between semantically related terms.

Similar to the Euclidean distance, the cosine distance suffers from the curse of dimensionality, though to a lesser extent. As the dimensionality increases, the number of data points required for good performance of any machine learning algorithm increases exponentially. Chen et al. [2020b] found that projecting the embedding vectors F_θ(·) to a lower dimension before applying the distance measures produces better results. In addition, mining methods and loss functions are designed to be more robust to this effect.

3.3.2 Losses

The loss function is an objective function that our model learns to minimize or maximize. This function produces a scalar value that indicates how well the model fits the data; therefore, improvements in this number correspond to a better model [Goodfellow et al. 2016; Reed and Marks II 1999]. The metric learning objective predicts relative distances between inputs, which differs from the objective functions of classification and regression problems, whose goal is to predict a label, a value, or a set of values given an input.

Standard regression and classification algorithms can be categorized as point-wise loss functions, where a single data point is considered at a time by the loss function. Metric learning, on the other hand, considers a pair or a list of data points at a time. That is, the metric learning objective trains the model F_θ to produce similar representations by minimizing the embedding distance when the data points have the same label; otherwise, the model should produce distant representations, i.e., maximize the embedding distance.

Pairwise Loss

The pairwise loss is one of the first metric learning losses proposed [Chopra et al. 2005; Hadsell et al. 2006]. It follows a simple and intuitive objective that directly minimizes the pairwise distance between two samples, x_1 and x_2, with the following formulation:

$$\mathcal{L} = \mathbb{I}_{y_1 = y_2}\, D(x_1, x_2) + \mathbb{I}_{y_1 \neq y_2}\, \max\left(0,\; m - D(x_1, x_2)\right)$$

where m > 0 is a hyper-parameter margin defining the lower bound on the distance between samples of different labels. When the representations produced for a negative pair are sufficiently distant, no effort is wasted in enlarging that distance, so further training can focus on more difficult pairs.

Triplet Loss

The pairwise loss uses two samples; however, more samples can be used. For instance, Wang et al. [2014], Schroff et al. [2015], Cui et al. [2016] and Hermans et al. [2017] use triplets instead of pairs and report improvements in performance. Triplet-based losses consist of an anchor x, a positive sample x^+, and a negative sample x^−. The triplet loss learns a distance metric by which the anchor is further from the negative sample than it is from the positive sample by a margin, D(x, x^−) > D(x, x^+) + m. The loss can be formulated as:

$$\mathcal{L} = \max\left(0,\; m + D(x, x^+) - D(x, x^-)\right)$$

where m is the margin.
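For illustration, the pairwise (contrastive) and triplet objectives above can be sketched over precomputed distances as follows; the function names and margin values are assumptions made for the example.

```python
import torch

def pairwise_loss(d: torch.Tensor, same_label: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    # d: distances for a batch of pairs; same_label: boolean tensor of the same shape
    pos = same_label.float() * d                                   # pull positives together
    neg = (~same_label).float() * torch.clamp(margin - d, min=0.0) # push negatives beyond the margin
    return (pos + neg).mean()

def triplet_loss(d_ap: torch.Tensor, d_an: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    # d_ap: anchor-positive distances; d_an: anchor-negative distances
    return torch.clamp(margin + d_ap - d_an, min=0.0).mean()
```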
The margin prevents a trivial solution to the loss function, which is found when both distances are equal to zero, D(x, x^+) = D(x, x^−) = 0. Since the network is updated with both similar and dissimilar samples, the triplet loss requires fewer samples for convergence. However, the performance of the triplet loss is highly dependent on the sampling process of x^+ and x^−. If the positive is already closer to the anchor than the negative, the model will learn almost nothing since the requirement on the relative distance of the embeddings is already satisfied; however, if the sampled positive is far away from the anchor, or a negative is very close to it, the model improves much more during each training step [Kaya and Bilge 2019].

The triplet loss only compares a sample to one negative sample during an update, ignoring the other negative samples. The embedding vector for a sample is only guaranteed to be far from the selected negative sample, not from all other negatives. Thus, an example can be distinguishable from a few negative samples while remaining close to many other negatives. After looping over enough randomly sampled triplets, the final distance metric should be balanced; however, individual updates may be unstable and converge slowly [Sohn 2016; Weng 2021].

N-pair Loss

Sohn [2016] proposed to extend the triplet loss to include a comparison with multiple negative samples. The proposed N-pair loss uses one anchor x, one positive sample x^+, and N − 1 negative samples {x_i^−}_{i=1}^{N−1}. If N = 2, the N-pair loss reduces to the triplet loss. Mathematically, this can be denoted as follows:

$$\mathcal{L} = \log\left(1 + \sum_{i=1}^{N-1} \exp\left(D(x, x_i^-) - D(x, x^+)\right)\right) = -\log \frac{\exp(D(x, x^+))}{\exp(D(x, x^+)) + \sum_{i=1}^{N-1} \exp(D(x, x_i^-))}$$

The form of the N-pair loss is similar to the softmax loss. Although the softmax loss may assist the model in converging to the optima, it does not explicitly keep positive pairs close and negative pairs far apart in the way that metric losses do; it only learns relatively indistinct features. As a result, several methods combine metric learning with the softmax loss. Several proposals [Goldberger et al. 2004; He et al. 2020a; Oord et al. 2018; Gutmann and Hyvärinen 2010 2012] follow the same structured loss function with different distance measures D. Increasing the number of negative samples N has been found to improve the contrastive power of the model [Khosla et al. 2020; He et al. 2020a; Chen et al. 2020b; Tian et al. 2020; Henaff 2020].

Supervised Contrastive Learning

A more generalized form of the N-pair loss was proposed by Khosla et al. [2020], which caters to multiple positive samples in addition to multiple negatives. The proposed SupCon loss aims to address the weaknesses of the N-pair loss and the softmax function. The heavy dependence on hard mining of the N-pair loss is minimized by using many positives and many negatives for each anchor. Moreover, while the softmax loss has been shown to outperform the N-pair loss for classification tasks [Horiguchi et al. 2019], it is prone to producing poor margins. The SupCon loss is designed so that it can effectively use the ground truth targets to embed samples with the same label closer than samples with different labels. The mathematical formulation is given by

$$\mathcal{L} = \sum_{i=1}^{2N} \frac{-1}{2|N_i| - 1} \sum_{j \in N_i,\, j \neq i} \log \frac{\exp(D(x_i, x_j^+))}{\sum_{k \in I,\, k \neq i} \exp(D(x_i, x_k^-))}$$

where N_i = {j ∈ I : ỹ_i = ỹ_j} is the set of sample indices sharing the label of x_i and |N_i| is its cardinality. Khosla et al.
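The following is a minimal sketch of a SupCon-style loss, assuming dot-product similarity on L2-normalized embeddings with a temperature; it treats every other same-label sample in the batch as a positive and is intended only to illustrate the structure of the loss, not to reproduce any particular implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.1):
    """Supervised-contrastive sketch: average log-probability of positives per anchor."""
    z = F.normalize(embeddings, dim=-1)                         # (B, d) unit vectors
    sim = z @ z.T / temperature                                 # pairwise similarities
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))             # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log-softmax over all other samples
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts       # per-anchor loss
    return loss[pos_mask.sum(dim=1) > 0].mean()                 # anchors without positives are skipped

# usage sketch: supcon_loss(torch.randn(8, 32), torch.randint(0, 3, (8,)))
```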
[2020] found that including more positive samples in the set N_i leads to better results. Furthermore, SupCon was found to be less sensitive to hyper-parameter changes and noise [Khosla et al. 2020].

Multi-Similarity Loss

The drawbacks of conventional loss functions such as the triplet loss and the contrastive loss, which sometimes lack discriminative power when applied to large datasets, are addressed by the multi-similarity (MS) loss [Wang et al. 2019b]. Traditional loss functions, which are limited in their ability to capture complex similarity relationships in high-dimensional feature spaces, only consider pairwise similarity relationships between samples. The idea behind the MS loss is to measure similarity in more than one direction, as opposed to just one or two, as is the case with conventional loss functions. The loss function then encourages the model to learn a feature representation that maximizes the similarity between positive examples and minimizes the similarity between negative examples:

$$\mathcal{L} = \frac{1}{|X^+ \cup X^-|}\left[\frac{1}{\alpha}\log\Big(1 + \sum_{(x, x^+) \in X^+} \exp\big(-\alpha\,(D(x, x^+) - m)\big)\Big) + \frac{1}{\beta}\log\Big(1 + \sum_{(x, x^-) \in X^-} \exp\big(\beta\,(D(x, x^-) - m)\big)\Big)\right]$$

where α and β are hyper-parameters and m is a margin.

3.3.3 Miners

Deep metric learning relies on the ability to learn meaningful representations of data, where the similarity or dissimilarity between data points is crucial. Unlike conventional classification or regression tasks, in which random sampling of data points often suffices, deep metric learning requires a more nuanced approach to sample selection. The reason lies in the fundamental objective of the model: instead of predicting labels, it aims to understand the relationships between data points, learning to push similar items closer and dissimilar ones apart [Bucher et al. 2016; Kaya and Bilge 2019; Wang et al. 2019b]. The effectiveness of a deep metric learning network depends not only on the quality of its mathematical models and architectures but also on the discriminative power of the samples it encounters during training. To facilitate the network's learning and enhance the quality of the learned representations, it is imperative to provide it with training instances that challenge its ability to distinguish between data points.

In this context, sample selection plays a pivotal role as a preprocessing step before applying the deep metric learning model. Selecting the right training examples to feed the network is akin to setting the stage for the model's success [Schroff et al. 2015]. To train the model effectively to learn a meaningful distance metric, it is essential to sample pairs of positive and negative data points from the datasets. This process of sample selection and pair creation is often referred to as "mining". The importance of mining in the context of deep metric learning has garnered considerable attention in the research community, with various techniques and strategies proposed to optimize the learning process [Wang et al. 2018a 2014; Wu et al. 2017; Hermans et al. 2017; Schroff et al. 2015; Sikaroudi et al. 2020].

Anchor, Positive and Negative Samples

Let X be the set of examples and Y be the set of corresponding labels, denoted as {(x_i, y_i)}. Within this context, we define the sets of pairs as follows. The positive pair set, denoted X^+, consists of pairs whose labels match, that is, X^+ = {(x, x^+) ∈ X × X | y = y^+}. The negative pair set, denoted X^−, comprises pairs with differing labels, which can be represented as X^− = {(x, x^−) ∈ X × X | y ≠ y^−}.
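As a rough illustration, the MS loss for a single anchor can be sketched over its positive and negative cosine similarities, following the similarity-based form of Wang et al. [2019b]; the averaging over anchors and the accompanying mining step are omitted here, and the parameter values are illustrative assumptions only.

```python
import torch

def ms_loss_single_anchor(sim_pos: torch.Tensor, sim_neg: torch.Tensor,
                          alpha: float = 2.0, beta: float = 50.0, thresh: float = 0.5):
    """Multi-similarity-style loss for one anchor, given its positive and negative
    cosine similarities (a sketch of the similarity-based formulation)."""
    pos_term = torch.log1p(torch.exp(-alpha * (sim_pos - thresh)).sum()) / alpha
    neg_term = torch.log1p(torch.exp(beta * (sim_neg - thresh)).sum()) / beta
    return pos_term + neg_term
```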
In this context, the samples x, x^+, and x^− are typically referred to as the anchor, positive, and negative samples. The primary objective is to train the model F_θ such that, for every anchor, the positive sample embeddings, as measured by some distance function D, are closer to the anchor than the negative sample embeddings by a margin m. Formally, this can be expressed as:

$$D(x, x^+) + m < D(x, x^-), \quad \forall x \in X,\; (x, x^+) \in X^+,\; (x, x^-) \in X^- \tag{3.1}$$

Easy, Semi-hard and Hard Samples

Given an anchor, positive, and negative sample combination (x, x^+, x^−), the inequality in Equation (3.1) may or may not be violated. When the inequality is not violated, the positive and negative samples (x^+, x^−) are considered easy. Since the model already gets them right, easy samples do not improve the network when trained on. Semi-hard samples occur when the inequality in Equation (3.1) is violated only because of the margin m, that is, D(x, x^+) < D(x, x^−) < D(x, x^+) + m. In other words, the negative sample is within a margin's distance of the positive. When the negative sample is closer to the anchor than the positive sample, (x^+, x^−) is considered a hard sample. This is denoted as D(x, x^−) < D(x, x^+). Semi-hard and hard samples represent samples that the model has not yet mastered. Therefore, we need to mine these samples so that we can train the model on them.

Offline Mining

Mining for hard samples can be done offline, before the batch is created. The whole training set is passed through the model in order to obtain the model embeddings. Subsequently, the distances are computed, and the anchor, positive, and negative sets (x, x^+, x^−) are dropped if they are easy. Different implementations keep only the hardest, hard, or semi-hard samples, or a combination of these. The remaining valid sets are divided into K batches. In this way, offline mining creates batches that contain the most informative samples.

This technique computes the embeddings of the full training set to generate batch samples, which incurs memory and computation overhead. Moreover, the set of valid samples can change after each gradient update. This means that the offline-mined sets need to be updated frequently, after several steps or at the end of every epoch, to avoid overfitting, which renders offline mining inefficient. Offline mining can be adopted when the datasets are small or when the hardness of the samples can be computed independently of the model embeddings, where these drawbacks are less dire [Prétet et al. 2020].

Online Mining

Alternatively, mining can be done online, within each batch. First introduced by Schroff et al. [2015], online mining aims to reduce the computation and memory burden of offline mining by only considering the positive and negative samples in the sampled batch. Furthermore, this method uses the up-to-date weights of the model to compute the embeddings. For each class label, n samples are selected for every batch. These provide the anchors and positive samples; every other sample in the batch with a different label is considered a negative sample. This will generate easy, semi-hard, and hard samples. All valid triplets or only the hardest samples are used in the loss function to update the weights, where the latter often performs best [Schroff et al. 2015; Hermans et al. 2017]. The online option is the preferable choice for larger datasets, since it can be computationally prohibitive to compute and sort the pair-wise distances between all samples [Schroff et al. 2015; Musgrave et al. 2020].
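A small illustrative helper that classifies triplets into easy, semi-hard, and hard according to Equation (3.1) might look as follows; the function and variable names are assumptions made for the example.

```python
import torch

def triplet_hardness(d_ap: torch.Tensor, d_an: torch.Tensor, margin: float) -> list[str]:
    """Label each (anchor, positive, negative) triplet from its two distances."""
    labels = []
    for ap, an in zip(d_ap.tolist(), d_an.tolist()):
        if an < ap:                      # negative closer to the anchor than the positive
            labels.append("hard")
        elif an < ap + margin:           # Equation (3.1) violated only through the margin
            labels.append("semi-hard")
        else:                            # constraint satisfied: contributes no gradient
            labels.append("easy")
    return labels
```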
Online mining has been widely adopted in the literature [Wang et al. 2019c; Liu et al. 2019a; Lee et al. 2018; Chen et al. 2020a]. However, online mining requires large batch sizes to include enough distinct negative data to push the model to acquire representations that distinguish the various examples [Oh Song et al. 2017; Kaya and Bilge 2019].

Online Batch Hard Mining

In online batch hard mining, we seek the hardest positive and hardest negative for each anchor. A positive sample x^+_h and a negative sample x^−_h are said to be the hardest if the following holds:

$$x^+_h = \underset{(x, x^+) \in X_B^+}{\arg\max}\; D(x, x^+), \qquad x^-_h = \underset{(x, x^-) \in X_B^-}{\arg\min}\; D(x, x^-)$$

where X^+_B ⊆ X^+ and X^−_B ⊆ X^− are the subsets of the positive and negative sets within the current batch. Mining for the hardest samples is computationally taxing, although relatively less so than doing it offline. Furthermore, the hardest sample in online mining is relative to the batch, which makes it a moderately hard sample overall and less prone to causing the model to overfit. Hard sampling provides the biggest gradient updates, which can speed up convergence. However, mining very hard samples early during training can cause the model to get stuck in local minima and, consequently, lead to model collapse [Hermans et al. 2017; Harwood et al. 2017; Wu et al. 2017]. Semi-hard sampling avoids overfitting on outliers and improves robustness; however, only a few semi-hard samples can be generated, requiring large batch sizes. A curriculum in which harder samples are gradually introduced was found to perform better than semi-hard mining [Harwood et al. 2017].

Online Batch All Mining

Rather than mining for the hardest samples, all valid combinations can be considered for the loss computation. This strategy is known as batch-all mining. For B samples in a batch X_B with n samples per label, taking all positive pairs X^+_B = {(x, x^+) ∈ X^+ | (x, x^+) ∈ X_B} and negative pairs X^−_B = {(x, x^−) ∈ X^− | (x, x^−) ∈ X_B}, we have a total of B(B − n)(n − 1) triplets that contribute to the loss.

The batch-all approach results in less computational overhead than batch-hard online mining. However, it was found to perform consistently worse [Ding et al. 2015; Wang et al. 2016a]. Hermans et al. [2017] found that the inclusion of easy triplets was the major cause, and that only considering combinations that yield a non-zero loss improves performance. Khosla et al. [2020] and Wang et al. [2019b] propose loss functions that intrinsically perform hard mining by down-weighing the contribution of easy samples.

Multi-Similarity Miner

Wang et al. [2019b] proposed a mining technique in which negatives are selected if they are closer to an anchor than its most difficult positive, and positives are selected if they are further away from an anchor than its most difficult negative. More formally, for positive samples X^+_B ⊆ X^+ and negative samples X^−_B ⊆ X^− in a batch, and a margin m, (x, x^+) and (x, x^−) are mined for the loss if the following holds:

$$D(x, x^-) > \min_{(x, x^+) \in X_B^+} D(x, x^+) - m, \qquad D(x, x^+) < \max_{(x, x^-) \in X_B^-} D(x, x^-) + m$$

3.4 Pooling

Processing sequential data yields the same number of embedding vectors as there are tokens in the input sequence. That is, the embeddings are at the token level, with a variable length.
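For illustration, batch-hard mining can be sketched as follows: for each anchor in the batch, the farthest same-label sample and the closest different-label sample are selected. The helper below is a sketch that assumes each label occurs at least twice in the batch; the names are chosen for the example.

```python
import torch

def batch_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor):
    """Per anchor: hardest positive (farthest same-label) and hardest negative (closest other-label)."""
    dist = torch.cdist(embeddings, embeddings)                  # (B, B) pairwise distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)

    pos_mask = same & ~eye
    neg_mask = ~same
    hardest_pos = (dist * pos_mask).max(dim=1).values           # farthest positive per anchor
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values
    return hardest_pos, hardest_neg                             # feed these into a triplet loss
```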
It is often desirable to distill the information encoded by the individual token embedding vectors into one fixed-length vector that represents the context of the sequence, that is, a sequence embedding. The context vector can then be used for classification and metric learning tasks such as sentiment analysis, natural language inference, and text similarity [Farouk 2019; Kowsari et al. 2019; Williams et al. 2017; Reimers and Gurevych 2019].

Pooling is a procedure that can be used to produce such a representative context vector. Several pooling methods can be used, such as simply taking the average of all the token embeddings or taking the maximum value in each dimension across all tokens. Reimers and Gurevych [2019] found that the significance of the pooling procedure depends on the objective function and the datasets used. As a result, it is worth experimenting with different pooling strategies. In the following subsections, we discuss several pooling methods that can be used to create a context vector: single token pooling (Section 3.4.1), max-pooling (Section 3.4.2), average pooling (Section 3.4.3), attentive pooling (Section 3.4.4), and mixed pooling (Section 3.4.5).

3.4.1 Single Token Pooling

Early sequential processing techniques, such as RNNs, produce a context vector at the end of a sequence that stores the information of all the tokens seen. As such, for a sequence wrapped in two special tokens, a beginning-of-string token bos at the start and an end-of-string token eos at the end, the context vector in RNNs is the hidden state at the eos token. Transformers, on the other hand, do not have this leverage. Devlin et al. [2018] train their proposed Transformer architectures to produce a context vector through the bos token. The bos token, also called the CLS token, is more consistent than the eos token since it occurs at the same position every time. It performs well for the classification objective under which it was trained; however, it produces poor results for metric learning tasks [Reimers and Gurevych 2019]. Furthermore, pooling the bos token of models that were not pre-trained to produce a context vector through this token can lead to poor performance.

3.4.2 Max-pooling

Another common strategy to produce a compressed representation is to pool the maximum value over time for every dimension [Kalchbrenner et al. 2014; Joulin et al. 2016]. Formally known as max-pooling [LeCun et al. 1998], this method extracts the most prominent features across all dimensions. Furthermore, max-pooling is invariant to sentence length, as opposed to average pooling and attentive pooling [Suárez-Paniagua and Segura-Bedmar 2018]. Max-pooling is particularly well suited to separating features that are very sparse [Boureau et al. 2010a,b]. With very long sequences, however, the information from most token embeddings will be dropped.

To preserve more information, the more general k-max-pooling method can be employed [Kalchbrenner et al. 2014]. For a sequence of N d-dimensional token embeddings E ∈ R^{N×d} and k ≤ N, the k-max-pooling operation pools the k most active features in E, producing k vectors E^k_max, each in R^d. The hyper-parameter k can be set statically or dynamically as a function of N to allow for smooth extraction of features [Kalchbrenner et al. 2014; Shu et al. 2018]. To produce one context vector, Zhao et al. [2018] sum E^k_max and scale the result by a factor s.
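The single-token (CLS), max, and average pooling strategies discussed above can be sketched in a few lines; the function signature and masking convention below are assumptions made for the example.

```python
import torch

def pool_sequence(token_embs: torch.Tensor, mask: torch.Tensor, strategy: str = "mean"):
    """Pool (B, T, d) token embeddings into (B, d) context vectors.
    mask: (B, T) with 1 for real tokens and 0 for padding."""
    if strategy == "cls":                                    # first (bos/CLS) token
        return token_embs[:, 0]
    if strategy == "max":                                    # per-dimension max over real tokens
        masked = token_embs.masked_fill(mask[..., None] == 0, float("-inf"))
        return masked.max(dim=1).values
    if strategy == "mean":                                   # average over real tokens only
        summed = (token_embs * mask[..., None]).sum(dim=1)
        return summed / mask.sum(dim=1, keepdim=True).clamp(min=1)
    raise ValueError(strategy)
```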
If s = k, the context becomes the mean of E^k_max, and if k = N, the scaled k-max-pooling is equivalent to average pooling.

3.4.3 Average Pooling

Average pooling [LeCun et al. 1998 1989; Reimers and Gurevych 2019] computes the context vector by taking the element-wise arithmetic mean of the token embeddings. Unlike max-pooling, average pooling down-weighs the most activated features by combining them with the non-maximal features. This makes the average pooling procedure better suited to preserving background information.

3.4.4 Attentive Pooling

Another option for producing the context vector is to use attention [Luong et al. 2015; Bahdanau et al. 2014] to obtain a weighted sum of the inputs instead of a plain average. Not all tokens in the sequence contribute equally to the meaning of the sentence. The attentive pooling mechanism learns to weigh the relevant tokens more than the irrelevant ones to produce a context embedding, and the relevance weights are learned alongside the model. Given a sequence of N embeddings E = {e_1, e_2, ..., e_N}, the context vector is produced by the following:

$$E_{seq} = \sum_{i=1}^{N} \alpha_i e_i, \qquad \alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_{j=1}^{N} \exp(u_j^\top u_s)}, \qquad u_i = \tanh(W_s e_i + b_s)$$

where E_seq is the sentence embedding, α_i is the attention weight corresponding to the i-th token embedding, and W_s, b_s, and u_s are trainable parameters. The sentence-level context vector u_s can be randomly initialized and jointly learned during the training process [Yang et al. 2016; Wu et al. 2020a]. Alternatively, u_s can be set to the unit matrix [Lin et al. 2017b; Chen et al. 2018]. The attention mechanism can also be extended to multi-head attention [Chen et al. 2018; Wu et al. 2020c].

Unlike the mean and max pools, the attentive pool contains trainable parameters. This allows the model to choose which tokens it finds most useful for the sequence context, at the expense of additional parameters and possibly training time. Furthermore, under low-data settings, the attentive pool can lead to over-fitting [Li et al. 2018].

3.4.5 Mixed Pooling

It is also possible to use more than one pooling technique. For instance, Yu et al. [2014], Lee et al. [2016], Zhao et al. [2018] and Zhou et al. [2021] saw performance gains when they combined average pooling and max-pooling, which combines the advantages of each strategy. The max pool or average pool can also be used as the query vector for an attentive pooling procedure [Maini et al. 2020; Liu et al. 2016]. Sahu and Anand [2018] and Suárez-Paniagua and Segura-Bedmar [2018] concatenate the results of multiple pooling strategies. Concatenating these methods, however, introduces more parameters that take more time to learn or tune and increases computational overhead. Gholamalinezhad and Khosravi [2020] and Zafar et al. [2022] review several other pooling techniques used in the literature.

3.5 Natural Language Processing

The challenge of solving Math Word Problems (MWPs) falls within the domain of Natural Language Processing (NLP). More specifically, MWPs are presented as textual data and fed through an NLP model, which is expected to generate a mathematical expression that solves the problem. NLP models process text in its token form, which is produced through a process known as tokenization. The objective of solving MWPs can be presented to the model as a sequence generation task, where the model needs to generate the tokens corresponding to the correct solution, or as a ranking task, where the model generates scores for several hypotheses from which the final solution is chosen.
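A minimal sketch of an attentive pooling module corresponding to the equations above is given below; the parameter names mirror W_s, b_s, and u_s, while the module name and masking convention are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Weighted-sum pooling with a learned sentence-level query u_s."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)     # implements W_s and b_s
        self.u_s = nn.Parameter(torch.randn(d_model))

    def forward(self, token_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # token_embs: (B, T, d); mask: (B, T), 1 for real tokens
        u = torch.tanh(self.proj(token_embs))                   # u_i = tanh(W_s e_i + b_s)
        scores = u @ self.u_s                                   # (B, T) relevance logits
        scores = scores.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)                    # attention weights alpha_i
        return (alpha.unsqueeze(-1) * token_embs).sum(dim=1)    # (B, d) context vector
```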
In either case, the model first encodes each of the tokens into an embedding space and then uses these embeddings to generate the desired outcome. The point-wise ranking task can be viewed as a metric learning task: with the input MWP as the anchor, the model learns to score positive solutions higher than negative solutions [McFee and Lanckriet 2010; Cakir et al. 2019].

The remainder of this section is organized as follows. In Section 3.5.1, we delve into the tokenization process, encompassing word-level, character-level, subword-level, and byte-level tokenizers. Subsequently, in Section 3.5.2, we explore two generative decoding techniques, namely greedy search and beam search. We then discuss the use of label smoothing and teacher forcing, which can effectively improve the generation objective during training.

3.5.1 Tokenization

Tokenization is an integral part of virtually every NLP task and is arguably the first linguistically motivated preprocessing step, because it identifies the basic units on which all other processing is based [Michelbacher 2013]. Tokenization breaks raw text into discrete pieces representing words, phrases, symbols, or other meaningful elements called tokens [Gupta and Malhotra 2015; Verma et al. 2014]. The tokens can be treated as atomic components of the text and, in turn, form the building blocks of NLP [Lyon 2022]. Having tokenized our dataset, we can create our vocabulary, which is the set of unique tokens used in the text corpus. Text can be tokenized using a variety of methods, including word-level, character-level, and subword tokenization. Inadequate tokenization of any portion of the text can result in errors later in the NLP pipeline.

Word-level Tokenization

A relatively simple and straightforward approach to splitting text into words or tokens is to use a delimiter, such as white space or punctuation. More advanced rules can be incorporated to handle hyphens, suffixes, and prefixes [Honnibal and Montani 2017; Speer 2019; Koehn et al. 2007]. Word tokenization collectively refers to approaches that split text at the word level.

An issue with word tokens is that taking every unique word in a large text corpus can lead to a very large vocabulary. For instance, Dai et al. [2019] use a word-level tokenizer that generates a vocabulary four times larger than that of single-language pre-trained Transformers using more granular tokenization techniques. A large vocabulary necessitates an enormous input and output embedding layer, which increases the model's memory and time complexity. Dealing with words that are "out of vocabulary" (OOV) is another significant challenge associated with word tokens. Most of the time, this problem is handled by mapping all unknown words to a single 'unknown' token. This presents a number of difficulties, one of which is that the meaning of many of the OOV tokens is lost.

Character-level Tokenization

Rather than splitting text into words, it can be split into characters; this is exactly what character-level tokenization does. This resolves the above-mentioned issues with the word-level tokenization technique. The vocabulary is relatively small: it can have a size of 26 for a text-only, case-insensitive English corpus, or 256 to incorporate all the ASCII codes (https://www.asciitable.com/).
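For illustration, a naive word-level tokenizer with an 'unknown' token for OOV words can be sketched as follows; the regular expression, special tokens, and vocabulary cap are assumptions made for the example.

```python
import re
from collections import Counter

def word_tokenize(text: str) -> list[str]:
    """Naive word-level tokenizer: split on whitespace and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus: list[str], max_size: int = 10_000) -> dict[str, int]:
    counts = Counter(tok for doc in corpus for tok in word_tokenize(doc))
    vocab = {"<pad>": 0, "<unk>": 1}                      # OOV words map to <unk>
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    return [vocab.get(tok, vocab["<unk>"]) for tok in word_tokenize(text)]
```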
Character-level tokenization can also handle OOV words coherently by preserving the information of the word: it breaks the OOV word down into characters and represents the word in terms of these characters. This method is particularly useful for languages where individual characters carry important information, such as Chinese or Japanese. This strategy has been adopted by Xue et al. [2022] to design a token-free Transformer that can be used in unilingual and multilingual settings. A character-level tokenizer also requires no training to learn a vocabulary.

Several drawbacks arise with this tokenization technique, one being the length of the token sequences it produces. Because each word is now split into its individual characters, the length of the tokenized sequence can easily surpass that of the original text. In addition, the model has to spend more capacity to reach a higher-level representation compared to other tokenization methods, since the individual tokens on their own do not carry any semantic meaning.

Subword Tokenization

The primary goal of subword-based tokenization techniques is to solve the large vocabulary size and the significant number of OOV tokens in word-based tokenization, as well as the challenges presented by character-based tokenization, which results in extremely long sequences and individual tokens with little significance. To do this, subword-based tokenization algorithms do not split the frequently used words, but break the rare words into smaller, meaningful subwords. Some of the popular subword tokenization algorithms are WordPiece [Devlin et al. 2018], Byte-Pair Encoding (BPE) [Sennrich et al. 2015], Unigram, and SentencePiece [Kudo and Richardson 2018]. Our research adopts the BPE tokenizer, in particular the byte-level BPE tokenizer used in our pre-trained model [Lewis et al. 2019; Radford et al. 2019]. This is crucial because a pre-trained model can only work successfully if it is fed input that has been tokenized using the same rules as were used to tokenize its training data.

Byte-level BPE Tokenization

The byte-level BPE (BBPE) tokenizer is a variant of the BPE tokenizer [Gage 1994; Sennrich et al. 2015], which was initially developed as a text compression algorithm and has been widely adopted for pre-trained language models [Liu et al. 2019c; Radford et al. 2018 2019; Lewis et al. 2019; Lample and Conneau 2019]. While standard BPE uses characters as the base tokens, BBPE uses bytes, such as those of a Unicode encoding, as the fundamental units of tokenization. This allows the tokenization to operate at a finer granularity, similar to character tokenization. The tokenization process can either remain at the byte level or use Byte-Pair Encoding to merge bytes into larger units, depending on the specific requirements of the task.

In practice, the byte-level BPE tokenization process begins by creating a base vocabulary of bytes. Using these byte-level base tokens, the BBPE algorithm identifies the most frequent "byte pairs" and replaces them with a new, combined token that represents both of the original tokens. This process continues until a predetermined number of merge operations has been completed or the desired vocabulary size has been achieved. The number of merge operations and the size of the target vocabulary are configurable hyper-parameters [Wang et al. 2020; Radford et al. 2019]. The benefit of BPE is that it can capture both common and uncommon words in a corpus.
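The merge procedure can be illustrated with a toy character-level BPE trainer; a true byte-level BPE would start from the 256 byte values rather than characters, and production tokenizers include many further details that are omitted in this sketch.

```python
from collections import Counter

def learn_bpe_merges(words: list[str], num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)            # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]          # most frequent adjacent pair
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():             # replace the pair with a merged symbol
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return merges

# e.g. learn_bpe_merges(["low", "lower", "lowest", "new", "newer"], num_merges=5)
```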
BPE ensures that the most common words are represented as single tokens in the vocabulary, while rare and OOV words are broken down into smaller subword units that are shared with other words, allowing them to be more easily represented by the model. This makes BPE especially useful for languages with complex morphology, where words can take on a variety of inflections and conjugations. Moreover, the length of tokenized texts is shorter than with character-level tokenization, and the vocabulary size is smaller than with word-level tokenization [Mielke et al. 2021; Webster and Kit 1992].

3.5.2 Sequential Generation Task

The sequential generation task in NLP involves generating text one word or token at a time, in sequential order. The output generated at each time step is used as input for generating the next output. The model learns to predict the probability distribution of the next word given the previously generated words and some input context. Greedy search and beam search are two commonly used algorithms for generating sequences in NLP. Sequential generation is widely used in various NLP tasks such as language translation, text summarization, and dialogue systems [Lin et al. 2022; Kalyan et al. 2021; Allahyari et al. 2017].

Greedy Search

Greedy search is a simple algorithm for generating text sequences. It generates the next word in a sequence by selecting the word with the highest probability at each time step. The initial seed for greedy search can be a single word, phrase, or sentence. At each time step, the model generates a probability distribution over the vocabulary of possible subsequent words based on the context of the preceding words, selects the most probable word from this distribution, and appends it to the generated sequence. This process continues until a stopping criterion is met, such as a maximum sequence length or an end-of-sequence token. The resulting sequence is then returned as the final output.

One of the primary benefits of greedy search is its simplicity and efficiency. It is a straightforward algorithm that requires minimal computational resources, making it fast and easy to implement. There are, however, some disadvantages. Because it only considers the word with the highest probability at each time step, it can lead to suboptimal results, especially when the model makes an error early in the generation process, which can cause a cascade of errors [Meister et al. 2020a; Freitag and Al-Onaizan 2017; Yoo et al. 2020].

Beam Search

Beam search is an alternative to greedy search that keeps track of the top k most probable sequences at each time step, where k is a predefined beam size. The algorithm is comparable to greedy search, except that it chooses the top k words from the generated probability distribution based on their probabilities. The algorithm then generates k new candidate sequences by appending one of these words to each of the k existing sequences. The new candidate sequences are ranked according to their probabilities, and the top k sequences are retained for the subsequent time step. This process continues until a stopping criterion is met, such as a maximum sequence length or an end-of-sequence token. The candidate sequence with the highest probability is then returned as the final text.
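Greedy decoding can be sketched as follows; the model interface assumed here (returning next-token logits given the encoder inputs and the partial decoder sequence) and the argument names are illustrative assumptions, not the interface of any particular library.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, bos_id: int, eos_id: int, max_len: int = 64):
    """Greedy decoding sketch for an encoder-decoder model."""
    decoder_ids = torch.full((input_ids.size(0), 1), bos_id,
                             dtype=torch.long, device=input_ids.device)
    for _ in range(max_len):
        logits = model(input_ids, decoder_ids)                        # (B, t, V) next-token logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # highest-probability token
        decoder_ids = torch.cat([decoder_ids, next_token], dim=1)
        if (next_token == eos_id).all():                              # stop once every sequence ends
            break
    return decoder_ids
```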
Beam search has the advantage of exploring multiple options at each time step, which can result in higher output quality than greedy search. Beam search can handle situations where the most likely word at each time step does not lead to the best overall sequence: by keeping track of multiple candidate sequences, it covers a wider range of possible outcomes. Beam search is computationally more expensive than greedy search due to the need to keep track of multiple candidate sequences at each time step. It can also produce repetitive or redundant output, as it prefers highly probable paths. Various techniques have been proposed to modify the beam search algorithm to address this issue, such as length normalization, which adjusts the probabilities based on the length of the sequence, and diverse beam search, which encourages diversity in the output by penalizing redundant sequences [Meister et al. 2020a; Freitag and Al-Onaizan 2017; Yoo et al. 2020; Meister et al. 2020b].

Label Smoothing

Label smoothing is a regularization technique used to prevent models from becoming overconfident and to improve their generalizability. Label smoothing replaces hard 0/1 targets in the training data with soft targets that distribute probability mass to labels with comparable attributes. A smoothed distribution that gives more weight to similar labels is used in place of presenting a label as a one-hot vector with a single 1 and all other values 0. This enables the model to identify representations that a