Querying relational database systems in natural language using sequence to sequence learning with neural networks
Date
2021
Authors
Khalo, Nomonde
Abstract
Retrieving information from a database requires knowledge of a query language such as Structured Query Language (SQL). This is challenging for individuals with little or no understanding of database query languages. Querying databases in natural language can thus help individuals retrieve information without knowing SQL or the underlying domain of the databases. The main objective of this research is to comparatively investigate deep learning approaches, specifically sequence-to-sequence architectures, for the task of translating natural language to SQL. This is achieved by answering three research questions: Which sequence-to-sequence architecture is more efficient in translating natural language to SQL? What are the key factors behind failures in the generation of semantically equivalent SQL queries? How can failures in the generation of semantically equivalent SQL queries be mitigated? The first research question is answered by comparatively investigating the sequence-to-sequence models LSTM, BiLSTM, Encoder-Decoder, Column Attention, and Pointer Network on the WikiSQL dataset, which consists of question-query pairs and their three SQL components: AGGREGATION, SELECT, and WHERE-Clause. The investigation showed that no single model is fit to serve as an end-to-end solution for all the SQL components. AGGREGATION performed well with the BiLSTM, Column Attention specifically handled column name prediction for SELECT and WHERE-Clause, and the Pointer Network was noted to predict a more robust WHERE-Clause. The second research question is answered through the SQL output, from which it is evident that, owing to a lack of information in the question, some models were unable to predict the correct information. Furthermore, the error analysis highlighted various types of errors, including errors in the ground truth, multiple valid SQL queries, and unexplainable errors.
The third question is answered by the error analysis. To mitigate errors, the ground truth needs to be standardised for each question that may have multiple valid SQL queries, and errors in the ground truth need to be fixed first so that the evaluation is of good quality. Training each model according to its question capacity, where it predicts most successfully, is a further strategy that can mitigate errors. This work contributes to the body of work that investigates semantic parsing of natural language to SQL for querying databases by providing an extensive study showing that different SQL components achieve their best results with different sequence-to-sequence models, a comprehensive error analysis, and directions for mitigating the errors generated. Future work will focus on generating a standardised ground truth for multiple valid SQL queries and on correcting the ground truth.
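The three SQL components and the "multiple valid SQL queries" error type mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the dissertation. It assumes the structured query layout of the public WikiSQL release (an aggregation index, a selected column index, and a list of column/operator/value conditions); the table, question, and helper names `render_sql` and `same_components` are hypothetical and invented for illustration.

```python
# Operator vocabularies assumed to follow the public WikiSQL release.
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]


def render_sql(columns, query):
    """Render a {'agg', 'sel', 'conds'} dict as a SQL string (hypothetical helper)."""
    select_col = columns[query["sel"]]
    agg = AGG_OPS[query["agg"]]
    # AGGREGATION and SELECT components
    select_part = f"{agg}({select_col})" if agg else select_col
    # WHERE-Clause component: one "column operator value" triple per condition
    where_parts = [
        f"{columns[col]} {COND_OPS[op]} '{value}'"
        for col, op, value in query["conds"]
    ]
    sql = f"SELECT {select_part} FROM table"
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql


def same_components(a, b):
    """Compare two queries component by component, ignoring condition order."""
    return (
        a["agg"] == b["agg"]
        and a["sel"] == b["sel"]
        and sorted(map(tuple, a["conds"])) == sorted(map(tuple, b["conds"]))
    )


# Hypothetical table header and question, invented for this example.
columns = ["Player", "Country", "Year", "Score"]
question = "How many players from Kenya scored in 2019?"

# Ground truth and a prediction that differ only in the order of conditions.
gold = {"agg": 3, "sel": 0, "conds": [[1, 0, "Kenya"], [2, 0, "2019"]]}
pred = {"agg": 3, "sel": 0, "conds": [[2, 0, "2019"], [1, 0, "Kenya"]]}

gold_sql = render_sql(columns, gold)
pred_sql = render_sql(columns, pred)
print(gold_sql)
print(pred_sql)
print("exact string match:", gold_sql == pred_sql)
print("component match:   ", same_components(gold, pred))
```

In this sketch the two queries are semantically equivalent, yet an exact string comparison reports a mismatch. This is one way a prediction can be counted as an error when the ground truth is not standardised.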
Description
A dissertation submitted to the Faculty of Science in fulfillment of the requirements for the degree Master of Science (Computer Science), 2021