Querying relational database systems in natural language using sequence to sequence learning with neural networks

Khalo, Nomonde

Files

KHALO Nomonde 317845 MSc Diss_Final_Msc.pdf (1.61 MB)

Date

2021

Authors

Khalo, Nomonde

Abstract

Retrieving information from a database requires knowledge of a query language such as Structured Query Language (SQL). This is challenging for individuals with little or no understanding of database query languages. Querying databases in natural language can thus help individuals retrieve information without knowing SQL or the underlying domain of the database. The main objective of this research is to comparatively investigate deep learning approaches, specifically sequence-to-sequence architectures, for the task of translating natural language to SQL. This is achieved by answering three research questions: Which sequence-to-sequence architecture is most efficient in translating natural language to SQL? What are the key factors behind failures in the generation of semantically equivalent SQL queries? How can failures in the generation of semantically equivalent SQL queries be mitigated? The first research question is answered by comparatively investigating the sequence-to-sequence models LSTM, BiLSTM, Encoder-Decoder, Column Attention, and Pointer Network on the WikiSQL dataset, which consists of question-query pairs and their three SQL components: AGGREGATION, SELECT, and WHERE-Clause. The investigation showed that no single model is fit as an end-to-end solution for all the SQL components. AGGREGATION performed well with the BiLSTM, Column Attention specifically handled column-name prediction for SELECT and WHERE-Clause, and the Pointer Network predicted more robust WHERE-Clauses. The second research question is answered through the SQL output, from which it is evident that, due to a lack of information in the question, some models were unable to predict the correct information. Furthermore, the error analysis highlighted various types of errors, including errors in the ground truth, multiple valid SQL queries, and unexplainable errors.
The third question is answered by the error analysis. To mitigate errors, the ground truth needs to be standardised for each question that might have multiple valid SQL queries, and errors in the ground truth need to be fixed first so that the evaluation is of good quality. Training the models according to their question capacity, where they are most successful and able to predict, is also a strategy that can mitigate errors. This work contributes to the body of work that investigates semantic parsing of natural language to SQL for querying databases, providing an extensive study showing that different SQL components achieve their best results on different sequence-to-sequence models, a comprehensive error analysis, and directions for mitigating the errors generated. Future work will focus on generating a standardised ground truth for multiple valid SQL queries and on correcting the ground truth.
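
To illustrate the three SQL components named in the abstract, the following is a minimal sketch of how a question's query might be held as separate AGGREGATION, SELECT, and WHERE parts and scored per component. It is not taken from the dissertation: the `Query` class, the `to_sql` and `component_matches` helpers, the table name `t`, and the sample columns are all hypothetical, and the operator tables are only assumed to resemble the WikiSQL convention.

```python
from dataclasses import dataclass, field

# Operator tables assumed to resemble the WikiSQL convention (illustrative only).
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]


@dataclass
class Query:
    """Hypothetical container for the three components of one query."""
    sel: int                                   # index of the SELECT column
    agg: int                                   # index into AGG_OPS
    conds: list = field(default_factory=list)  # (column index, operator index, value)


def to_sql(query, columns, table="t"):
    """Render the three components as a single SQL string."""
    col = columns[query.sel]
    agg = AGG_OPS[query.agg]
    select = f"{agg}({col})" if agg else col
    sql = f"SELECT {select} FROM {table}"
    if query.conds:
        where = " AND ".join(
            f"{columns[c]} {COND_OPS[o]} '{v}'" for c, o, v in query.conds
        )
        sql += f" WHERE {where}"
    return sql


def component_matches(pred, gold):
    """Compare a predicted query with the ground truth, component by component."""
    return {
        "AGGREGATION": pred.agg == gold.agg,
        "SELECT": pred.sel == gold.sel,
        # Condition order does not change the meaning of a conjunction.
        "WHERE": set(map(tuple, pred.conds)) == set(map(tuple, gold.conds)),
    }


columns = ["player", "country", "points"]
gold = Query(sel=2, agg=5, conds=[(1, 0, "Kenya")])
pred = Query(sel=2, agg=4, conds=[(1, 0, "Kenya")])
print(to_sql(gold, columns))
print(component_matches(pred, gold))
```

Scoring each component separately, as in this sketch, is one way a comparison could show that different models do best on different components. An exact comparison of this kind would also count a differently written but equally valid query as an error, which is the kind of ground-truth issue the abstract raises.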

Description

A dissertation submitted to the Faculty of Science in fulfillment of the requirements for the degree Master of Science (Computer Science), 2021

URI

https://hdl.handle.net/10539/33099

Collections

ETD Collection
