There is a plethora of textual information available, and it has become increasingly difficult to draw insights from this data and find relevant answers to our questions. Question-answering systems such as Machine Reading Comprehension (MRC) systems are effective at retrieving useful information: the model extracts the answer from a given passage rather than from the web. MRC tasks have received a lot of attention lately. Although existing models achieve reasonably good results, they can still produce unreliable answers when a question is unanswerable, and they tend to be computationally heavy. Our aim here is therefore to experiment and present a model that is more reliable.
Our aim is to leverage Machine Learning and Natural Language Processing to create a model that deduces the answer from a given passage and also identifies when a question is unanswerable. We plan to develop an ensemble model that accomplishes this task and gives reliable answers to questions asked about the passage. Our proposed approach is to innovate on the different modules of our architecture, taking inspiration from state-of-the-art architectures.
We are going to use the Stanford Question Answering Dataset 2.0 (SQuAD 2.0), which combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must answer questions when possible and also determine when no answer is supported by the paragraph and abstain from answering.
A sample of the raw dataset is shown below, showcasing questions with their actual or plausible answers. The "is_impossible" flag distinguishes answerable from unanswerable questions, and the features provided for a question vary accordingly.
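As a minimal sketch of how we read this structure, the snippet below flattens the raw SQuAD 2.0 JSON (with its `is_impossible` flag and `answers` / `plausible_answers` fields) into one record per question; the helper name `load_squad` is ours.

```python
import json

def load_squad(path):
    """Flatten raw SQuAD 2.0 JSON into one record per question."""
    with open(path) as f:
        data = json.load(f)["data"]
    rows = []
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable questions carry an empty "answers" list and,
                # optionally, adversarial "plausible_answers".
                answers = qa["answers"] if not qa["is_impossible"] else qa.get("plausible_answers", [])
                rows.append({
                    "context": context,
                    "question": qa["question"],
                    "is_impossible": qa["is_impossible"],
                    "answer_text": answers[0]["text"] if answers else "",
                    "answer_start": answers[0]["answer_start"] if answers else -1,
                })
    return rows
```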


We will use a Deep Learning architecture for our Machine Reading Comprehension (MRC) task, involving the following modules: an embedding module, feature extraction, context-question interaction, a verification module, and answer prediction. We will use contextual embeddings from BERT and then experiment with feature extraction techniques in combination with attentive context-question interaction methods. A span extractor has been shown to work well as an answer predictor in MRC tasks in the existing literature [7], but we will experiment with other methods as well. We will also explore unsupervised models that learn via self-supervision.
We hope to achieve competent scores on the metrics popularly used for this task: the F1 score and the EM (exact match) score. These scores are already used on SQuAD [5] to compare various models on the dataset. Additionally, if we are able to develop a competent model, we would also like to keep it light in terms of model size, so that it can be deployed in places where computational resources are limited.
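For reference, below is a minimal sketch of how EM and token-level F1 are typically computed for SQuAD-style evaluation (answer strings are normalized before comparison); it mirrors the logic of the official evaluation script rather than reproducing it verbatim.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return int(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```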

When it comes to the SQuAD 2.0 dataset, we found it to be mostly reliable and clean. However, we did perform some basic cleaning: removing extra white space, converting to lower case, stripping unknown ASCII characters, and tokenizing as required by each model. Data points with unreasonably short questions were removed from both the training and testing datasets. To convert the words in the passage to their root form, in sync with the answers, we used lemmatization. We also engineered a feature for the end character of each answer, given that only the start character is provided.
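A minimal sketch of these cleaning steps follows, assuming NLTK's WordNet lemmatizer and interpreting "unknown ASCII characters" as bytes outside the ASCII range; the helper names are ours.

```python
import re
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                              # lower-casing
    text = text.encode("ascii", "ignore").decode()   # drop characters outside ASCII
    return re.sub(r"\s+", " ", text).strip()         # collapse extra white space

def lemmatize_tokens(tokens):
    # Reduce passage words to their root form so they line up with the answers.
    return [lemmatizer.lemmatize(tok) for tok in tokens]

def answer_end_char(answer_text, answer_start):
    # The dataset provides only the start character; derive the end character.
    return answer_start + len(answer_text)
```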
The SQuAD 2.0 training dataset is unbalanced, with about two-thirds of the questions being "Answerable", while the testing dataset is nearly balanced, with "Answerable" questions comprising 49.9% of the data, as can be seen below.
We also analyzed the distribution of questions per passage across the train and test datasets and found that the test dataset has an average of 10 questions asked per context passage; the distributions are shown below:
Next, we looked at the lengths of the contexts, questions, and answers to understand the dataset better and identify any outliers.
Lastly, we looked at the dominant words in the contexts and the questions by creating word clouds after removing stopwords.
[Figures: wordcloud for the given contexts; wordcloud for the given questions]
GPT-3 is an autoregressive language model that shows strong few-shot learning capability on natural language tasks. It has a transformer architecture trained with the generative pre-training method and can produce human-like answers. Since it can perform new tasks it has not been trained on, it can be used as an unsupervised method. Hence, we use GPT-3 with the 'text-davinci-002' engine for our question-answering task without providing it the answers; more specifically, we ask GPT-3 to perform two tasks to generate an answer. For the first task, we provide the model with only the question and the task prompt "Give an answer of length less than x words." For the second task, we provide the model with the question, the context paragraph, and the task prompt "Based on the context below give an answer to the question below. The answer should have less than x words." For both tasks we set the answer-length limit x based on the dataset's answer lengths to avoid undesirably long answers. For each answer, we obtain a similarity by comparing the embeddings of the GPT-3 answer and the SQuAD answer with the 'text-similarity-davinci-001' engine, and we finally compare the generated answers with the answers provided in the dataset by computing an L2-based similarity score.
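A sketch of the two prompting setups and the embedding-based similarity is shown below, assuming the legacy openai-python (pre-1.0) interface that exposed the Completion and Embedding endpoints with the engines named above; the exact prompt layout and the L2-to-similarity conversion are illustrative.

```python
import numpy as np
import openai  # assumes openai.api_key has been set; legacy (<1.0) interface

def gpt3_answer(question, context=None, max_words=10):
    if context is None:
        prompt = (f"Give an answer of length less than {max_words} words.\n"
                  f"Question: {question}\nAnswer:")
    else:
        prompt = (f"Based on the context below give an answer to the question below. "
                  f"The answer should have less than {max_words} words.\n"
                  f"Context: {context}\nQuestion: {question}\nAnswer:")
    resp = openai.Completion.create(engine="text-davinci-002", prompt=prompt,
                                    max_tokens=64, temperature=0)
    return resp["choices"][0]["text"].strip()

def answer_similarity(gpt_answer, squad_answer):
    emb = openai.Embedding.create(engine="text-similarity-davinci-001",
                                  input=[gpt_answer, squad_answer])["data"]
    a = np.array(emb[0]["embedding"])
    b = np.array(emb[1]["embedding"])
    # Convert the L2 distance between the (unit-norm) embeddings into a similarity
    # score; for unit-norm vectors this equals cosine similarity.
    return 1.0 - np.linalg.norm(a - b) ** 2 / 2.0
```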
Under the unsupervised approach with GPT-3, we analyzed statistics of the generated answers and compared GPT-3's performance against the SQuAD dataset, using a subset of 500 data points to derive insights.
As discussed, we assign GPT-3 the task of answering our questions, and we do this in two branches. First, we do not give GPT-3 the context for the question, allowing it to use the vast knowledge gained from being trained on millions of articles. Based on the word lengths of the answers in SQuAD, we ask GPT-3 to limit its answer length to avoid unnecessary information that causes the scores to diverge.
In the plot below, we see that our dataset's answers are much more concise than those of the GPT-3 model; the GPT-3 answers are noticeably more verbose. This stems from the fact that GPT-3 has knowledge of things that are not necessarily in the context used by the SQuAD dataset. GPT-3 gives answers of roughly 41 characters on average.
Checking the similarity between the answers from GPT-3 and the SQuAD dataset, we obtain the similarity plot shown below.
We see that the graph is not skewed towards a score of 1, meaning that the fraction of answers where GPT-3 overlaps considerably with SQuAD is small. This is probably due to the additional information supplied by GPT-3. The average similarity score, which we treat as accuracy, is 81.9%.
Example (No Context):
* Question: When did Beyonce start becoming popular?
* GPT-3 Answer: Beyonce started becoming popular in the early 2000s.
* SQuAD Answer: In the late 1990s.
When we give GPT-3 the context along with the question, we see some interesting insights. First, the answer lengths now start matching those of the SQuAD dataset, and there is no longer a divergence in answer length between SQuAD and GPT-3; this is much closer to the SQuAD dataset, with an average answer length of 28.0 characters.
As expected, we also see a large improvement in the similarity scores between the GPT-3 answers and the SQuAD answers. Many questions are now answered with 100% accuracy, and the average accuracy is 86.4%. Thus, providing context makes the answers both concise and close to SQuAD's.
Example (With Context):
* Question: When did Beyonce start becoming popular?
* GPT-3 Answer: In late 1990s.
* SQuAD Answer: In the late 1990s.
We also examine the plausible answers for the questions that SQuAD deems unanswerable. From the graph below, we see that when we give the context, GPT-3's predictions for these unanswerable-but-plausible questions are similar to the plausible answers, whereas there is divergence when we do not give the context. This is because, without context, GPT-3 draws on all of its knowledge and answers every question.
Summary of GPT-3 scores with and without context:
| Measure | Context | Without Context |
| Similarity Avg Score (%) | 86.4 | 81.9 |
| Average Answer Length (characters) | 28.0 | 41.0 |
| Max Answer Length (characters) | 251 | 306 |
| Plausible Answer Similarity (%) | 82.4 | 82.0 |
Next, we analyze the answers generated by GPT-3 for the commonly asked question types ("When", "What", "How", and "Why") against the actual answers provided by the SQuAD 2.0 dataset. For the majority of question types, the answers given by GPT-3 without any context are longer than those in the other two categories. GPT-3 with context is also highly in sync with the true values for short answers, except for questions containing "Why". These discrepancies can be attributed to the small number of data points, as only 10 questions in our sample contain "Why". Questions with "What" are the most reliable, as they span almost 59% of the sample with 295 data points.
Bi-Directional Attention Flow (BiDAF) is a multi-stage network that uses a bidirectional attention flow mechanism to model a query-aware context representation. We use this architecture for our question-answering task because of its effective attention computation at every time step, which reduces information loss. Additionally, since the attention computed at each time step is a function of the context paragraph and the question at that time step, the model has a memory-less attention mechanism, which lets it learn the interaction between the given context and the question.
The details of the BiDAF architecture can be found in [8].
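As a reference for the attention layer described above, below is a minimal PyTorch sketch of BiDAF's bidirectional attention step as formulated in [8] (context-to-query and query-to-context attention built from a trainable similarity matrix); the tensor names, shapes, and function signature are ours.

```python
import torch
import torch.nn.functional as F

def bidaf_attention(H, U, w_s):
    """H: (batch, T, d) context encodings; U: (batch, J, d) question encodings;
    w_s: (3d,) trainable similarity weight. Returns the query-aware context G."""
    T, J = H.size(1), U.size(1)
    H_exp = H.unsqueeze(2).expand(-1, -1, J, -1)            # (batch, T, J, d)
    U_exp = U.unsqueeze(1).expand(-1, T, -1, -1)            # (batch, T, J, d)
    cat = torch.cat([H_exp, U_exp, H_exp * U_exp], dim=-1)  # (batch, T, J, 3d)
    S = cat.matmul(w_s)                                     # similarity matrix (batch, T, J)
    # Context-to-query attention: which question words matter for each context word.
    a = F.softmax(S, dim=-1)
    U_tilde = torch.bmm(a, U)                               # (batch, T, d)
    # Query-to-context attention: which context words matter most for the question.
    b = F.softmax(S.max(dim=-1).values, dim=-1)             # (batch, T)
    h_tilde = torch.bmm(b.unsqueeze(1), H)                  # (batch, 1, d)
    H_tilde = h_tilde.expand(-1, T, -1)                     # (batch, T, d)
    # Query-aware representation of each context word.
    return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)  # (batch, T, 4d)
```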
From the dataset, we treat every context-question pair as a separate data sample. To extract features for each sample we perform the following additional steps. First, we tokenize the context and retrieve the span (start and end indices) of each word token in the context. Similarly, for every given answer we calculate the end character index and the answer's span in the context tokens. For unanswerable questions, the start and end indices are set to -1. Finally, the features fed into the model are: embedding indices of the context tokens, embedding indices of all characters in the context tokens, embedding indices of the question tokens, embedding indices of all characters in the question tokens, answer start spans, and answer end spans. We use pretrained GloVe embeddings to obtain the corresponding word vectors.
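A sketch of the span-recovery step is shown below, assuming a simple tokenizer whose tokens appear verbatim in the context; the helper name is ours.

```python
def answer_token_span(context, tokens, answer_text, answer_start):
    """Map an answer's character offsets to start/end token indices in the context."""
    if answer_start == -1:                      # unanswerable question
        return -1, -1
    # Recover each token's (start_char, end_char) span in the original context.
    spans, cursor = [], 0
    for tok in tokens:
        cursor = context.find(tok, cursor)
        spans.append((cursor, cursor + len(tok)))
        cursor += len(tok)
    answer_end = answer_start + len(answer_text)
    start_tok = end_tok = None
    for i, (s, e) in enumerate(spans):
        if start_tok is None and e > answer_start:
            start_tok = i                       # first token overlapping the answer
        if s < answer_end:
            end_tok = i                         # last token overlapping the answer
    return start_tok, end_tok
```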
Below are the results and training details for our preliminary fine tuning:
[Training curves: Train/NLL and Dev/NLL]
Scores for the current fine tuned model:
| F1 Score | 54.91 |
| EM Score | 51.79 |
| AvNA Score | 62.04 |
The training is highly sensitive to the batch size, the learning rate, and the maximum answer length. The plots above are for batch size = 64, learning rate = 0.5, and max answer length = 15.
The hyperparameters above were chosen based on the few experiments run so far and are subject to change once we experiment with the parameters exhaustively. The model is still performing below the F1 and EM scores that BiDAF models tuned on this dataset can reach. An interesting point to note is that the model performs relatively better on the AvNA metric, which measures the classification accuracy of the model when only considering its answer (any predicted span) versus no-answer predictions. This is due to the BiDAF architecture, which allows it to compare the predicted answer against the no-answer hypothesis effectively. One major challenge we faced was the limited availability of computing resources: training took a long time, which made effective testing of more hyperparameter combinations infeasible. We tried reducing the number of training data points to deal with the issue, but that leads to higher loss on the dev/validation set as well. We have yet to find the optimal point of this tradeoff.
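For clarity, here is a small sketch of how the AvNA (Answer vs. No-Answer) accuracy described above can be computed, treating an empty predicted string as an abstention; the function name is ours.

```python
def avna_score(predicted_answers, gold_has_answer):
    """Accuracy of the answer / no-answer decision, ignoring the predicted span itself."""
    correct = 0
    for pred, has_answer in zip(predicted_answers, gold_has_answer):
        predicted_has_answer = len(pred.strip()) > 0   # empty prediction == abstained
        correct += int(predicted_has_answer == has_answer)
    return 100.0 * correct / len(predicted_answers)
```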
We will further tune the BiDAF model, experimenting with more combinations of hyperparameters. Going ahead, we will experiment with relevant BERT-based models for the QA task and perform comparative studies of the fine-tuned models.
Bidirectional Encoder Representations from Transformers (BERT) employs masked language modeling to create pre-trained deep bidirectional representations, reducing the need for heavily engineered task-specific architectures. BERT achieves state-of-the-art performance on sentence-level and token-level tasks, outperforming many task-specific architectures. The BERT Transformer uses bidirectional self-attention. We therefore choose BERT for our question-answering task and fine-tune it on SQuAD 2.0.
We built a BERT-based model which, given a question and a passage containing the answer, returns the answer. We start with the pretrained BERT-base model "bert-base-uncased" and fine-tune it multiple times, varying parameters such as the number of epochs, the learning rate, and the batch size of the data loader.
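A minimal sketch of this setup follows, assuming the HuggingFace transformers interface; the maximum sequence length and the shape of `train_dataset` are illustrative, while the checkpoint name and the varied hyperparameters come from the report.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

def encode(question, context):
    # Question and context are packed into a single sequence; only the context
    # side is truncated so that the question is always kept intact.
    return tokenizer(question, context, truncation="only_second",
                     max_length=384, padding="max_length", return_tensors="pt")

# train_dataset is assumed to yield input_ids / attention_mask / token_type_ids
# together with start_positions and end_positions for each sample.
# loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```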
First, we read both the training and validation datasets and apply our pre-processing steps.
The training on the SQuAD 2.0 dataset is highly sensitive to the following parameters:
Number of epochs: We began with 2 epochs and found that overfitting occurred from the very first epoch. When we trained for more epochs, overfitting still set in after the first epoch, and the gap between the training and validation loss grew with each additional epoch.
Batch size: It was impossible to train the model with a larger batch size because we ran out of compute; initially, the free version of Google Colab crashed with a batch size of 16. We therefore moved to Google Colab Pro high-RAM GPUs, trained with the larger batch size, and observed better results.
Learning rate: We tried 5e-5 initially and observed overfitting, although increasing the batch size improved performance, as visible in the graphs shown (overfitting started at a later epoch). Using an LR scheduler with a much smaller learning rate gave the best results, as depicted in the final graph.
We fine-tuned our model with a few parameter configurations:
Case 1: Batch_size = 8, Learning_rate = 5e-5, Number of epochs = 3
Case 2: Batch_size = 16, Learning_rate = 5e-5, Number of epochs = 3
Case 3: Batch_size = 16, Learning_rate = 1e-6, Number of epochs = 4, using an LR scheduler for learning-rate decay (see the sketch after this list)
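Case 3 adds learning-rate decay; the report does not name the exact scheduler, so the sketch below uses transformers' `get_linear_schedule_with_warmup` (with no warmup) as one plausible choice, reusing `model` and `loader` from the setup sketched earlier.

```python
import torch
from transformers import get_linear_schedule_with_warmup

num_epochs = 4
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0,
    num_training_steps=num_epochs * len(loader))

model.train()
for epoch in range(num_epochs):
    for batch in loader:
        outputs = model(**batch)   # returns the QA loss when start/end positions are supplied
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()           # decay the learning rate after every optimizer step
        optimizer.zero_grad()
```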
As is visible from the graphs shown above, we obtained the best results with a learning rate of 1e-6, 4 epochs, a batch size of 16, and the LR scheduler. The scores for the various fine-tuned models are given below:
| Various Models | F1 Score | EM Score |
| Batch_size=8, Learning_rate=5e-5 and Number of epochs=3 | 51.15 | 44.48 |
| Batch_size=16, Learning_rate=5e-5 and Number of epochs=3 | 55.59 | 53.67 |
| Batch_size=16, Learning_rate=1e-6 and Number of epochs=4 using LR Scheduler | 75.89 | 71.91 |
We continued our efforts with the BiDAF model after the midterm report. Below are the results of training with the best hyperparameters found after a few more experiments. The plots below are for batch size = 32, learning rate = 0.3, and num_epochs = 15.
As we can see from the score graphs above, the F1 and EM scores obtained with BERT fine-tuning were far better than those of this BiDAF model.
A Lite BERT (ALBERT) for self-supervised learning of language representations is an extension of BERT with modifications that make it a lighter model. It employs two parameter-reduction techniques which lower memory consumption and decrease the required training time. Like BERT, it uses a transformer encoder with GELU nonlinearities. Three significant design choices used in ALBERT are factorized embedding parameterization, cross-layer parameter sharing, and an inter-sentence coherence loss.
These changes give ALBERT models a significantly smaller parameter count than the corresponding BERT models. For all the hyperparameter values we could experiment with, we observed that the model started overfitting within the first few epochs. Below are the eval results for batch size = 8, learning rate = 3e-5, and num_epochs = 3.
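As a rough illustration of the size difference, the sketch below loads an ALBERT QA model via transformers and counts its parameters; "albert-base-v2" is an assumed checkpoint (the comparison table later in this report refers to the model only as "Albert v2", without stating the size variant).

```python
from transformers import AlbertForQuestionAnswering, BertForQuestionAnswering

def n_params(model):
    return sum(p.numel() for p in model.parameters())

albert = AlbertForQuestionAnswering.from_pretrained("albert-base-v2")
bert = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# ALBERT-base has roughly an order of magnitude fewer parameters than BERT-base,
# thanks to factorized embeddings and cross-layer parameter sharing.
print(f"ALBERT params: {n_params(albert) / 1e6:.1f}M")
print(f"BERT params:   {n_params(bert) / 1e6:.1f}M")
```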
BLEU (BiLingual Evaluation Understudy) is a score widely used in NLP to evaluate the similarity between a generated sentence and the target sentences. It compares the n-grams of the predicted sentence with the n-grams of the target sentence and counts the number of matches. The match check is independent of the positions where the n-grams occur, and the more matches there are, the better the machine translation. A score of 1 indicates a perfect match, while a score of 0 indicates a complete mismatch. The BLEU calculation comprises two main parts:
N-gram overlap: An n-gram is a set of n consecutive words in a sentence. For example, for the sentence "I really love football", the 1-grams are "I", "really", "love", "football", and the 3-grams are "I really love" and "really love football".
The n-gram overlap counts how many of the 1-grams, 2-grams, 3-grams, etc. of the predicted sentence match their n-gram counterparts in the target sentence, acting as a precision metric. The maximum n is fixed based on the maximal n-gram length occurring in the target sentence.
Brevity penalty: As the name suggests, this term penalizes generated translations that are too short compared to the closest target length, with an exponential decay, compensating for the missing recall term in the BLEU calculation.
The calculation combines these two parts as shown below:
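The standard BLEU formula is reconstructed here in place of the original figure; $p_n$ denotes the modified n-gram precision, $w_n$ the (usually uniform) weight $1/N$, $c$ the candidate length, and $r$ the effective reference length:

$$
\text{BLEU} = \underbrace{\min\!\left(1,\ \exp\!\left(1 - \frac{r}{c}\right)\right)}_{\text{brevity penalty}} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$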
Why is BLEU a poor fit for our task?
The BLEU metric performs badly when used to evaluate individual sentences: example sentences can receive very low BLEU scores even when they capture most of the meaning. Because n-gram statistics for individual sentences are less meaningful, BLEU is by design a corpus-based metric; that is, statistics are accumulated over an entire corpus when computing the score. Note that the BLEU metric defined above cannot be factorized over individual sentences.
Evaluating the BLEU scores for different n-gram orders, we see that the trend is similar for GPT-3 with and without context, but there is a huge difference in the average value. As expected, the 1-gram variant of BLEU performs best in both cases: when given context, the average BLEU (1-gram) score is around 0.46, while without context it is around 0.11. These averages are much smaller than the similarity metric we originally used with 'text-similarity-davinci-001'. Because BLEU is designed to work at the corpus level, it is not well suited to our case, and the L2 score together with 'text-similarity-davinci-001' remains our preferred metric.
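A sketch of how the per-n-gram BLEU scores above can be computed with NLTK's `sentence_bleu` follows (smoothing is added so that zero higher-order matches do not zero out the score); the weights and whitespace tokenization are illustrative choices.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
WEIGHTS = {1: (1, 0, 0, 0), 2: (0.5, 0.5, 0, 0),
           3: (1/3, 1/3, 1/3, 0), 4: (0.25, 0.25, 0.25, 0.25)}

def bleu_n(reference, hypothesis, n):
    # sentence_bleu expects a list of tokenized references and a tokenized hypothesis.
    return sentence_bleu([reference.lower().split()], hypothesis.lower().split(),
                         weights=WEIGHTS[n], smoothing_function=smooth)
```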
| n-Gram | Average Scores with context | Average Scores w/o context |
| 1 | 0.459238 | 0.109498 |
| 2 | 0.221127 | 0.047953 |
| 3 | 0.093257 | 0.026187 |
| 4 | 0.037704 | 0.015227 |
Challenge: One problem we encountered while using GPT-3 is the length of the answers it outputs, which can also be seen in our example prompts towards the end of the page. GPT-3 inherently answers each question comprehensively, whereas SQuAD's answers are much shorter. Thus, to make a fair comparison, we need to limit the answer length that GPT-3 outputs for a given question. Note that the long answers GPT-3 produces are accurate; it is just that such comprehensive answers limit our ability to compare them with the SQuAD dataset.
Comparing BLEU scores with the similarity scores, we see that the similarity score gives us a better estimate: it is skewed towards a score of 1, whereas BLEU is skewed towards 0 even though the answers look accurate to a human reader. This is further evidence that BLEU is not the best metric for our dataset. We also see that without context the BLEU scores are strongly skewed towards 0, and even the similarity scores become slightly skewed, as is evident from the plots.
The graphs below compare the different n-grams for N = 1, 2, 3, 4. We see that the scores decrease as N increases, while the trend between BLEU and the Davinci similarity that we use remains the same.
We configure GPT's answer generation with the following hyperparameters:
| GPT Hyperparameter | Value |
| entry_count | 10 |
| entry_length | 30 |
| top_p | 0.8 |
| temperature | 1 |
Temperature: This scales the probabilities of each candidate word being generated. A higher temperature pushes the model towards more original (diverse) predictions, while a smaller one keeps the model from going off topic.
Top-p filtering: The model sorts the word probabilities in descending order and sums them up to p, dropping the remaining words. This means the model keeps only the most relevant word probabilities, rather than just the single best one, since more than one word can be appropriate given a sequence.
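Below is a common PyTorch pattern for the temperature scaling and nucleus (top-p) filtering described above, shown for a single next-token logits vector; it is an illustrative sketch rather than the exact generation code we used.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_p=0.8):
    """Temperature scaling followed by nucleus (top-p) filtering of a 1-D logits vector."""
    logits = logits / temperature                            # higher T flattens the distribution
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Drop every token after the cumulative probability first exceeds top_p.
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()   # keep the token that crosses the threshold
    cutoff[0] = False
    sorted_logits[cutoff] = float("-inf")
    probs = F.softmax(sorted_logits, dim=-1)
    return sorted_idx[torch.multinomial(probs, num_samples=1)]
```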
| CONTEXT: M. Mahdi Roozbahani is a lecturer in the School of Computational Science and Engineering at Georgia Tech. Roozbahani is also the founder of Filio, a centralized cloud-based platform for efficient organization of site photos using mobile- and web-app, which was initiated through Create-X incubator program. Roozbahani received his Ph.D in Computational Science and Engineering in 2019 under the supervision of Prof. David Frost at Georgia Tech. His research interests include topics such as modeling and simulation, network analysis and machine learning. He has earned three master’s degrees in Computational Science and Engineering from Georgia Tech, in Civil and Environmental Engineering from Georgia Tech, and in Geotechnical Engineering from University Putra Malaysia. Mahdi earned his bachelor’s degree from Iran University of Science and Technology where he received the award for the best final year bachelor project among all undergraduate students. He is a recipient of the Jean-Lou Chameau Research Excellence award, best Graduate Research Poster award in Geosystem poster symposium, outstanding research poster and outstanding volunteer award at CBBG center. He was awarded the NSF IRES fellowship global internship program at Ecole des Ponts in Paris. One of his papers was selected as the top five featured papers and issue cover in Materials journal in 2017. He has published over 10 journal and conference papers. |
| Question | Albert v2 | GPT3 (w/o word limit) | BiDAF | BERT QA |
| Who is Michael Jackson? | a lecturer in the School of Computational Science and Engineering at Georgia Tech. | Michael Jackson is not mentioned in the paragraph | Mahdi Roozbahani | Not answerable |
| Where did Mahdi get his Bachelor’s degree ? | Iran University of Science and Technology | Mahdi earned his bachelor's degree from Iran University of Science and Technology. | Iran University of Science and Technology | Iran University of Science and Technology |
| What is the area of Mahdi’s PhD? | modeling and simulation, network analysis and machine learning. | Mahdi's PhD is in Computational Science and Engineering. | Georgia Tech | Computational Science and Engineering |
| How many papers has he published? | over 10 | He has published over 10 journal and conference papers.(with word limit imposed) 10 papers | over 10 | over 10 |
| Is Mahdi Really cool? | M. Mahdi Roozbahani | Yes he is | Mahdi earned his bachelor’s degree. | Not answerable |
| Which awards has Mahdi received | Jean-Lou Chameau Research Excellence award, | Mahdi has received the Jean-Lou Chameau Research Excellence award, best Graduate Research Poster award in Geosystem poster symposium, outstanding research poster and outstanding volunteer award at CBBG center, and the NSF IRES fellowship global internship program at Ecole des Ponts in Paris. | best final year bachelor project | Jean-Lou Chameau Research Excellence award |
| Does Mahdi teach Machine Learning at Georgia tech ? | lecturer in the School of Computational Science and Engineering | Yes, Mahdi teaches Machine Learning at Georgia Tech as it is one of his research interests. | School of Computational Science and Engineering at Georgia Tech. | lecturer |
All the models we explored seem to do well when the answer can be easily inferred from the paragraph, i.e. the question contains keywords that appear verbatim in the answer sentence. However, an interesting point to note, also observed earlier, is that GPT-3 tends to add extra context to its answers depending on the question asked. This makes its answers different from those of the other models and hard to compare quantitatively, which is why we explicitly specify a length limit in the prompt used for the quantitative comparisons. Another point to note is that the trained models refrain from adding anything extra or contextual to the answer, since they have been penalized for giving answers other than the SQuAD ground truth, in which the answers are always to the point and not contextualized to the question.
One of the main objectives of this project was to train models that can reliably declare that no answer is possible from the given context. Among the unanswerable questions, one category consists of questions which clearly cannot be answered; here GPT-3 and BERT QA perform better, as indicated by the question "Who is Michael Jackson?" in the example above, while BiDAF and ALBERT consistently perform worse. A more nuanced subcategory is questions which contain the same keywords as the context but are still unanswerable and are thus quite confusing for the models. For the question "Does Mahdi teach Machine Learning at Georgia tech ?" in the table above, all the models get confused and give answers where they should not; GPT-3 in particular sometimes gives speculative answers in such cases, as is evident above. We observed that our models are not "truly" reliable in answer deduction but do show some level of reliability on the obviously unanswerable category, which is still an improvement over models that do not take unanswerable questions into account.
Bonus :) (just for fun): For the category of ambiguous questions, represented by "Is Mahdi really cool?", the models give mixed outputs, sometimes "not answerable" and sometimes a random answer, though the answers are quite interesting in such cases! While ALBERT thinks that just the name of our professor is enough to prove his coolness, GPT-3 definitely argues that he is, based on speculation (we agree :) ). BiDAF thinks that a Bachelor's degree is reason enough for him to be cool (again, we agree XD), and then there is BERT QA, which says lecturers are cool in general (interesting!).
Below is our comparison of the scores of the various supervised models explored in this report:
| Various Models | F1 Score | EM Score |
| BERT QA | 75.89 | 71.91 |
| ALBERT | 74.13 | 68.14 |
| BiDAF | 65.29 | 49.72 |
Thus, we can clearly say that BERT QA performed best among all the models based on F1 and EM scores. Although we explored GPT-3 in this report, we believe it is not fair to compare it against our supervised models, as it was evaluated only on a subset of our dataset (500 QA pairs) due to the limitations of OpenAI's API calls.