An NLP based question and answer system.
A play on the popular webpage ‘Ask Jeeves’, this project’s goal was to build an intelligent, NLP-based question-and-answer system for a given corpus of fact-based documents.
Corpus collection and tagging.
On initialization, the program collects and parses the provided corpus of documents (the sample dataset included 10,000 documents of clean, factual noun phrases). It does this by first splitting and indexing the documents, then inputting the results into Stanford’s Part-Of-Speech (POS) Tagger to tag each word. For those unfamiliar, the tagger iterates through the documents, tagging each word with its most likely token from the Penn Treebank (noun, interjection, adverb, etc.). Stanford’s tagger was chosen as it is fairly accurate, robust, and most importantly: free. At the conclusion of the tagging sequence, these documents are considered ‘tokenized’.
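The split-index-tag pipeline can be sketched as follows. This is a minimal stand-in, not the project’s code: the tiny `TOY_TAGS` lookup table merely mimics what Stanford’s tagger would produce, and the default-to-noun fallback is an assumption for illustration.

```python
import re

# Hypothetical stand-in for Stanford's POS tagger: a tiny lookup table
# mapping words to Penn Treebank tags (NNP = proper noun, VBZ = verb, etc.).
TOY_TAGS = {"paris": "NNP", "is": "VBZ", "the": "DT",
            "capital": "NN", "of": "IN", "france": "NNP"}

def tokenize_and_tag(document):
    """Split a document into words, then pair each word with its most
    likely Penn Treebank tag (defaulting to NN for unknown words)."""
    words = re.findall(r"[a-z']+", document.lower())
    return [(w, TOY_TAGS.get(w, "NN")) for w in words]

# Split and index the corpus, then 'tokenize' each document.
corpus = ["Paris is the capital of France."]
indexed = {i: tokenize_and_tag(doc) for i, doc in enumerate(corpus)}
print(indexed[0])
```

In the real system, `nltk.tag.StanfordPOSTagger` (or a similar wrapper) would replace the lookup table.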
Now that the program has its processed data set, it’s time to accept the user’s question and determine its meaning.
Decoding the question.
To begin understanding the question, the program initializes the ‘Question’ object, the main source of indexed variables for later processing.
The first variable added to the Question object is the ‘Question Word’. The program initializes the question word quite trivially: it’s simply the first word of the user’s input string. Think about it, when you ask a question, you use a quite predictable set of interrogative words (‘who’, ‘when’, ‘how’, etc.). This question word will later be helpful in inferring what tag the answer will most likely have.
After obtaining the question word, the remaining stop-words (‘the’, ‘a’, ‘for’, ‘am’, etc.) are removed from the query, as these words have little linguistic or computational relevance. Additionally, deleting these words helps shorten the lookup time in the corpus.
To complete the Question object initialization, the words immediately preceding the noun phrase are removed and the remaining string is denoted as the ’focus’ of the Question object.
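The three initialization steps above can be sketched as a small class. This is a simplified reconstruction, not the project’s actual implementation: the field names, the stop-word list, and treating the trailing content word as the noun-phrase ‘focus’ are all assumptions.

```python
# Illustrative stop-word list; the real system would use a fuller one
# (e.g. NLTK's stopwords corpus).
STOP_WORDS = {"the", "a", "an", "for", "am", "is", "was", "of", "in"}

class Question:
    """Sketch of the Question object: question word, filtered query, focus."""
    def __init__(self, text):
        words = text.lower().rstrip("?").split()
        # Step 1: the question word is simply the first token.
        self.question_word = words[0]
        # Step 2: strip stop-words from the remainder of the query.
        self.content = [w for w in words[1:] if w not in STOP_WORDS]
        # Step 3: keep the trailing noun phrase as the 'focus'
        # (simplified here to the final content word).
        self.focus = self.content[-1] if self.content else ""

q = Question("Who was the first president of France?")
print(q.question_word, q.content, q.focus)
```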
TF-IDF scoring and document selection.
Now to get to the good stuff: how the program picks the answer from the huge corpus of documents. It first quantifies the corpus with Term Frequency-Inverse Document Frequency (TF-IDF) scoring. TF-IDF weighting greatly improves answer selection by quantitatively exhibiting how important a word is within a document and throughout the entire corpus.
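A minimal TF-IDF computation looks like the sketch below, using the textbook definition (raw term count times the log of inverse document frequency); the project may use a different variant or a library implementation.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute per-document TF-IDF weights.
    TF = term count within the document;
    IDF = log(N / number of documents containing the term)."""
    n = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = Counter()                       # document frequency per term
    for words in docs:
        df.update(set(words))
    weights = []
    for words in docs:
        tf = Counter(words)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

corpus = ["the cat sat", "the dog barked", "the cat barked"]
w = tfidf(corpus)
```

Note how ‘the’, which appears in every document, gets weight zero, while rarer words like ‘sat’ score highest; this is exactly the property that makes TF-IDF useful for answer selection.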
Next, the program converts the context of the input question, determined previously in the Question object, to a vector using standard word-to-vector calculations. Similarly, all sentences in each document are converted to vectors and weighted accordingly with their TF-IDF scores. Finally, the question vector and each sentence vector are compared using cosine similarity. If a document returns a similarity of 0.3 or greater, the vector calculations halt and that document is used for further answer selection. The cosine score threshold of 0.3 was found to be an acceptable heuristic, as documents scoring 0.3 had a high likelihood of containing the answer.
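The compare-and-halt step can be sketched like this, representing each vector as a sparse dict of TF-IDF weights (the representation is an assumption; the vectorization details aren’t specified in the write-up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of weights)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_document(question_vec, doc_sentence_vecs, threshold=0.3):
    """Scan documents in order; halt at the first one containing a
    sentence whose similarity to the question clears the threshold."""
    for i, sentences in enumerate(doc_sentence_vecs):
        if any(cosine(question_vec, s) >= threshold for s in sentences):
            return i
    return None

q_vec = {"cat": 1.0, "sat": 1.0}
docs = [[{"dog": 1.0}],                # no overlap with the question
        [{"cat": 0.5, "mat": 0.5}]]    # shares 'cat' with the question
print(select_document(q_vec, docs))    # → 1
```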
Finding the focus window.
As mentioned previously, the Question object has that handy ‘Question Word’ stored, giving the program a powerful hint about the answer’s type. For example, if the query began with ‘Who’, the answer will most likely be tagged as a name. Similarly, if the question began with ‘When’, the answer will most likely carry a date or time tag. Thus, the program parses the selected document and collects all sentences containing a tag matching the answer type inferred from the ‘Question Word’.
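The inference step amounts to a small mapping from interrogative words to expected answer tags, plus a filter over the document’s tagged sentences. Both the tag names and the mapping below are illustrative assumptions:

```python
# Hypothetical mapping from question words to likely answer tags.
ANSWER_TAG = {
    "who":   {"PERSON"},
    "when":  {"DATE", "TIME"},
    "where": {"LOCATION"},
}

def candidate_sentences(question_word, tagged_sentences):
    """Keep only sentences containing at least one word whose tag
    matches the answer type inferred from the question word."""
    wanted = ANSWER_TAG.get(question_word, set())
    return [s for s in tagged_sentences
            if any(tag in wanted for _, tag in s)]

sents = [[("paris", "LOCATION"), ("won", "VBD")],
         [("monday", "DATE")]]
print(candidate_sentences("when", sents))
```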
Next, the system establishes a ‘focus window’ of size k to remove extraneous candidate answers from the possible answer set. This improves answer selection because the desired tag type (and answer) tends to appear close to the focus. A focus window of size five was found to be optimal. This means the system searches the current candidate pool for the focus (from the Question object), extracts the preceding five words, and assigns this string as the ‘focus window’. If the focus does not appear in the candidate pool, all candidates are passed to the next step.
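The window extraction itself is a few lines; this sketch assumes a flat word list and returns `None` when the focus is absent, signalling that all candidates should pass through:

```python
def focus_window(words, focus, k=5):
    """Return the k words preceding the focus term, or None when the
    focus is absent (in which case all candidates pass through)."""
    if focus in words:
        i = words.index(focus)
        return words[max(0, i - k):i]
    return None

words = ["the", "tallest", "mountain", "in", "the", "world", "is", "everest"]
print(focus_window(words, "everest"))  # → the five words before 'everest'
```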
The system now accounts for possible hyponym and hypernym answer possibilities. To do this, the system searches for the existence of a hyponym and/or hypernym of the focus word(s) within the focus window using the WordNet package of NLTK. Hyponymy shows the relationship between a generic term (hypernym) and a specific instance of it (hyponym). For example, if the question object had the object of ‘animal’ and the top-scored focus window had ‘dog’ in it, the system would select the noun phrase with ‘dog’ in it as an answer candidate because it is a hyponym of animal.
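The hyponym check walks up a hypernym hierarchy from each window word toward the focus. The sketch below substitutes a tiny hand-built hierarchy for NLTK’s WordNet interface (which provides the same idea via `wordnet.synsets` and hypernym closures) so it runs without the WordNet corpus installed; the hierarchy and function names are illustrative.

```python
# Toy hypernym hierarchy standing in for WordNet: each word maps to
# its immediate hypernym (more generic term).
HYPERNYMS = {"dog": "canine", "canine": "animal",
             "cat": "feline", "feline": "animal"}

def is_hyponym_of(word, target):
    """Walk up the hierarchy: True if `target` is an ancestor of `word`."""
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        if word == target:
            return True
    return False

def hyponym_matches(focus, window):
    """Return window words that are hyponyms of the question's focus."""
    return [w for w in window if is_hyponym_of(w, focus)]

# 'dog' is selected because it is a hyponym of the focus 'animal'.
print(hyponym_matches("animal", ["the", "dog", "ran"]))
```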
This greatly increased the accuracy of the system while making it behave more intuitively, handling ambiguity with the kind of linguistic similarity that humans manage so well.
Voilà, we have an answer!
After the linguistic filtering, the set of candidate answers is returned. More often than not, the candidate answer set consists of just one answer string, thanks to answer type inference and focus window selection. In that case, the noun contained in the returned noun phrase is provided as the answer. However, with a much larger and more diverse corpus, I predict this would not be the case. Sometimes, with hyponym and hypernym searching, there were multiple candidate answers. For the sake of brevity, a random candidate from the returned set is chosen in this case.