Jeeves 2.0

An NLP-based question-and-answer system.

Project overview.

Named as a play on the popular website ‘Ask Jeeves’, this project’s goal was to build an intelligent, NLP-based question-and-answer system for a given corpus of factual documents.


Corpus collection and tagging.

On initialization, the program collects and parses the provided corpus of documents (the sample dataset included 10,000 documents of clean, factual noun phrases). It does this by first splitting and indexing the documents, then feeding the results into Stanford’s Part-Of-Speech (POS) Tagger to tag each word. For those unfamiliar, the tagger iterates through the documents, labeling each word with its most likely Penn Treebank tag (noun, interjection, adverb, etc.). Stanford’s tagger was chosen as it is fairly accurate, robust, and most importantly: free. At the conclusion of the tagging sequence, these documents are considered ‘tokenized’.
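
As a rough illustration, here is a minimal sketch of this tokenize-and-tag step. It uses NLTK’s built-in averaged-perceptron tagger as a stand-in for Stanford’s tagger (both emit Penn Treebank tags); the actual implementation wraps Stanford’s Java tagger instead.

```python
import nltk

# One-time model downloads:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

def tokenize_corpus(documents):
    """Split each document into sentences, then tag every word with
    its most likely Penn Treebank tag."""
    tagged_docs = []
    for doc_id, text in enumerate(documents):
        sentences = nltk.sent_tokenize(text)
        tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]
        tagged_docs.append((doc_id, tagged))
    return tagged_docs

print(tokenize_corpus(["Rome is the capital of Italy."]))
# [(0, [[('Rome', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'),
#        ('of', 'IN'), ('Italy', 'NNP'), ('.', '.')]])]
```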

Now that the program has its processed data set, it’s time to accept and determine the meaning of the user's question.


Decoding the question.

To begin understanding the question, the program initializes the ‘Question’ object, the main source of indexed variables for later processing.

The first variable added to the Question object is the ‘Question Word’. The program determines it quite trivially: it is simply the first word of the user’s question string. Think about it: when you ask a question, you use a quite predictable set of interrogative words (‘who’, ‘when’, ‘how’, etc.). This question word will later help infer what tag the answer will most likely have.

After obtaining the question word, the remaining stop-words (e.g. ‘the’, ‘a’, ‘for’, ‘am’, etc.) are removed from the query, as these words have little linguistic or computational relevance. Additionally, deleting these words shortens the lookup time in the corpus.

To complete the Question object initialization, the words immediately preceding the noun phrase are removed, and the remaining string is denoted as the ‘focus’ of the Question object.
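
Putting the three steps together, a minimal sketch of the Question object might look like the following. The stop-word list and the focus heuristic (keeping the content words after the question word) are simplifications for illustration, not the exact logic of the original code.

```python
import re

# Illustrative stop-word list; a real system might use nltk.corpus.stopwords.
STOP_WORDS = {'the', 'a', 'an', 'for', 'am', 'is', 'was', 'of', 'to', 'in'}

class Question:
    """Indexed pieces of the user's query used by later stages."""

    def __init__(self, raw):
        tokens = re.findall(r"[\w']+", raw.lower())
        # 1. The question word is simply the first token.
        self.question_word = tokens[0]
        # 2. Drop stop-words; they carry little linguistic relevance.
        content = [t for t in tokens[1:] if t not in STOP_WORDS]
        # 3. Treat the remaining content words as the 'focus'.
        self.focus = ' '.join(content)

q = Question("Who was the first president of the United States?")
print(q.question_word)  # who
print(q.focus)          # first president united states
```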


TF-IDF scoring and document selection.

Now to get to the good stuff: how the program will pick the answer from the huge corpus of documents. It will first quantify the corpus with Term Frequency-Inverse Document Frequency (TF-IDF) scoring. In short, a word’s TF-IDF score grows with how often it appears in a document and shrinks with how common it is across the corpus. This weighting greatly improves answer selection by quantitatively exhibiting how important a word is in a document and throughout the entire corpus.
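
For reference, one common formulation of TF-IDF; the sketch below assumes this variant, though the original code may use a slightly different normalization.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF of `term` in one document relative to the whole corpus.

    tf  = how often the term appears in this document (normalized)
    idf = log of how rare the term is across all documents
    """
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    idf = math.log(len(corpus_tokens) / (1 + docs_with_term))
    return tf * idf
```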

Next, the program converts the context of the input question, determined previously in the Question object, to a vector using standard word-to-vector calculations. Similarly, every sentence in each document is converted to a vector and weighted with its TF-IDF score. Finally, the question vector and each sentence vector are compared using cosine similarity. If a document returns a similarity of 0.3 or greater, the vector calculations halt and that document is used for further answer selection. The cosine threshold of 0.3 was found to be an acceptable heuristic, as documents scoring 0.3 or higher had a high likelihood of containing the answer.
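
Cosine similarity itself is standard; here is a small sketch over sparse {term: weight} vectors, using the 0.3 threshold from above.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # heuristic described above

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors ({term: weight} dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# A document is kept once one of its sentences clears the threshold:
# if cosine_similarity(question_vec, sentence_vec) >= SIMILARITY_THRESHOLD: ...
```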


Finding the focus window.

As mentioned previously, the Question object has that handy ‘Question Word’ stored, giving the program a strong hint about the answer’s type. For example, if the query began with ‘Who’, the answer will most likely be tagged as a name. Similarly, if the question began with ‘When’, the answer will most likely carry a date or time tag. Thus, the program parses the selected document and collects all sentences containing an answer inference tag that matches the ‘Question Word’.
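
A hypothetical version of that question-word-to-tag mapping might look like this; the project’s real table and tag set may differ (these are Penn Treebank tags).

```python
# Hypothetical mapping from question words to the Penn Treebank tags
# an answer is expected to carry.
ANSWER_TAG_HINTS = {
    'who':   {'NNP', 'NNPS'},        # proper nouns (names)
    'when':  {'CD'},                 # numbers: years, dates, times
    'where': {'NNP', 'NN'},          # place names
    'what':  {'NN', 'NNS', 'NNP'},
}

def candidate_sentences(tagged_sentences, question_word):
    """Keep only sentences containing at least one word with an expected tag."""
    hints = ANSWER_TAG_HINTS.get(question_word, {'NN'})
    return [s for s in tagged_sentences
            if any(tag in hints for _word, tag in s)]
```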

Next, the system establishes a ‘focus window’ of size k to remove extraneous candidate answers from the possible answer set. This improves answer selection because the desired tag type (and answer) tends to appear close to the focus in the text. A focus window of size five was found to be optimal: the system searches the current candidate pool for the focus (from the Question object), extracts the five words immediately preceding it, and assigns this string as the ‘focus window’. If the focus does not appear in the candidate pool, all candidates are passed to the next step.
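
A sketch of the window extraction, assuming tokenized candidate sentences:

```python
WINDOW_SIZE = 5  # k = 5 performed best in testing

def focus_window(tokens, focus_tokens, k=WINDOW_SIZE):
    """Return the k words immediately preceding the focus,
    or None if the focus is absent (caller then keeps all candidates)."""
    n = len(focus_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == focus_tokens:
            return tokens[max(0, i - k):i]
    return None

print(focus_window(
    ['the', 'tallest', 'mountain', 'on', 'earth', 'is', 'mount', 'everest'],
    ['mount', 'everest']))
# ['tallest', 'mountain', 'on', 'earth', 'is']
```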


Considering hyponymy.

The system now accounts for possible hyponym and hypernym answer candidates. To do this, it searches for a hyponym and/or hypernym of the focus word(s) within the focus window using NLTK’s WordNet interface. Hyponymy describes the relationship between a generic term (hypernym) and a specific instance of it (hyponym). For example, if the Question object had a focus of ‘animal’ and the top-scored focus window contained ‘dog’, the system would select the noun phrase containing ‘dog’ as an answer candidate, because ‘dog’ is a hyponym of ‘animal’.
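
The check can be done by walking WordNet’s hypernym hierarchy; here is a small sketch using NLTK (the helper name is mine, not the project’s).

```python
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')  # one-time download

def is_hyponym_of(word, ancestor):
    """True if any sense of `word` sits below any sense of `ancestor`
    in WordNet's hypernym hierarchy."""
    ancestor_synsets = set(wn.synsets(ancestor))
    for synset in wn.synsets(word):
        # closure() walks every hypernym path transitively upward.
        if ancestor_synsets & set(synset.closure(lambda s: s.hypernyms())):
            return True
    return False

print(is_hyponym_of('dog', 'animal'))  # True
print(is_hyponym_of('animal', 'dog'))  # False
```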

In practice, this noticeably increased the accuracy of the system while making it behave more intuitively, handling linguistic ambiguity the way humans naturally do.


Voilà, we have an answer!

After the linguistic filtering, the set of candidate answers is returned. More often than not, the candidate answer set consists of just one answer string, thanks to answer type inference and focus window selection. In that case, the noun contained in the returned noun phrase is provided as the answer. However, with a much larger and more diverse corpus, I predict this would not be the case. Sometimes, with hyponym and hypernym searching, there would be multiple candidate answers. For the sake of brevity, a random candidate from the returned set is chosen in this case.
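
That final selection rule is simple enough to state directly; a sketch mirroring the description above:

```python
import random

def select_answer(candidates):
    """Return the lone candidate, or break ties at random."""
    if len(candidates) == 1:
        return candidates[0]
    return random.choice(candidates)
```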



Codebase can be found here.
