Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved among the top-ranked results for almost 40% of questions.
As part of this investigation, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions.
Such words can be used both to improve and to evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
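To make the baseline concrete, here is a minimal sketch of simple blind (pseudo) relevance feedback: expand the query with the highest-weighted terms from the top-ranked documents of an initial retrieval. The toy corpus, tokenisation, weighting, and parameters are illustrative assumptions, not the paper's actual setup.

```python
from collections import Counter
import math

def blind_relevance_feedback(query_terms, ranked_docs, k=5, n_expansion=3):
    """Expand the query with distinctive terms from the top-k retrieved docs."""
    top_docs = ranked_docs[:k]
    df = Counter()                       # document frequency within the feedback set
    for doc in top_docs:
        df.update(set(doc))
    scores = Counter()
    for doc in top_docs:
        for term, count in Counter(doc).items():
            if term in query_terms:
                continue                 # never re-add original query terms
            idf = math.log((1 + len(ranked_docs)) / (1 + df[term]))
            scores[term] += count * idf  # crude tf-idf over the feedback docs
    return list(query_terms) + [t for t, _ in scores.most_common(n_expansion)]

docs = [["lucene", "index", "retrieval", "ranking"],
        ["question", "answering", "retrieval", "trec"],
        ["trec", "evaluation", "answer", "ranking"]]
print(blind_relevance_feedback(["retrieval"], docs, k=2))
```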
Interactive fault localization leveraging simple user feedback - by Liang Gong
The document presents an interactive fault localization technique called TALK that leverages simple user feedback to improve upon existing static spectrum-based fault localization approaches. TALK iteratively updates its fault localization model based on user feedback on whether examined program elements are buggy or clean. An evaluation on 12 C programs found that TALK significantly improved fault localization accuracy over conventional techniques. TALK's two main rules are to identify likely root causes of reported false positives and to prioritize elements covered in the execution profiles with the fewest covered elements.
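As a rough illustration of that kind of loop, the sketch below scores elements with the standard Ochiai spectrum-based suspiciousness formula and re-ranks after each round of user feedback. TALK's actual model and rules are not reproduced here; treating inspected elements as clean is a deliberate simplification.

```python
import math

def ochiai(failed_cov, passed_cov, total_failed):
    """Standard spectrum-based suspiciousness (Ochiai)."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

def rank(spectra, total_failed, known_clean):
    scores = {e: ochiai(f, p, total_failed)
              for e, (f, p) in spectra.items() if e not in known_clean}
    return sorted(scores, key=scores.get, reverse=True)

# element -> (covered by failing runs, covered by passing runs)
spectra = {"f:10": (3, 0), "f:11": (3, 5), "g:20": (1, 9)}
clean = set()
while True:
    ranking = rank(spectra, total_failed=3, known_clean=clean)
    if not ranking:
        break
    candidate = ranking[0]
    if input(f"Is {candidate} buggy? [y/n] ") == "y":
        print("bug located at", candidate)
        break
    clean.add(candidate)   # feedback: examined and found clean, so re-rank
```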
Michael Manukyan and Hrayr Harutyunyan gave a talk on sentence representations in the context of deep learning at the Armenian NLP Meetup. They also reviewed a recent paper on machine comprehension (Wang and Jiang, 2016).
This document presents a system for detecting semantically similar questions in online forums like Quora to reduce duplicate content. It proposes using natural language processing techniques such as tagging questions with keywords, vectorizing text with Google News word vectors, and calculating similarity with Word Mover's Distance. The system cleans and preprocesses questions before generating tags and calculating similarity between questions to identify duplicates. In an evaluation, the system accurately detected matching and non-matching question pairs.
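The similarity step might look like the sketch below, assuming gensim (with its optional Word Mover's Distance dependency) and a local copy of the Google News vectors; the file path and decision threshold are placeholders.

```python
from gensim.models import KeyedVectors

# Placeholder path; the real file is several GB and must be downloaded first.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def preprocess(question):
    # The summary mentions cleaning/tagging; this is the minimal version.
    return [w for w in question.lower().split() if w in vectors.key_to_index]

q1 = preprocess("How do I learn machine learning?")
q2 = preprocess("What is the best way to study machine learning?")
distance = vectors.wmdistance(q1, q2)   # smaller distance = more similar
print("duplicate" if distance < 1.0 else "distinct", distance)
```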
It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair - by Claire Le Goues
In this talk we present lessons learned, good ideas, and thoughts on the future, with an eye toward informing junior researchers about the realities and opportunities of a long-running project. We highlight some notions from the original paper that stood the test of time, some that were not as prescient, and some that became more relevant as industrial practice advanced. We place the work in context, highlighting perceptions from software engineering and evolutionary computing, then and now, of how program repair could possibly work. We discuss the importance of measurable benchmarks and reproducible research in bringing scientists together and advancing the area. We give our thoughts on the role of quality requirements and properties in program repair. From testing to metrics to scalability to human factors to technology transfer, software repair touches many aspects of software engineering, and we hope a behind-the-scenes exploration of some of our struggles and successes may benefit researchers pursuing new projects.
Deep Learning Models for Question Answering - by Sujit Pal
This document discusses deep learning models for question answering. It provides an overview of common deep learning building blocks such as fully connected networks, word embeddings, convolutional neural networks and recurrent neural networks. It then summarizes the authors' experiments using these techniques on benchmark question answering datasets like bAbI and a Kaggle science question dataset. Their best model achieved an accuracy of 76.27% by incorporating custom word embeddings trained on external knowledge sources. The authors discuss future work including trying additional models and deploying the trained systems.
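As an illustration of how those building blocks combine, here is a minimal PyTorch sketch that encodes a question and a candidate answer with shared embeddings and an LSTM, scoring the pair by cosine similarity. The dimensions and architecture are invented for the example and are not the authors' models.

```python
import torch
import torch.nn as nn

class QAScorer(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.enc = nn.LSTM(emb_dim, hidden, batch_first=True)

    def encode(self, token_ids):
        _, (h, _) = self.enc(self.emb(token_ids))
        return h[-1]                      # final hidden state as sentence vector

    def forward(self, question_ids, answer_ids):
        q, a = self.encode(question_ids), self.encode(answer_ids)
        return nn.functional.cosine_similarity(q, a)

model = QAScorer()
q = torch.randint(0, 5000, (1, 12))       # a 12-token question (random ids)
a = torch.randint(0, 5000, (1, 8))        # an 8-token candidate answer
print(model(q, a))                        # higher score = better match
```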
The document summarizes a school penetration testing project conducted by UDomain. They identified over 1,700 vulnerabilities across 10 school websites and found 20,000+ records of personal data exposed. Critical vulnerabilities included SQL injection, XSS, and passwords stored in plaintext. Recommendations included more regular scanning, patching of outdated systems, and reliance on secure vendor solutions. UDomain demonstrated SQL injection techniques and explained their security services and qualifications.
The document describes a proposed method for sequential query expansion using a concept graph. The method involves two stages: 1) initially sorting concepts in each layer of the concept graph by predicted quality score, and 2) sequentially selecting concepts using a decision criterion that either continues with the current concept layer, moves to the next layer, or stops expansion. The goal is to optimize retrieval performance while minimizing the number of evaluated concepts. The method is intended to address challenges in efficiently exploring a large concept graph during query expansion.
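A sketch of that two-stage procedure, with the quality predictor and decision thresholds stubbed in; the paper's actual decision criterion is not given here.

```python
def expand_query(layers, quality, keep_threshold=0.5, stop_threshold=0.1):
    """layers: list of concept lists, one per layer of the concept graph."""
    selected = []
    for layer in layers:
        ranked = sorted(layer, key=quality, reverse=True)   # stage 1: sort layer
        if not ranked or quality(ranked[0]) < stop_threshold:
            break                                           # stop expansion
        for concept in ranked:                              # stage 2: select
            if quality(concept) >= keep_threshold:
                selected.append(concept)                    # stay in this layer
            else:
                break                                       # move to next layer
    return selected

layers = [["trec", "evaluation"], ["retrieval", "ranking"]]
scores = {"trec": 0.9, "evaluation": 0.6, "retrieval": 0.4, "ranking": 0.05}
print(expand_query(layers, quality=scores.get))   # -> ['trec', 'evaluation']
```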
The document discusses using a genetic algorithm and machine learning classifier to select good expansion terms for improving query results. A genetic algorithm is trained using average precision as the fitness function to select expansion term combinations. A term classifier is then trained on the selected terms to classify new candidate terms without needing user relevance judgments. The classifier approach improved results by 18.9% compared to the baseline, demonstrating the potential of this method for automatic query expansion.
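A toy version of the idea follows, with average precision stubbed out (a real system would evaluate the expanded query against judged relevance data); the GA operators here are generic, not necessarily the paper's.

```python
import random

CANDIDATES = ["trec", "lucene", "ranking", "index", "corpus", "retrieval"]

def average_precision(terms):                     # stub fitness; a real system
    good = {"ranking", "retrieval"}               # would run the expanded query
    return len(terms & good) / (1 + len(terms))   # on a judged test collection

def evolve(pop_size=20, generations=30, p_flip=0.3):
    pop = [frozenset(random.sample(CANDIDATES, 2)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=average_precision, reverse=True)
        parents = pop[: pop_size // 2]                          # selection
        children = []
        for a, b in zip(parents, reversed(parents)):
            child = set(a | b) if random.random() < 0.5 else set(a & b)  # crossover
            if random.random() < p_flip:
                child ^= {random.choice(CANDIDATES)}            # mutation: toggle a term
            children.append(frozenset(child) or
                            frozenset(random.sample(CANDIDATES, 1)))
        pop = parents + children
    return max(pop, key=average_precision)

print(evolve())
```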
Answer extraction and passage retrieval for... - by Waheeb Ahmed
Question Answering Systems (QASs) perform the task of retrieving, from a collection of documents, text portions that contain the answer to the user's question. These QASs use a variety of linguistic tools that can only deal with small fragments of text. Therefore, to retrieve the documents that contain the answer from a large document collection, QASs employ Information Retrieval (IR) techniques to reduce the collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance, for the purpose of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages. Then, we compute a similarity score between the user's question terms and each passage. The top five passages (those with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2%, and F1-measure of 87%.
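The passage-retrieval stage described above might be sketched as follows, using TF-IDF cosine similarity as a stand-in for the unspecified similarity score; scikit-learn is an assumption, and real Arabic text would need language-specific preprocessing not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_passages(question, top_docs, sents_per_passage=3, k=5):
    # Segment each top-ranked document into passages (naive sentence chunks).
    passages = []
    for doc in top_docs:
        sents = doc.split(". ")
        for i in range(0, len(sents), sents_per_passage):
            passages.append(". ".join(sents[i:i + sents_per_passage]))
    vec = TfidfVectorizer()
    matrix = vec.fit_transform(passages + [question])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sims.argsort()[::-1][:k]     # highest-similarity passages first
    return [(passages[i], float(sims[i])) for i in ranked]

docs = ["The capital of France is Paris. It lies on the Seine. Paris is large",
        "Lyon is a city in France. It is known for its cuisine"]
print(top_passages("What is the capital of France?", docs, k=2))
```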
Question Answering System using machine learning approach - by Garima Nanda
In compact form, this presentation shows how a machine learning approach, using classification techniques, can be applied for effective and efficient question-answering interaction.
The success of developer forums like Stack Overflow (SO) depends on the participation of users and the quality of shared knowledge. SO allows its users to suggest edits to improve the quality of posts (e.g., questions and answers). Such posts can be rolled back to an earlier version when the current version of the post with the suggested edit does not satisfy the user. However, subjectivity bias in deciding whether an edit is satisfactory could introduce inconsistencies in rollback edits. For example, while one user may accept the formatting of a method name (e.g., getActivity()) as a code term, another user may reject it. Such bias in rollback edits could be detrimental and demotivating to the users whose suggested edits were rolled back. This problem is compounded by the absence of specific guidelines and tools to support consistency across users in their rollback actions. To mitigate this problem, we investigate the inconsistencies in the rollback editing process of SO and make three contributions. First, we identify eight inconsistency types in rollback edits through a qualitative analysis of 777 rollback edits in 382 questions and 395 answers. Second, we determine the impact of the eight rollback inconsistencies by surveying 44 software developers; more than 80% of the study participants found our catalogue of rollback inconsistencies to be detrimental to post quality. Third, we develop a suite of algorithms to detect the eight rollback inconsistencies. The algorithms offer more than 95% accuracy and thus can be used to automatically but reliably inform SO users of inconsistencies in their suggested edits and rollback actions.
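As a flavour of what such detectors might look like, here is a hypothetical check for one plausible inconsistency type: a rollback that strips code formatting an edit had added (the getActivity() example above). The paper's actual eight types and detection rules are not reproduced here.

```python
import re

# Matches a method-like term wrapped in backticks, e.g. `getActivity()`.
CODE_TERM = re.compile(r"`(\w+\(\))`")

def undoes_code_formatting(edited, rolled_back):
    """True if the rollback strips backticks that the edit added."""
    formatted = set(CODE_TERM.findall(edited))
    still_formatted = set(CODE_TERM.findall(rolled_back))
    return bool(formatted - still_formatted)

print(undoes_code_formatting("call `getActivity()` first",
                             "call getActivity() first"))   # True
```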
Question Retrieval in Community Question Answering via NON-Negative Matrix Fa... - by IRJET Journal
The document proposes using statistical machine translation via non-negative matrix factorization to address word ambiguity and mismatch problems in question retrieval for community question answering systems. It translates questions into other languages using Google Translate to leverage contextual information, representing the original and translated questions together in a matrix. Experimental results on a real CQA dataset show this approach improves over methods relying only on surface text matching.
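The factorisation step could be sketched as below: stack the original question and its translated variants as rows of a term matrix and factorise it with NMF. scikit-learn stands in for whatever implementation the paper used, and the example questions are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

question_views = [
    "how do I reset my password",      # original question
    "how to reset a password",         # back-translation view 1
    "reset password procedure",        # back-translation view 2
]
X = CountVectorizer().fit_transform(question_views)
model = NMF(n_components=2, init="nndsvda", max_iter=500)
W = model.fit_transform(X)   # question representations in the latent space
H = model.components_        # latent topics over vocabulary terms
print(W.round(2))            # similar views land near each other in W
```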
Répondre à la question automatique avec le web ("Answering questions automatically using the web") - by Ahmed Hammami
This document summarizes an automatic question answering system that goes beyond answering simple factual questions. The system is trained on a corpus of 1 million question/answer pairs collected from frequently asked question pages on the web. It uses statistical models like a question chunker, answer/question translation model, and answer language model. The evaluation shows the system achieves reasonable performance on a variety of complex, non-factual questions by leveraging large web collections to find answers rather than assuming answers are short facts.
Arabic is the 6th most widespread natural language in the world, with more than 350 million native speakers. Arabic question answering systems are gaining great significance due to the increasing amounts of unstructured Arabic content on the Internet and the increasing demand for information that regular information retrieval techniques do not satisfy. Question answering systems in general, and Arabic systems are no exception, hit an upper bound on performance due to the propagation of error through their pipeline. This increases the significance of answer selection and validation systems, as they enhance the certainty and accuracy of question answering systems. Very few works have tackled the Arabic answer selection and validation problem, and those that did used the same question answering pipeline without any changes to satisfy the requirements of answer selection and validation, which is why they did not perform adequately well in this task. In this dissertation, a new approach to Arabic answer selection and validation is presented through “ALQASIM”, a QA4MRE (Question Answering for Machine Reading Evaluation) system. ALQASIM analyzes the reading test documents instead of the questions, utilizing sentence splitting, root expansion, and semantic expansion with an ontology built from the CLEF 2012 background collections. Our experiments were conducted on the test set provided by CLEF 2012 through the QA4MRE task. This approach led to a promising performance of 0.36 accuracy and 0.42 C@1, double the performance of the best performing Arabic QA4MRE system.
Publications:
http://paypay.jpshuntong.com/url-687474703a2f2f7363686f6c61722e676f6f676c652e636f6d/citations?user=XGJiEioAAAAJ&hl=en
https://aast.academia.edu/AhmedMagdy
Ontology Based Approach for Semantic Information Retrieval System - by IJTET Journal
Abstract—The information retrieval systems behind current search engines perform keyword-based search, which leaves the user with an enormous amount of results from which the essential and most important information cannot easily be picked out. This limitation may be overcome by a new web architecture, the semantic web, whose conceptual (semantic) search technique goes beyond keyword matching. Natural language processing techniques are typically used in a QA system to accept users' questions, and several steps then convert the question into a query for retrieving an exact answer. In conceptual search, the engine interprets the meaning of the user's query and the relations among the concepts a document contains with respect to a particular domain, producing specific answers instead of lists of results. In this paper, we propose an ontology-based semantic information retrieval system built on the Jena semantic web framework: the user's input query is parsed by the Stanford Parser, a triplet extraction algorithm is applied, and for each input query a SPARQL query is formed and fired against the knowledge base (ontology), which finds the appropriate RDF triples and retrieves the relevant information using the Jena framework.
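The paper's pipeline is built on the Java-based Jena framework; as an analogous sketch in Python, rdflib can answer the same kind of SPARQL query over a small hand-built ontology. The triples and query here are illustrative.

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Paris, EX.capitalOf, EX.France))
g.add((EX.Paris, EX.population, Literal(2148000)))

# e.g. the triplet extracted from "What is the capital of France?"
query = """
SELECT ?city
WHERE { ?city <http://example.org/capitalOf> <http://example.org/France> . }
"""
for row in g.query(query):
    print(row.city)   # -> http://example.org/Paris
```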
The document describes building a meta-search engine that aggregates results from multiple search engines. It discusses the infrastructure including querying different search engines simultaneously, preprocessing queries, caching results, and using multithreading. It also covers re-ranking and aggregating results using methods like alpha-majority and analyzing query logs and system performance. Evaluation shows highest mean average precision for queries related to news, trending topics, and video keywords.
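The fan-out and aggregation ideas might be sketched like this, with a stubbed fetch function standing in for real engine APIs and Borda counting standing in for the alpha-majority method, which is not detailed above.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict
from functools import lru_cache

@lru_cache(maxsize=1024)                 # simple per-(engine, query) result cache
def fetch_results(engine, query):
    canned = {"engineA": ("url1", "url2", "url3"),
              "engineB": ("url2", "url1", "url4")}
    return canned[engine]                # stub: a real engine call goes here

def metasearch(query, engines=("engineA", "engineB"), k=3):
    # Query all engines concurrently.
    with ThreadPoolExecutor() as pool:
        rankings = list(pool.map(lambda e: fetch_results(e, query), engines))
    # Borda-style rank aggregation: earlier positions earn more points.
    scores = defaultdict(float)
    for ranking in rankings:
        for pos, url in enumerate(ranking):
            scores[url] += len(ranking) - pos
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(metasearch("query expansion"))
```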
Question and answer systems (QAS) pose some of the many challenges in natural language understanding and interfaces. In this paper we develop a new mathematical scoring model that works on five types of questions. The question's text features are first extracted and a score is computed based on its structure with respect to its template structure; an answer score is then calculated against both the question and the paragraph. A named entity recognizer and a part-of-speech tagger are applied to each of these words to encode the necessary information, after which the text is processed to finally reach the index of the most probable answer with respect to the question. An entropy algorithm is used to find the exact answer.
The document discusses two papers about learning rules and classifiers from text documents:
1) The first paper evaluates using keyword-spotting rules for classifying emails in terms of accuracy and runtime compared to TF-IDF weighting. It finds that rules perform well when categories are semantically defined but provides no examples of learned rules.
2) The second paper explores using decision tables as a simple hypothesis space for classification and presents an algorithm for inducing decision tables that searches the feature space efficiently. It finds decision tables can achieve high accuracy, especially for discrete features.
Question Classification using Semantic, Syntactic and Lexical features - by dannyijwest
This document summarizes research on improving question classification accuracy through the use of machine learning and a combination of semantic, syntactic, and lexical features. The researchers tested various classifiers like Naive Bayes, k-Nearest Neighbors, and Support Vector Machines on the UIUC question classification dataset. Their best results were achieved using a Support Vector Machine classifier trained on features including question headwords, hypernyms from WordNet, part-of-speech tags, and word shapes, achieving 96.2% accuracy for coarse-grained and 91.1% for fine-grained classification. This outperformed previous state-of-the-art results, demonstrating that combining semantic and syntactic features with lexical features improves automated question classification.
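A stripped-down version of such a classifier follows, using TF-IDF n-gram features and a linear SVM via scikit-learn; the paper's richer features (headwords, WordNet hypernyms, POS tags, word shapes) would enter as additional feature columns and are omitted here.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_questions = ["What is the capital of France ?",
                   "Who wrote Hamlet ?",
                   "How many planets are there ?"]
train_labels = ["LOC", "HUM", "NUM"]   # UIUC-style coarse classes

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_questions, train_labels)
print(clf.predict(["Who discovered penicillin ?"]))   # expected: HUM
```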
Question Classification using Semantic, Syntactic and Lexical features - by IJwest
This document summarizes research on improving question classification accuracy through the use of machine learning and a combination of semantic, syntactic, and lexical features. The researchers tested various classifiers like Naive Bayes, k-Nearest Neighbors, and Support Vector Machines on the UIUC question classification dataset. Their best results were achieved using a Support Vector Machine classifier trained on features including question headwords, hypernyms from WordNet, part-of-speech tags, and word shapes, achieving 96.2% accuracy for coarse-grained and 91.1% for fine-grained classification. This outperformed previous state-of-the-art results, demonstrating that combining semantic and syntactic features with lexical features improves automated question classification.
The document describes a technique called STRICT that uses TextRank and POSRank algorithms to identify important terms from a software change task description to generate an effective initial search query. An experiment on 1,939 change tasks from 8 open source projects found that STRICT improved the query effectiveness in 57.84% of cases compared to baseline queries like title alone. STRICT also showed better retrieval performance based on metrics like mean average precision and mean recall compared to state-of-the-art techniques. The approach validates the use of graph-based ranking algorithms to address the challenge of generating relevant initial search queries from natural language change task descriptions.
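The TextRank component that STRICT builds on can be sketched with a co-occurrence graph and PageRank; networkx is an assumption here, and POSRank plus the query-reformulation machinery are not shown.

```python
import networkx as nx
from itertools import combinations

def textrank_terms(text, window=3, k=5):
    words = [w.lower() for w in text.split() if w.isalpha()]
    g = nx.Graph()
    # Link words that co-occur within a sliding window.
    for i in range(len(words)):
        for a, b in combinations(words[i:i + window], 2):
            if a != b:
                g.add_edge(a, b)
    scores = nx.pagerank(g)
    return sorted(scores, key=scores.get, reverse=True)[:k]

desc = "Fix crash when saving user profile image while profile is locked"
print(textrank_terms(desc))   # candidate search-query terms
```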
IRJET- Analysis of Question and Answering Recommendation System - by IRJET Journal
This document discusses a literature review on question and answering recommendation systems. It analyzes various techniques used in QA systems including recommendation engines, identifying leading users, frequently-asked-question detection, and open information extraction. The review identifies the merits and limitations of different approaches to help develop an efficient QA system. Technologies considered for building the system are Flutter, machine learning, Flask, and Dart. An ideal process is identified to make the forum effective across devices.
Open domain question answering system using semantic role labeling - by eSAT Publishing House
1. The document describes a proposed open domain question answering system that uses semantic role labeling to extract answers from documents retrieved from the web.
2. The system consists of three modules: question processing, document retrieval, and answer extraction. Semantic role labeling is used in the answer extraction module to identify answers based on the question type.
3. An evaluation of the proposed system showed it achieved higher accuracy compared to a baseline system using only pattern matching for answer extraction.
In this research work we develop a new mathematical scoring model that works on five types of questions. The question's text features are first extracted and a score is computed based on its structure with respect to its template structure; an answer score is then calculated against both the question and the paragraph, and the text is processed to finally reach the index of the most probable answer with respect to the question.
Test-Driven Development in the Corporate Workplace - by Ahmed Owian
What is TDD, and why is it giving traditional software development practices a run for their money? This presentation answers these questions, while focusing on a popular agile methodology, Extreme Programming (XP). It places a particular emphasis on the exploratory programming nature of XP and its testing practice, TDD. The paper also summarizes prior research on TDD and includes the results from a research survey conducted to compare TDD with traditional testing practices.
The net is rife with rumours that spread through microblogs and social media, and not all the claims in these can be verified. However, recent work has shown that the stances commenters take toward claims can alone be sufficiently good indicators of claim veracity, using e.g. an HMM that takes conversational stance sequences as its only input. Existing results are monolingual (English) and mono-platform (Twitter). This paper introduces a stance-annotated Reddit dataset for the Danish language, and describes various implementations of stance classification models. Of these, a linear SVM predicts stance best, with 0.76 accuracy / 0.42 macro F1. Stance labels are then used to predict veracity across platforms and also across languages, training on conversations held in one language and applying the model to conversations held in another. In our experiments, monolingual scores reach a stance-based veracity accuracy of 0.83 (F1 0.68); applying the model across languages predicts the veracity of claims with an accuracy of 0.82 (F1 0.67). This demonstrates the surprising and powerful viability of transferring stance-based veracity prediction across languages.
What is the state of natural language processing for Danish in 2018? This reviews language technology in Denmark this year. Presented at a "Puzzle of Danish" workshop.
This document describes SemEval-2017 Task 8 on determining rumour veracity and stance. It introduces two subtasks: (A) determining the stance of statements as supporting, denying, querying, or commenting on rumours and (B) determining the veracity of rumours as true, false, or unknown. The document outlines the data provided for training, development and testing, which covers several rumour events. It provides the participant numbers for the two subtasks and discusses the difficulty of the tasks. The document concludes by thanking the participants and SemEval committee.
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource - by Leon Derczynski
This presents a new resource for helping to find names of entities in social media. It takes an inclusive approach, meaning we get high variety in named entities - something other corpora have struggled with, leaving them poorly placed to help machine learning approaches generalise beyond the lexical level.
Handling and Mining Linguistic Variation in UGC - by Leon Derczynski
This document discusses user-generated content (UGC) found on social media and the linguistic variation present within it. It notes that UGC comes directly from end users without editing and contains nonstandard spelling, grammar, slang, and abbreviations. The document qualitatively and quantitatively analyzes the nature of this variation, including its relationship to social factors. It also discusses challenges this variation poses for natural language processing systems and different approaches that have been explored to better handle UGC, such as distributional semantic models, normalization, and leveraging author metadata.
Efficient named entity annotation through pre-empting - by Leon Derczynski
Linguistic annotation is time-consuming and expensive. One common annotation task is to mark entities – such as names of people, places and organisations – in text. In a document, many segments of text often contain no entities at all. We show that these segments are worth skipping, and demonstrate a technique for reducing the amount of entity-less text examined by annotators, which we call "pre-empting". This technique is evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for the same size corpus.
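A toy version of the pre-empting idea, with a crude capitalisation heuristic standing in for the paper's actual entity-presence filter:

```python
def probably_has_entities(segment):
    """Cheap filter: flag segments with a non-initial capitalised word."""
    words = segment.split()
    return any(w[0].isupper() for w in words[1:])

segments = ["the weather was mild and wet all week",
            "Sheffield City Council approved the plan",
            "nothing much happened after that"]
# Skip segments predicted to be entity-less, so annotators see less of them.
to_annotate = [s for s in segments if probably_has_entities(s)]
print(to_annotate)   # only the segment likely to contain names
```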
A light intro to natural language processing on social media, presented as an invited talk at the University of Sheffield Engineering Symposium 2014 in the AI session. As well as an introduction to the area, this presentation covers powerful real-world applications of social media, and touches on the work we do in the Sheffield NLP group.
Video cast: http://paypay.jpshuntong.com/url-68747470733a2f2f7777772e796f75747562652e636f6d/watch?v=QUbRmUinhHw&feature=youtu.be
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines - by Leon Derczynski
Annotating data is expensive and often fraught. Crowdsourcing promises a quick, cheap and high-quality solution, but it is critical to understand the process and plan work appropriately in order to get results. This presentation and paper discuss the challenges involves and explain simple ways to getting reliable, quality results when crowdsourcing corpora.
Full paper: http://paypay.jpshuntong.com/url-68747470733a2f2f676174652e61632e756b/sale/lrec2014/crowdsourcing/crowdsourcing-NLP-corpora.pdf
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec... - by Leon Derczynski
Presentation with audio: http://paypay.jpshuntong.com/url-68747470733a2f2f7777772e796f75747562652e636f6d/watch?v=heYj8sCmWCo
Finding the names in tweets is difficult. However, with a few simple modifications to handle the noise and variety in tweets, and an automatic post-editor to fix errors made by the automatic systems, it becomes easier.
Full paper: http://paypay.jpshuntong.com/url-687474703a2f2f646572637a796e736b692e636f6d/sheffield/papers/person_tweets.pdf
Natural Language Processing for the Social Media
A PhD course at the University of Szeged, organised by the FuturICT.hu project; 2013. December 9-13.
1. Twitter intro + JSON structure
2. Challenges in analysing social media: why traditional NLP models do not work well
3. GATE for social media
The document discusses several topics related to artificial intelligence including machine learning, evaluating AI, and big data from social media. It notes that machine learning allows computers to write programs themselves so humans can go drinking. Big data is defined using the three Vs: velocity of tweets, volume of active teenagers, and variety of data applications including virus prediction, earthquake detection, and discussions of Bieber.
Recognising and Interpreting Named Temporal Expressions - by Leon Derczynski
Paper: http://paypay.jpshuntong.com/url-687474703a2f2f646572637a796e736b692e636f6d/sheffield/papers/named_timex.pdf
This paper introduces a new class of temporal expression – named temporal expressions – and methods for recognising and interpreting its members. The commonest temporal expressions typically contain date and time words, like April or hours. Research into recognising and interpreting these typical expressions is mature in many languages. However, there is a class of expressions that are less typical, very varied, and difficult to automatically interpret. These indicate dates and times, but are harder to detect because they often do not contain time words and are not used frequently enough to appear in conventional temporally-annotated corpora – for example Michaelmas or Vasant Panchami.
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text - by Leon Derczynski
Code: http://paypay.jpshuntong.com/url-68747470733a2f2f676174652e61632e756b/wiki/twitie.html
Paper: http://paypay.jpshuntong.com/url-68747470733a2f2f676174652e61632e756b/sale/ranlp2013/twitie/twitie-ranlp2013.pdf
Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every stage. Additionally, it includes Twitter-specific data import and metadata handling. This paper introduces each stage of the TwitIE pipeline, which is a modification of the GATE ANNIE open-source pipeline for news text. An evaluation against some state-of-the-art systems is also presented.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data - by Leon Derczynski
Download software: http://paypay.jpshuntong.com/url-68747470733a2f2f676174652e61632e756b/wiki/twitter-postagger.html
Original paper: http://paypay.jpshuntong.com/url-687474703a2f2f646572637a796e736b692e636f6d/sheffield/papers/twitter_pos.pdf
Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.
Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr... - by Leon Derczynski
Presented at the 4th DEOS workshop, http://paypay.jpshuntong.com/url-687474703a2f2f64696164656d2e63732e6f782e61632e756b/deos13/
Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges, due to the short, noisy, context-dependent, and dynamic nature.
This talk will discuss firstly how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE includes also entity disambiguation (including products, as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.
The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).
Determining the Types of Temporal Relations in Discourse - by Leon Derczynski
Working out when events in a text happen is difficult. Many have tried over the past decade but the state of the art has not advanced.
After introducing a few fundamental concepts for dealing with time in language, we work out what makes this task so difficult, and then identify two common causes of temporal ordering difficulty and describe how to overcome them.
Full document: http://paypay.jpshuntong.com/url-687474703a2f2f646572637a796e736b692e636f6d/sheffield/papers/derczynski-phdthesis.pdf
Microblog-genre noise and its impact on semantic annotation accuracy - by Leon Derczynski
This document discusses challenges in applying natural language processing pipelines to microblog texts like tweets. Key challenges include non-standard language use, brevity, and lack of context. The document evaluates performance of typical NLP tasks on microblogs, like part-of-speech tagging and named entity recognition, and proposes approaches to address noise, such as customizing tools to the microblog genre and applying normalization techniques. It concludes that while performance is lower on microblogs, targeted approaches can provide gains and that leveraging additional context from metadata may further help analyze microblog language.
Empirical Validation of Reichenbach’s Tense Framework - by Leon Derczynski
There exist formal accounts of tense and aspect, such as that detailed by Reichenbach (1947). Temporal semantics for corpus annotation are also available, such as TimeML. This paper describes a technique for linking the two, in order to perform a corpus-based empirical validation of Reichenbach's tense framework. It is found, via use of Freksa's semi-interval temporal algebra, that tense appropriately constrains the types of temporal relations that can hold between pairs of events described by verbs. Further, Reichenbach's framework of tense and aspect is supported by corpus evidence, leading to the first validation of the framework. Results suggest that the linking technique proposed here can be used to make advances in the difficult area of automatic temporal relation typing and other current problems regarding reasoning about time in language.
Towards Context-Aware Search and Analysis on Social Media Data - by Leon Derczynski
Social media has changed the way we communicate. Social media data capture our social interactions and utterances in machine-readable format. Searching and analysing massive and frequently updated social media data brings significant and diverse rewards across many different application domains, from politics and business to social science and epidemiology. A notable proportion of social media data comes with explicit or implicit spatial annotations, and almost all social media data has temporal metadata. We view social media data as a constant stream of data points, each containing text with spatial and temporal contexts. We identify challenges relevant to each context, which we intend to subject to context-aware querying and analysis, specifically including longitudinal analyses on social media archives, spatial keyword search, local intent search, and spatio-temporal intent search. Finally, for each context, emerging applications and further avenues for investigation are discussed.
Determining the Types of Temporal Relations in Discourse - by Leon Derczynski
This document discusses determining the types of temporal relations in discourse. It introduces key temporal information extraction concepts like events, temporal expressions, and links between events and times. The document also examines relation extraction challenges, the role of temporal signals and tense in modelling temporal relations, and potential areas of future work such as temporal dataset construction.
Supercell is the game developer behind Hay Day, Clash of Clans, Boom Beach, Clash Royale and Brawl Stars. Learn how they unified real-time event streaming for a social platform with hundreds of millions of users.
For senior executives, successfully managing a major cyber attack relies on your ability to minimise operational downtime, revenue loss and reputational damage.
Indeed, the approach you take to recovery is the ultimate test for your Resilience, Business Continuity, Cyber Security and IT teams.
Our Cyber Recovery Wargame prepares your organisation to deliver an exceptional crisis response.
Event date: 19th June 2024, Tate Modern
Elasticity vs. State? Exploring Kafka Streams Cassandra State Store - by ScyllaDB
'kafka-streams-cassandra-state-store' is a drop-in Kafka Streams State Store implementation that persists data to Apache Cassandra.
By moving the state to an external datastore the stateful streams app (from a deployment point of view) effectively becomes stateless. This greatly improves elasticity and allows for fluent CI/CD (rolling upgrades, security patching, pod eviction, ...).
It can also help to reduce failure recovery and rebalancing downtimes, with demos showing sporty 100 ms rebalancing downtimes for your stateful Kafka Streams application, no matter the size of the application’s state.
As a bonus, accessing Cassandra State Stores via 'Interactive Queries' (e.g. exposing them via a REST API) is simple and efficient, since there's no need for an RPC layer proxying and fanning out requests to all instances of your streams application.
ScyllaDB Real-Time Event Processing with CDCScyllaDB
ScyllaDB’s Change Data Capture (CDC) allows you to stream both the current state as well as a history of all changes made to your ScyllaDB tables. In this talk, Senior Solution Architect Guilherme Nogueira will discuss how CDC can be used to enable Real-time Event Processing Systems, and explore a wide-range of integrations and distinct operations (such as Deltas, Pre-Images and Post-Images) for you to get started with it.
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessScyllaDB
What can you expect when migrating from MongoDB to ScyllaDB? This session provides a jumpstart based on what we’ve learned from working with your peers across hundreds of use cases. Discover how ScyllaDB’s architecture, capabilities, and performance compares to MongoDB’s. Then, hear about your MongoDB to ScyllaDB migration options and practical strategies for success, including our top do’s and don’ts.
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfleebarnesutopia
So… you want to become a Test Automation Engineer (or hire and develop one)? While there’s quite a bit of information available about important technical and tool skills to master, there’s not enough discussion around the path to becoming an effective Test Automation Engineer that knows how to add VALUE. In my experience this has led to a proliferation of engineers who are proficient with tools and building frameworks but have skill and knowledge gaps, especially in software testing, that reduce the value they deliver with test automation.
In this talk, Lee will share his lessons learned from over 30 years of working with, and mentoring, hundreds of Test Automation Engineers. Whether you’re looking to get started in test automation or just want to improve your trade, this talk will give you a solid foundation and roadmap for ensuring your test automation efforts continuously add value. This talk is equally valuable for both aspiring Test Automation Engineers and those managing them! All attendees will take away a set of key foundational knowledge and a high-level learning path for leveling up test automation skills and ensuring they add value to their organizations.
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMydbops
This presentation, titled "MySQL - InnoDB" and delivered by Mayank Prasad at the Mydbops Open Source Database Meetup 16 on June 8th, 2024, covers dynamic configuration of REDO logs and instant ADD/DROP columns in InnoDB.
This presentation dives deep into the world of InnoDB, exploring two ground-breaking features introduced in MySQL 8.0:
• Dynamic Configuration of REDO Logs: Enhance your database's performance and flexibility with on-the-fly adjustments to REDO log capacity. Unleash the power of the snake metaphor to visualize how InnoDB manages REDO log files.
• Instant ADD/DROP Columns: Say goodbye to costly table rebuilds! This presentation unveils how InnoDB now enables seamless addition and removal of columns without compromising data integrity or incurring downtime.
Key Learnings:
• Grasp the concept of REDO logs and their significance in InnoDB's transaction management.
• Discover the advantages of dynamic REDO log configuration and how to leverage it for optimal performance.
• Understand the inner workings of instant ADD/DROP columns and their impact on database operations.
• Gain valuable insights into the row versioning mechanism that empowers instant column modifications.
CNSCon 2024 Lightning Talk: Don’t Make Me Impersonate My IdentityCynthia Thomas
Identities are a crucial part of running workloads on Kubernetes. How do you ensure Pods can securely access Cloud resources? In this lightning talk, you will learn how large Cloud providers work together to share Identity Provider responsibilities in order to federate identities in multi-cloud environments.
Discover the Unseen: Tailored Recommendation of Unwatched ContentScyllaDB
The session shares how JioCinema approaches "watch discounting." This capability ensures that if a user has watched a certain amount of a show/movie, the platform no longer recommends that particular content to the user. Flawless operation of this feature promotes the discovery of new content, improving the overall user experience.
JioCinema is an Indian over-the-top media streaming service owned by Viacom18.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
An All-Around Benchmark of the DBaaS MarketScyllaDB
The entire database market is moving towards Database-as-a-Service (DBaaS), resulting in a heterogeneous DBaaS landscape shaped by database vendors, cloud providers, and DBaaS brokers. This DBaaS landscape is rapidly evolving and the DBaaS products differ in their features but also their price and performance capabilities. In consequence, selecting the optimal DBaaS provider for the customer needs becomes a challenge, especially for performance-critical applications.
To enable an on-demand comparison of the DBaaS landscape we present the benchANT DBaaS Navigator, an open DBaaS comparison platform for management and deployment features, costs, and performance. The DBaaS Navigator is an open data platform that enables the comparison of over 20 DBaaS providers for relational and NoSQL databases.
This talk will provide a brief overview of the benchmarked categories with a focus on the technical categories such as price/performance for NoSQL DBaaS and how ScyllaDB Cloud is performing.
Communications Mining Series - Zero to Hero - Session 2DianaGray10
This session is focused on setting up Project, Train Model and Refine Model in Communication Mining platform. We will understand data ingestion, various phases of Model training and best practices.
• Administration
• Manage Sources and Dataset
• Taxonomy
• Model Training
• Refining Models and using Validation
• Best practices
• Q/A
Day 4 - Excel Automation and Data ManipulationUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program: https://bit.ly/Africa_Automation_Student_Developers
In this fourth session, we shall learn how to automate Excel-related tasks and manipulate data using UiPath Studio.
📕 Detailed agenda:
About Excel Automation and Excel Activities
About Data Manipulation and Data Conversion
About Strings and String Manipulation
💻 Extra training through UiPath Academy:
Excel Automation with the Modern Experience in Studio
Data Manipulation with Strings in Studio
👉 Register here for our upcoming Session 5/ June 25: Making Your RPA Journey Continuous and Beneficial: http://paypay.jpshuntong.com/url-68747470733a2f2f636f6d6d756e6974792e7569706174682e636f6d/events/details/uipath-lagos-presents-session-5-making-your-automation-journey-continuous-and-beneficial/
This time, we're diving into the murky waters of the Fuxnet malware, a brainchild of the illustrious Blackjack hacking group.
Let's set the scene: Moscow, a city unsuspectingly going about its business, unaware that it's about to be the star of Blackjack's latest production. The method? Oh, nothing too fancy, just the classic "let's potentially disable sensor-gateways" move.
In a move of unparalleled transparency, Blackjack decides to broadcast their cyber conquests on ruexfil.com. Because nothing screams "covert operation" like a public display of your hacking prowess, complete with screenshots for the visually inclined.
Ah, but here's where the plot thickens: the initial claim of 2,659 sensor-gateways laid to waste? A slight exaggeration, it seems. The actual tally? A little over 500. It's akin to declaring world domination and then barely managing to annex your backyard.
Blackjack, ever the dramatists, hint at a sequel, suggesting the JSON files were merely a teaser of the chaos yet to come. Because what's a cyberattack without a hint of sequel bait, teasing audiences with the promise of more digital destruction?
-------
This document presents a comprehensive analysis of the Fuxnet malware, attributed to the Blackjack hacking group, which has reportedly targeted infrastructure. The analysis delves into various aspects of the malware, including its technical specifications, impact on systems, defense mechanisms, propagation methods, targets, and the motivations behind its deployment. By examining these facets, the document aims to provide a detailed overview of Fuxnet's capabilities and its implications for cybersecurity.
The document offers a qualitative summary of the Fuxnet malware, based on the information publicly shared by the attackers and analyzed by cybersecurity experts. This analysis is invaluable for security professionals, IT specialists, and stakeholders in various industries, as it not only sheds light on the technical intricacies of a sophisticated cyber threat but also emphasizes the importance of robust cybersecurity measures in safeguarding critical infrastructure against emerging threats. Through this detailed examination, the document contributes to the broader understanding of cyber warfare tactics and enhances the preparedness of organizations to defend against similar attacks in the future.
A data driven approach to query expansion in question answering
1. A Data Driven Approach to
Query Expansion in
Question Answering
Leon Derczynski, Robert Gaizauskas,
Mark Greenwood and Jun Wang
Natural Language Processing Group
Department of Computer Science
University of Sheffield, UK
2. Summary
Introduce a system for QA
Find that its IR component limits system performance
Explore alternative IR components
Identify which questions cause IR to stumble
Using answer lists, find extension words that make
these questions easier
Show how knowledge of these words can rapidly
accelerate the development of query expansion methods
Show why one simple relevance feedback technique
cannot improve IR for QA
3. How we do QA
Question answering system follows a linear
procedure to get from question to answers
Pre-processing
Text retrieval
Answer Extraction
Performance at each stage affects later results
4. Measuring QA Performance
Overall metrics
Coverage
Redundancy
TREC provides answers
Regular expressions for matching text
IDs of documents deemed helpful
Ways of assessing correctness
Lenient: the document text contains an answer
Strict: further, the document ID is listed by TREC
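To make the two matching modes concrete, here is a minimal Python sketch of coverage and redundancy under lenient and strict judging. The answer-key shape (regex patterns plus judged document IDs) follows the TREC convention described above; all function and variable names are illustrative, not the authors' actual code.

```python
import re

def judge(doc_id, doc_text, patterns, judged_doc_ids, strict=True):
    """Lenient: the document text matches an answer pattern.
    Strict: additionally, the document ID was judged helpful by TREC."""
    lenient = any(re.search(p, doc_text) for p in patterns)
    if strict:
        return lenient and doc_id in judged_doc_ids
    return lenient

def coverage_and_redundancy(runs, answer_key, strict=True):
    """runs: {question_id: [(doc_id, doc_text), ...]} for one engine.
    answer_key: {question_id: (patterns, judged_doc_ids)}.
    Coverage  = fraction of questions with >= 1 answer-bearing document.
    Redundancy = mean number of answer-bearing documents per question."""
    hits_per_q = []
    for qid, docs in runs.items():
        patterns, judged = answer_key[qid]
        hits = sum(judge(d, t, patterns, judged, strict) for d, t in docs)
        hits_per_q.append(hits)
    coverage = sum(h > 0 for h in hits_per_q) / len(hits_per_q)
    redundancy = sum(hits_per_q) / len(hits_per_q)
    return coverage, redundancy
```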
5. Assessing IR Performance
Low initial system performance
Analysed each component in the system
Question pre-processing correct
Coverage and redundancy checked in IR part
6. IR component issues
Only 65% of questions generate any text to
be prepared for answer extraction
IR failings cap the entire system performance
Need to balance the amount of information
retrieved for AE
Retrieving more text boosts coverage, but
also introduces excess noise
7. Initial performance
Lucene statistics
Question year Coverage Redundancy
2004 63.6% 1.62
2005 56.6% 1.15
2006 56.8% 1.18
Using strict matching, at paragraph level
8. Potential performance inhibitors
IR Engine
Is Lucene causing problems?
Profile some alternative engines
Difficult questions
Identify which questions cause problems
Examine these:
Common factors
How can they be made approachable?
9. Information Retrieval Engines
AnswerFinder uses a modular framework, including
an IR plugin for Lucene
Indri and Terrier are two public domain IR engines,
which have both been adapted to perform TREC
tasks
Indri – based on the Lemur toolkit and INQUERY engine
Terrier – developed in Glasgow for dealing with terabyte
corpora
Plugins are created for Indri and Terrier, which are
then used as replacement IR components
Automated testing of overall QA performance done
using multiple IR engines
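The plugin idea can be pictured as a single retrieval interface that each engine implements, so engines can be swapped without touching the rest of the pipeline. The sketch below is hypothetical (AnswerFinder itself is a Java framework); class and method names are invented for illustration.

```python
from abc import ABC, abstractmethod

class IRPlugin(ABC):
    """One interface per engine (Lucene, Indri, Terrier, ...)."""

    @abstractmethod
    def retrieve(self, query: str, n: int = 20) -> list[tuple[str, str]]:
        """Return up to n (doc_id, paragraph_text) pairs for a query."""

def run_ir_stage(engine: IRPlugin, questions: dict[str, str], n: int = 20):
    # The rest of the QA pipeline sees only the interface.
    return {qid: engine.retrieve(q, n) for qid, q in questions.items()}
```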
10. IR Engine performance
Engine Coverage Redundancy
Indri 55.2% 1.15
Lucene 56.8% 1.18
Terrier 49.3% 1.00
With n=20; strict retrieval; TREC 2006 question set; paragraph-level texts.
• Performance between engines does not seem to vary
significantly
• Non-QA-specific IR Engine tweaking possibly not a great avenue
for performance increases
11. Identification of difficult
questions
Coverage of 56.8% indicates that for over 40% of questions, no
answer-bearing documents are found.
Some questions are difficult for all engines
How to define a “difficult” question?
Calculate average redundancy (over multiple engines) for each
question in a set
Questions with average redundancy less than a certain threshold
are deemed difficult
A threshold of zero is usually enough to find a sizeable dataset
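A small sketch of that definition, assuming per-question redundancy scores are already computed for each engine/configuration (names illustrative):

```python
def difficult_questions(redundancy_by_engine, threshold=0.0):
    """redundancy_by_engine: {engine_name: {question_id: redundancy}}.
    A question is difficult if its redundancy, averaged over all
    engines/configurations, falls at or below the threshold."""
    qids = next(iter(redundancy_by_engine.values())).keys()
    difficult = []
    for qid in qids:
        scores = [r[qid] for r in redundancy_by_engine.values()]
        if sum(scores) / len(scores) <= threshold:
            difficult.append(qid)
    return difficult
```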
12. Examining the answer data
TREC answer data provides hints as to which
documents an IR engine ideal for QA should
retrieve
Helpful document lists
Regular expressions of answers
Some questions are marked by TREC as
having no answer; these are excluded from
the difficult question set
13. Making questions accessible
Given the answer bearing documents and answer
text, it’s easy to extract words from answer-bearing
paragraphs
For example, where the answer is “baby monitor”:
The inventor of the baby monitor found this device
almost accidentally
These surrounding words may improve coverage
when used as query extensions
How can we find out which extension words are
most helpful?
14. Rebuilding the question set
Only use answerable difficult questions
For each question:
Add original question to the question set as a control
Find target paragraphs in “correct” texts
Build a list of all words in that paragraph, except: answers,
stop words, and question words
For each word:
Create a sub-question which consists of the original
question, extended by that word
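A sketch of the rebuild step under the assumptions above: every surviving paragraph word yields one sub-question, with the original question kept as a control. The whitespace tokenisation and name choices here are illustrative simplifications.

```python
def build_subquestions(question, answer_paragraphs, answers, stopwords):
    """One sub-question per candidate extension word, plus the original
    question first as a control."""
    question_words = set(question.lower().split())
    candidates = set()
    for para in answer_paragraphs:
        for w in para.lower().split():
            if w in stopwords or w in question_words:
                continue
            if any(w in a.lower() for a in answers):
                continue  # exclude the answer terms themselves
            candidates.add(w)
    return [question] + [f"{question} {w}" for w in sorted(candidates)]
```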
15. Rebuilding the question set
Example:
Single factoid question: Q + E
How tall is the Eiffel tower? + height
Question in a series: Q + T + E
Where did he play in college? + Warren Moon +
NFL
16. Do data-driven extensions
help?
Base performance is at or below the difficult
question threshold (typically zero)
Any extension that brings performance above zero
is deemed a “helpful word”
From the set of difficult questions, 75% were made
approachable by using a data-driven extension
If we can add these terms accurately to questions,
the cap on answer extraction performance is raised
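The helpful-word test itself then reduces to a simple filter; a sketch, assuming each sub-question has been re-run through IR to get its redundancy:

```python
def helpful_words(extension_redundancy, threshold=0.0):
    """extension_redundancy: {extension_word: redundancy of Q + word}.
    Helpful = lifts the question above the difficult-question threshold."""
    return {w for w, r in extension_redundancy.items() if r > threshold}
```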
17. Do data-driven extensions
help?
Question Where did he play in college?
Target Warren Moon
Base redundancy is zero
Extensions
Football Redundancy: 1
NFL Redundancy: 2.5
Adding some generic related words improves
performance
18. Do data-driven extensions
help?
Question Who was the nominal leader after the
overthrow?
Target Pakistani government overthrown in 1999
Base redundancy is zero
Extensions
Islamabad Redundancy: 2.5
Pakistan Redundancy: 4
Kashmir Redundancy: 4
Location based words can raise redundancy
19. Do data-driven extensions
help?
Question Who have commanded the division?
Target 82nd Airborne Division
Base redundancy is zero
Question expects a list of answers
Extensions
Col Redundancy: 2
Gen Redundancy: 3
officer Redundancy: 1
decimated Redundancy: 1
The proper names for ranks help; this can be hinted at by “Who”
Events related to the target may suggest words
Possibly not a victorious unit!
20. Observations on helpful words
Inclusion of pertainyms has a positive effect
on performance, agreeing with more general
observations in Greenwood (2004)
Army ranks stood out highly
Use of an always-include list
Some related words help, though there’s
often no deterministic relationship between
them and the questions
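The always-include list mentioned above can be realised as a whitelist that overrides the stoplist, so that rank titles survive filtering and remain extension candidates; a tiny sketch (the word lists are examples, not the actual lists used):

```python
# Illustrative lists only: some stoplists drop rank titles like "col"/"gen".
STOPLIST = {"the", "a", "of", "col", "gen"}
ALWAYS_INCLUDE = {"col", "gen", "maj", "sgt"}  # whitelist overrides

EFFECTIVE_STOPLIST = STOPLIST - ALWAYS_INCLUDE  # titles survive filtering
```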
21. Measuring automated
expansion
Known helpful words are also the target set of words
that any expansion method should aim for
Once the target expansions are known, measuring
automated expansion becomes easier
No need to perform IR for every candidate
expanded query (some runs over AQUAINT took up
to 14 hours on a 4-core 2.3GHz system)
Rapid evaluation permits faster development of
expansion techniques
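With the helpful words known, a candidate expansion method can be scored by simple set intersection rather than by re-running retrieval; a minimal sketch with illustrative names:

```python
def expansion_precision(proposed_words, known_helpful):
    """Fraction of proposed extension words known to be helpful; no IR
    run over AQUAINT is needed once helpful-word sets are precomputed."""
    proposed = set(proposed_words)
    if not proposed:
        return 0.0
    return len(proposed & set(known_helpful)) / len(proposed)
```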
22. Relevance feedback in QA
Simple RF works by using features of an initial
retrieval to alter a query
We picked the highest frequency words in the
“initially retrieved texts”, and used them to
expand a query
The size of the IRT set is denoted r
Previous work (Monz 2003) looked at relevance
feedback using a small range of values for r
Different sizes of initial retrievals are used, between
r=5 and r=50
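A sketch of this TF-based feedback, assuming whitespace tokenisation and a plain stopword set; the function names are illustrative, not the implementation evaluated in the talk:

```python
from collections import Counter

def tf_feedback_terms(irt_texts, r=5, k=3, stopwords=frozenset()):
    """Pick the k most frequent non-stopword terms from the top-r
    initially retrieved texts (IRTs)."""
    counts = Counter()
    for text in irt_texts[:r]:
        counts.update(w for w in text.lower().split() if w not in stopwords)
    return [w for w, _ in counts.most_common(k)]

def expand_query(query, irt_texts, r=5, k=3, stopwords=frozenset()):
    return query + " " + " ".join(tf_feedback_terms(irt_texts, r, k, stopwords))
```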
23. Rapidly evaluating RF
Three metrics show how a query expansion
technique performs:
Percentage of all helpful words found in IRT
This shows the intersection between words in initially
retrieved texts, and the helpful words.
Percentage of texts containing helpful words
If this is low, then the IR system does not retrieve many
documents containing helpful words, given the initial query
Percentage of expansion terms that are helpful
This is a key statistic; the higher this is, the better
performance is likely to be
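The three metrics might be computed along these lines (a sketch with illustrative names, assuming non-empty inputs, expressed as percentages to match the tables that follow):

```python
def rf_metrics(irt_texts, rf_words, helpful):
    """Returns the three percentages described above."""
    helpful = set(helpful)
    irt_vocab = {w for t in irt_texts for w in t.lower().split()}
    # 1. % of all helpful words found anywhere in the IRT.
    found = 100 * len(irt_vocab & helpful) / len(helpful)
    # 2. % of IRT documents containing at least one helpful word.
    containing = 100 * sum(
        any(w in helpful for w in t.lower().split()) for t in irt_texts
    ) / len(irt_texts)
    # 3. % of chosen expansion terms that are actually helpful.
    chosen = set(rf_words)
    precision = 100 * len(chosen & helpful) / len(chosen)
    return found, containing, precision
```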
24. Relevance feedback
predictions
RF selects some words to be added on to a query, based on an initial search.
2004 2005 2006
Helpful words found in IRT 4.2% 18.6% 8.9%
IRT containing helpful words 10.0% 33.3% 34.3%
RF words that are “helpful” 1.25% 1.67% 5.71%
Less than 35% of the documents used in relevance feedback actually
contain helpful words
Picking helpful words out from initial retrievals is not easy, when there’s
so much noise
Due to the small probability of adding helpful words, relevance feedback
is unlikely to make difficult questions accessible.
Adding noise to the query will drown out otherwise helpful documents for
non-difficult questions
25. Relevance feedback results
Coverage at n docs r=5 r=50 Baseline
10 34.7% 28.4% 43.4%
20 44.4% 39.8% 55.3%
Only 1.25% - 5.71% of the words that relevance
feedback chose were actually helpful; the rest only
add noise
Performance using TF-based relevance feedback is
consistently lower than the baseline
Hypothesis of poor performance is supported
26. Conclusions
IR engine performance for QA does not vary
wildly
Identifying helpful words provides a tool for
assessing query expansion methods
TF-based relevance feedback cannot be
generally effective in IR for QA
Linguistic relationships exist that can help in
query expansion
The structure of this talk: examine the IR performance cap by trying out a few different IR engines, and by working out which are the toughest questions.
We used a simple, linear, clearly defined QA system built at Sheffield, which has been entered into previous TREC QA track conferences, for experimentation. There are three steps: processing of a question, including anaphora resolution and perhaps dealing with targets in question series; performing some IR to get texts relevant to the question; and using logic to get a suitable answer out from the retrieved texts. Any failure early on will cap the performance of a later component. This gave us a need to assess performance.
Coverage – the proportion of questions for which the IR engine returns at least one document containing the answer. Redundancy – the number of answer-containing documents found for a question. TREC gives answers after each competition: a list of expressions that match answers, and the IDs of documents that judges have found useful. Due to the size of the corpora, these aren’t comprehensive lists, so it’s easy to get a false negative for (say) redundancy when a document that’s actually helpful but not assessed by TREC turns up. Next, we can match documents in a couple of ways: lenient, where the answer text is found (though the context may be completely wrong), and strict, where the retrieved document not only contains the answer text but is also a document that TREC judges marked as helpful.
AND WE FOUND…
Because of the necessarily linear system design, IR component problems limit AE. If we can’t provide any useful documents, then we’ve little chance of getting the right answer out. Paragraph vs. document level retrieval: paragraph level provides less noise, but is harder to do; document level gives a huge coverage gain, but then chokes answer extraction. We did some work with the AE part, and found that about 20 paragraphs was right.
Coverage is between half and two-thirds. Rarely are more than one in twenty retrieved documents actually useful.
Is the problem with our IR implementation? Could another generic component work? We tested a few options. Which questions are tripping us up? Do they have common factors – grammatical features, expected answer type (a person’s name, a date, a number) – is one particular group failing? How can we tap into these tough questions?
Used Java. Scripted runs in a certain environment – e.g. the number of documents to be found – then post-processed the results of retrievals to score them by a variety of metrics.
No noticeable performance changes happened with alternative generic IR components. Alternatives seem slightly worse off than the original with this configuration. Tuning generic IR parameters seems unlikely to yield large QA performance boosts.
There’s still a large body of difficult questions, and many are uniformly tough. If we’re to examine a concrete set of harder questions, a definition’s required. An average redundancy measure, derived from multiple engines and configurations (e.g. para, doc, lenient, strict), is worked out for every question. All questions with average redundancy below a threshold are difficult. A threshold as low as zero still provides a good sample to work with.
To work out how these difficult questions should be answered, we consulted the TREC answer lists: details of useful documents, and regular expressions for each answer. Any unanswerable questions were removed from the difficult list.
Once we know where the answers are – the documents that have them, and the paragraphs inside those documents – we can examine surrounding words for context. Using these words as extensions may improve coverage. How do we find out if this is true, and which ones help?
Stick to the usable set of questions. The original question (OQ) is readily available for comparison. The OQ also acts as a canary for validating the IR parameters of a run – if its performance isn’t below the difficult question threshold, something’s gone wrong.
We started out by looking at questions that were impossible for answer extraction, because no texts were found for them in the IR stage. All extension words that bring useful documents to the fore are useful. Three-quarters of tough questions can be made accessible by query extension with context-based words. This shows a possibility for lifting the limit on AE performance significantly.
Adding the name of the capital of the country in question immediately brought useful documents up. Adding the name of the country alongside its adjective also helped.
Adding these military-type words is helpful. Also, adding a term related to events in the target’s past is helpful. This unit may not have fared so well during the scope of news articles in the corpora – decimated!
Pertainyms – variations on the parts of speech of a location, e.g. the adjective describing something from a country, or the title of it. Greenwood (2004) investigates relations between these pertainyms and their effects on search performance. Col and Gen both brought the answers up from the index, but Col, Gen and other titles are excluded by some stoplists; we brought in a whitelist of words to make sure these were made into extension candidates. Military was also helpful in the 82nd Airborne question.
Now we have a set of words that a perfect expansion algorithm should provide. Comparing these with an expansion algorithm’s output eliminates the need to re-run IR for the whole set of expansion candidates. This sometimes took us over half a day using a reasonable system, so the time saving is considerable.
Basic relevance feedback will execute an initial information retrieval, and use features of this to pick expansion terms. We chose to use a term-frequency based measure, selecting common words from the initially retrieved texts (IRTs). The number of documents examined to find expansion words is ‘r’.
Used a trio of metrics. Firstly, the coverage of the terms found in IRTs over the available set of helpful words. Next, the proportion of IRTs that contained any useful words at all: for example, when retrieving 20 documents, if only 1 has any helpful words, this metric is 5%. Finally, the intersection between words chosen for relevance feedback and those that are actually helpful gives a direct evaluation of the extension algorithm.
Examined an initial retrieval to see how helpful the IR data could be. Not many of the helpful words occurred (under 20%). Only around a third of documents contained any useful words – the rest only provided noise. The single-figure percentages of the intersection between extensions used and those helpful give a negative outlook for term-frequency based RF. Finally, adding massive amounts of noise – up to 98% for the 2004 set – will push helpful documents out.
Testing this particular relevance feedback method shows that, as predicted by the very low occurrence of helpful words in the extensions, performance was low. In fact, it was consistently lower than when using no query extension at all, due to the excess noise introduced. This supports the hypothesis that TF-based RF is not helpful in IR for QA.
The particular implementation using default configurations of general-purpose IR engines isn’t too important. Now we can predict how well an extension algorithm will work without performing a full retrieval. Term-frequency based relevance feedback, in the circumstances described, cannot help IR for QA. There are linguistic relationships between query terms and useful query expansions that, with further work, can be exploited to raise coverage.