Module 1
Information Retrieval
Topics
 Introduction
 Information versus Data Retrieval
 IR: Past, present, and future
 Basic concepts: The retrieval process
 Logical view of documents
 Modeling: A Taxonomy of IR models
 Ad-hoc retrieval and filtering
 Classic IR models
 Set theoretic
 Algebraic
 Probabilistic IR models
 Models for browsing
Introduction
 Information retrieval (IR) deals with the
representation, storage, organization and access
to information items.
 The representation and organization of the
information items should provide the user with
easy access to the information in which he is
interested.
 Unfortunately, characterization of the user
information need is not a simple problem.
 Web search engines are the most visible IR
applications.
 The user must first translate the request
information into a query which can be processed
by the search engine.
 An information retrieval process begins when a
user enters a query into the system.
 Queries are formal statements of information
needs, for example search strings in web search
engines.
 Given the user query, the key goal of an IR
system is to retrieve information which might be
useful or relevant to the user.
 The emphasis is on the retrieval of information as
opposed to the retrieval of data.
 Data is a raw fact; information is
processed data.
Information versus Data Retrieval
Information Retrieval | Data Retrieval
Retrieves information about a subject | Determines the keywords in the user query and retrieves data
Small errors are likely to go unnoticed | A single erroneous object means total failure
Deals with natural language text | Deals with relational databases
Not always well structured and is semantically ambiguous | Has a well defined structure and semantics
Does not provide a solution to the user of a database system | Provides a solution to the user of a database system
Extracts syntactic and semantic information from the document text and uses this information to match the user information need | Does not solve the problem of retrieving information about a subject
The primary goal is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible | The primary goal is to retrieve the data according to the user query
IR: Past, present, and future
 A typical example of information retrieval is the table of
contents of a book
 Since the volume of information eventually grew
beyond a few books, it became necessary to build
specialized data structures to ensure faster access to
the stored information.
 An old and popular data structure for faster
information retrieval is the index: a collection of
selected words or concepts, each associated with
pointers to the related information.
 In one form or another, indexes are at the core of
every modern information retrieval system.
 They provide faster access to the data and allow the
query processing task to be sped up.
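The index described above can be sketched as an inverted index that maps each word to the documents containing it. This is a minimal sketch, not the structure of any particular system: the function names `build_index` and `lookup` and the toy collection are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def lookup(index, word):
    """Return the ids of documents containing the word (empty list if none)."""
    return index.get(word.lower(), [])


# Hypothetical toy collection, for illustration only.
docs = {
    1: "information retrieval deals with documents",
    2: "data retrieval deals with databases",
    3: "indexes speed up query processing",
}
index = build_index(docs)
print(lookup(index, "retrieval"))  # [1, 2]
print(lookup(index, "query"))      # [3]
```

A lookup touches only the entry for the query word instead of scanning every document, which is where the speed-up comes from.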
 Two different views of the IR problem: a
computer-centered one and a human-centered
one.
 In the computer-centered view, the IR problem
consists mainly of building up efficient indexes,
processing user queries with high performance,
and developing ranking algorithms which improve
the 'quality' of the answer set.
 In the human-centered view, the IR problem
consists mainly of studying the behavior of the
user, of understanding his main needs, and of
determining how such understanding affects the
organization and operation of the retrieval system.
Information Retrieval in the Library
 Libraries were among the first institutions to adopt
IR systems for retrieving information.
 In the first generation, the systems allowed
searches based on author name and title.
 In the second generation, increased search
functionality was added which allowed searching
by subject headings and by keywords.
 In the third generation, which is currently being
deployed, the focus is on improved graphical
interfaces, electronic forms, hypertext features,
and open system architectures.
The Web and Digital Libraries
 Three dramatic and fundamental changes have
occurred due to the advances in modern computer
technology and the boom of the Web.
 First, it became a lot cheaper to have access to various
sources of information.
 Second, the advances in all kinds of digital
communication provided greater access to networks.
 Third, the freedom to post whatever information
someone judges useful has greatly contributed to the
popularity of the Web.
 Fundamentally, low cost, greater access, and
publishing freedom have allowed people to use the
Web as a highly interactive medium.
 Such interactivity allows people to exchange
messages, photos, documents, software, videos, and
Practical Issues
 Security
 Privacy
 Copyright
Basic concepts: The retrieval process
 To describe the retrieval process, we use a simple
and generic software architecture.
 First of all, before the retrieval process can even be
initiated, it is necessary to define the text database.
 This is usually done by the manager of the
database, who specifies the following:
(a) The documents to be used
(b) The operations to be performed on the text
(c) The text model (i.e., the text structure and what
elements can be retrieved)
 The text operations transform the original documents
and generate a logical view of them.
 Once the logical view of the documents is defined,
the database manager (using the DB Manager
Module) builds an index of the text.
 An index is a critical data structure because it allows
fast searching over large volumes of data.
 Given that the document database is indexed, the
retrieval process can be initiated.
 The user first specifies a user need which is then
parsed and transformed by the same text operations
applied to the text.
 Then, query operations might be applied before the
actual query, which provides a system representation
for the user need, is generated.
 The query is then processed to obtain the retrieved
documents.
 Fast query processing is made possible by the index
structure previously built.
 Before being sent to the user, the retrieved
documents are ranked according to a likelihood of
relevance.
 At this point, the user might pinpoint a subset of the
documents seen as definitely of interest and initiate
a user feedback cycle.
 In such a cycle, the system uses the documents
selected by the user to change the query
formulation.
Logical view of documents
 Due to historical reasons, documents in a
collection are frequently represented through a
set of index terms or keywords.
 Such keywords might be extracted directly from
the text of the document or might be specified by
a human subject.
 No matter whether these representative keywords
are derived automatically or generated by a
specialist, they provide a logical view of the
document.
 Modern computers are making it possible to
represent a document by its full set of words. In
this case, we say that the retrieval system adopts
a full text logical view of the documents.
 With very large collections, however, even modern
computers might have to reduce the set of
representative keywords.
 This can be accomplished through the elimination
of stopwords (such as articles and connectives), the
use of stemming (which reduces distinct words to
their common grammatical root), and the
identification of noun groups (which eliminates
adjectives, adverbs, and verbs).
 These operations are called text operations (or
transformations).
 Text operations reduce the complexity of the
document representation and allow moving the
logical view from that of a full text to that of a set
of index terms.
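The text operations listed above can be sketched in a few lines. This is a toy illustration under stated assumptions: the stopword list is deliberately tiny, `stem` is a crude suffix stripper rather than a real stemming algorithm such as Porter's, and noun-group identification is left out because it needs a part-of-speech tagger. The names `STOPWORDS`, `SUFFIXES`, `stem`, and `index_terms` are made up for this example.

```python
import re

# Assumed, deliberately tiny stopword list; real systems use larger ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "or", "are"}

# Toy suffix stripping, NOT a real stemmer such as Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stopwords, stem, and return the set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {stem(t) for t in tokens if t not in STOPWORDS}


print(sorted(index_terms("The user is searching the indexed documents")))
# ['document', 'index', 'search', 'user']
```

The seven words of the input collapse to four index terms, which is the reduction from a full text view to a set of index terms that the text operations are meant to achieve.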
Modeling: A Taxonomy of IR models
 Three models:
 Classic models
 Structured models
 Browsing models
 Three classic models in IR are:
 Boolean: documents and queries are represented as
sets of index terms. Also known as set theoretic.
 Vector: documents and queries are represented as
vectors in a t-dimensional space. Also known as
algebraic.
 Probabilistic: the framework for modeling document and
query representations is based on probability theory.
 Set theoretic
 Fuzzy
 Extended boolean
 Algebraic
 Generalized vector
 Latent semantic indexing
 Neural networks
 Probabilistic
 Inference network
 Belief network
 Two structured models in IR are
 Non-overlapping lists model
 Proximal nodes
 Three models for browsing are:
 Flat
 Structure guided
 Hypertext
Ad-hoc retrieval and filtering
 Ad hoc retrieval
 Standard retrieval task in which the user specifies his
information need through a query which initiates a
search (executed by the information system) for
documents which are likely to be relevant to the user.
 The documents in the collection remain relatively static
while new queries are submitted to the system
 The most common form of user task
 Filtering
 The queries remain relatively static while new
documents come into the system (and leave)
 User profile
 Describing the user’s preferences
 Routing (a variation of filtering that ranks the filtered documents)
 The user profile is compared to the incoming documents
to determine which ones might interest the user.
 E.g., selecting a news article among thousands of
articles which are broadcast each day
 The task of determining which ones are really relevant
is fully reserved to the user.
 The filtered documents can be ranked so that the user
needs to examine only a smaller number of documents.
This variation of filtering is called routing.
 To rank the documents, the vector model is preferred.
 In filtering, the crucial task is not ranking, but the
construction of the user profile.
 An approach for constructing a user profile is to
describe the profile through a set of keywords and to
require the user to provide the necessary keywords.
 The process is to collect information from the user
about his preferences and use this information to
build the user profile.
 In the beginning, the user provides a set of keywords
which describes an initial profile of his preferences.
 As new documents arrive, the system uses this
profile to select documents that are potentially relevant,
and the user indicates which are relevant and which are not.
 The system uses this information to adjust the user
profile description such that it reflects new
preferences.
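The profile-building cycle above can be sketched with a weighted keyword profile. The slides do not specify how the profile is adjusted, so the additive update rule, the threshold, and the names `score`, `select`, and `update` are all assumptions chosen to keep the example short.

```python
from collections import Counter


def score(profile, doc_terms):
    """Sum the profile weights of the keywords that occur in the document."""
    return sum(profile[t] for t in doc_terms)


def select(profile, doc_terms, threshold=1):
    """Pass the document to the user if its score reaches the threshold."""
    return score(profile, doc_terms) >= threshold


def update(profile, doc_terms, relevant, step=1):
    """Assumed rule: raise weights of terms from relevant documents, lower the others."""
    delta = step if relevant else -step
    for t in doc_terms:
        profile[t] += delta


# Initial profile: keywords supplied by the user, each with weight 1.
profile = Counter({"football": 1, "league": 1})

incoming = {"football", "transfer", "rumour"}
print(select(profile, incoming))           # True: shares "football" with the profile

update(profile, incoming, relevant=False)  # the user marks it as not relevant
print(select(profile, incoming))           # False: similar documents are now filtered out
```

The queries (here, the profile) stay in place while documents stream past, which is the reverse of ad hoc retrieval.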
Formal characterization of IR models
 An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the
collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and
queries
4. R(qi, dj) is a ranking function
Classic IR models
 Basic concepts: Each document is described by a set of
representative keywords called index terms
 An index term is a word that represents a document's
main theme.
 It is used to index and summarize the document contents.
 Numerical weights are assigned to index terms to capture
how relevant each term is for describing a document.
 Three classic models: Boolean, vector, probabilistic
 ki : A generic index term
 K : The set of all index terms {k1,…,kt}
 wi,j : A weight associated with index term
ki of a document dj
 For an index term that does not appear in the document, wi,j = 0
 gi : A function that returns the weight associated
with index term ki in any t-dimensional vector, i.e. gi(dj) = wi,j
Boolean model
 Simple retrieval model based on set theory and
Boolean algebra
 Binary decision criterion
 Either relevant or not relevant (no partial match)
 Data retrieval model
 Advantage
 Simplicity
 Disadvantage
 It is not simple to translate an information need into a
Boolean expression
 Exact matching may lead to retrieval of too few or too
many documents
 Formal definition
 For the Boolean model, the index term weights are
all binary, i.e. $w_{i,j} \in \{0,1\}$
 A query q is composed of index terms linked by
three connectives: not, and, or.
 A query is a conventional Boolean expression,
which can be represented as a disjunction of
conjunctive vectors (in DNF)
 $q = k_a \wedge (k_b \vee \neg k_c)$, with $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$,
where each of the components is a binary weighted
vector associated with the tuple $(k_a, k_b, k_c)$
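 The similarity of the document dj to the query q is
defined as
$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ w_{i,j} = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$
where $\vec{q}_{cc}$ is any of the conjunctive components of $\vec{q}_{dnf}$
[Figure: Venn diagram over the index terms Ka, Kb, Kc showing the three conjunctive components (1,1,1), (1,1,0), and (1,0,0) of the query]
 If $sim(d_j, q) = 1$ then the Boolean model predicts
that the document dj is relevant to the query q
 Otherwise the prediction is that the document is not
relevant.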
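The Boolean matching rule can be sketched for the example query q = ka AND (kb OR NOT kc), whose disjunctive normal form over (ka, kb, kc) is (1,1,1) OR (1,1,0) OR (1,0,0). This is a minimal sketch: the names `Q_DNF`, `QUERY_TERMS`, and `sim` are illustrative, and it assumes that only the terms appearing in the query need to be compared.

```python
# Query q = ka AND (kb OR NOT kc), written in DNF over the tuple (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
QUERY_TERMS = ("ka", "kb", "kc")


def sim(doc_terms, q_dnf=Q_DNF, query_terms=QUERY_TERMS):
    """Return 1 if the document's binary weights equal some conjunctive component."""
    weights = tuple(1 if k in doc_terms else 0 for k in query_terms)
    return 1 if weights in q_dnf else 0


print(sim({"ka", "kb"}))  # 1: weights (1, 1, 0) are in the DNF
print(sim({"ka", "kc"}))  # 0: weights (1, 0, 1) are not
print(sim({"kb", "kc"}))  # 0: ka is required
```

The result is always 0 or 1, so the model cannot rank the matching documents or reward a partial match.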
Vector model
 Assign non-binary weights to index terms in queries and in
documents
 Compute the similarity between documents and query =
Sim(dj,q)
 More precise than Boolean model
 The weight wi,j associated with the pair
(ki,dj) is positive and non-binary.
 The index terms in the query are also weighted
 The vector for a document dj is represented by
$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
 $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ is the query vector, where $w_{i,q} \ge 0$ is the weight
associated with the pair (ki,q)
 Term weights are used to compute the degree of
similarity between documents and the user query. It
varies from 0 to +1.
 A document might be retrieved even if it matches the
query only partially.
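 Degree of similarity
[Figure 2.4: The cosine of θ, the angle between the vectors dj and q, is adopted as sim(dj,q)]
$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$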
 Advantages
 Its term-weighting scheme improves retrieval
performance
 Its partial matching strategy allows retrieval of
documents that approximate the query conditions
 Its cosine ranking formula sorts the documents
according to their degree of similarity to the query
 Disadvantage
 The assumption of mutual independence between index
terms
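The cosine ranking of the vector model can be sketched as follows. The weights are hypothetical values over t = 3 index terms, and returning 0.0 for an all-zero vector is an assumed convention, since the cosine is undefined in that case.

```python
import math


def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_d * norm_q)


# Hypothetical weights over t = 3 index terms.
q = [1.0, 1.0, 0.0]
docs = {"d1": [2.0, 2.0, 0.0], "d2": [1.0, 0.0, 1.0], "d3": [0.0, 0.0, 3.0]}

ranking = sorted(docs, key=lambda name: cosine(docs[name], q), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Here d2 shares only one term with the query and is still retrieved, ranked between the full match d1 and the non-match d3, which illustrates partial matching.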
Probabilistic model
 Introduced by Robertson and Sparck Jones, 1976
 Binary independence retrieval (BIR) model
 Idea: Given a user query q, and the ideal answer set R of
the relevant documents, the problem is to specify the
properties for this set
 Assumption (probabilistic principle): the probability of
relevance depends on the query and document
representations only; ideal answer set R should
maximize the overall probability of relevance
 The probabilistic model tries to estimate the probability
that the user will find the document dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)
 Definition
 All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$
 Let R be the set of documents known to be relevant to
query q
 Let $\bar{R}$ be the complement of R
 Let $P(R \mid \vec{d_j})$ be the probability that the document dj is
relevant to the query q
 Let $P(\bar{R} \mid \vec{d_j})$ be the probability that the document dj is
non-relevant to query q
 $\Pr(k_i \mid R)$ stands for the probability that the index
term ki is present in a document randomly selected
from the set R
 $\Pr(\bar{k}_i \mid R)$ stands for the probability that the index
term ki is not present in a document randomly
selected from the set R
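The slides stop at the definitions. Under the usual BIR assumptions (Bayes' rule plus independence of index terms), the ratio P(R | dj) / P(R-bar | dj) is commonly reduced, for ranking purposes, to a sum of log-odds weights over the terms shared by the document and the query. The sketch below implements that standard textbook form, which is not shown in the slides. The function name `bir_score` and the probability estimates are hypothetical.

```python
import math


def bir_score(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum the log-odds weights of the terms shared by document and query."""
    total = 0.0
    for k in doc_terms & query_terms:
        p, u = p_rel[k], p_nonrel[k]  # Pr(k|R) and Pr(k|R-bar), both in (0, 1)
        total += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return total


# Hypothetical estimates; in practice they must be guessed or learned.
p_rel = {"retrieval": 0.8, "index": 0.5}
p_nonrel = {"retrieval": 0.2, "index": 0.5}
query = {"retrieval", "index"}

print(bir_score({"retrieval", "web"}, query, p_rel, p_nonrel))
print(bir_score({"index"}, query, p_rel, p_nonrel))  # 0.0
```

A term that is equally likely in relevant and non-relevant documents, like "index" here, contributes nothing to the score, while a term concentrated in the relevant set contributes a positive weight.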
Models for browsing
Flat Browsing
Structure Guided Browsing
Hypertext Model
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
 
Online train ticket booking system project.pdf
Online train ticket booking system project.pdfOnline train ticket booking system project.pdf
Online train ticket booking system project.pdf
 
Call Girls Chandigarh 🔥 7014168258 🔥 Real Fun With Sexual Girl Available 24/7...
Call Girls Chandigarh 🔥 7014168258 🔥 Real Fun With Sexual Girl Available 24/7...Call Girls Chandigarh 🔥 7014168258 🔥 Real Fun With Sexual Girl Available 24/7...
Call Girls Chandigarh 🔥 7014168258 🔥 Real Fun With Sexual Girl Available 24/7...
 
College Call Girls Kolkata 🔥 7014168258 🔥 Real Fun With Sexual Girl Available...
College Call Girls Kolkata 🔥 7014168258 🔥 Real Fun With Sexual Girl Available...College Call Girls Kolkata 🔥 7014168258 🔥 Real Fun With Sexual Girl Available...
College Call Girls Kolkata 🔥 7014168258 🔥 Real Fun With Sexual Girl Available...
 
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
 
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptxMODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
 
Intuit CRAFT demonstration presentation for sde
Intuit CRAFT demonstration presentation for sdeIntuit CRAFT demonstration presentation for sde
Intuit CRAFT demonstration presentation for sde
 
CSP_Study - Notes (Paul McNeill) 2017.pdf
CSP_Study - Notes (Paul McNeill) 2017.pdfCSP_Study - Notes (Paul McNeill) 2017.pdf
CSP_Study - Notes (Paul McNeill) 2017.pdf
 
Technological Innovation Management And Entrepreneurship-1.pdf
Technological Innovation Management And Entrepreneurship-1.pdfTechnological Innovation Management And Entrepreneurship-1.pdf
Technological Innovation Management And Entrepreneurship-1.pdf
 
High Profile Call Girls Ahmedabad 🔥 7737669865 🔥 Real Fun With Sexual Girl Av...
High Profile Call Girls Ahmedabad 🔥 7737669865 🔥 Real Fun With Sexual Girl Av...High Profile Call Girls Ahmedabad 🔥 7737669865 🔥 Real Fun With Sexual Girl Av...
High Profile Call Girls Ahmedabad 🔥 7737669865 🔥 Real Fun With Sexual Girl Av...
 
Butterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdfButterfly Valves Manufacturer (LBF Series).pdf
Butterfly Valves Manufacturer (LBF Series).pdf
 
SPICE PARK JUL2024 ( 6,866 SPICE Models )
SPICE PARK JUL2024 ( 6,866 SPICE Models )SPICE PARK JUL2024 ( 6,866 SPICE Models )
SPICE PARK JUL2024 ( 6,866 SPICE Models )
 
Update 40 models( Solar Cell ) in SPICE PARK(JUL2024)
Update 40 models( Solar Cell ) in SPICE PARK(JUL2024)Update 40 models( Solar Cell ) in SPICE PARK(JUL2024)
Update 40 models( Solar Cell ) in SPICE PARK(JUL2024)
 
Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Call Girls Madurai 8824825030 Escort In Madurai service 24X7Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Call Girls Madurai 8824825030 Escort In Madurai service 24X7
 

Information retrieval introduction

  • 2. Topics  Introduction  Information versus Data Retrieval  IR: Past, present, and future  Basic concepts: The retrieval process  Logical view of documents  Modeling: A Taxonomy of IR models  Ad-hoc retrieval and filtering  Classic IR models  Set theoretic  Algebraic  Probabilistic IR models  Models for browsing
  • 3. Introduction  Information retrieval (IR) deals with the representation, storage, organization and access to information items.  The representation and organization of the information items should provide the user with easy access to the information in which he is interested.  Unfortunately, characterization of the user information need is not a simple problem.  Web search engines are the most visible IR applications.  The user must first translate the information need into a query which can be processed by the search engine.
  • 4.  An information retrieval process begins when a user enters a query into the system.  Queries are formal statements of information needs, for example search strings in web search engines.  Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user.  The emphasis is on the retrieval of information as opposed to the retrieval of data.  Data is a raw fact; information is processed data.
  • 5. Information versus Data Retrieval
      Information Retrieval | Data Retrieval
      Retrieves information about a subject | Determines the keywords in the user query and retrieves data
      Small errors are likely to go unnoticed | A single erroneous object means total failure
      Deals with natural language text | Deals with relational databases
      Not always well structured and is semantically ambiguous | Has a well-defined structure and semantics
      Does not provide a solution to the user of a database system | Provides a solution to the user of a database system
      Extracts syntactic and semantic information from the document text and uses this information to match the user information need | Does not solve the problem of retrieving information about a subject
      The primary goal is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible | The primary goal is to retrieve the data according to the user query
  • 6. IR: Past, present, and future  A typical example of information retrieval is the table of contents of a book  Since the volume of information eventually grew beyond a few books, it became necessary to build specialized data structures to ensure faster access to the stored information.  An old and popular data structure for faster information retrieval is the index: a collection of selected words or concepts with associated pointers to the related information.  In one form or another, indexes are at the core of every modern information retrieval system.  They provide faster access to the data and allow the query processing task to be sped up.
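The idea of an index as "selected words with pointers to the related information" can be sketched as a small inverted index. This is a minimal illustration, not the data structure of any particular system; the function name `build_index` and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Sorted lists make the "pointers" easy to read and to merge later.
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical three-document collection.
docs = {
    1: "information retrieval deals with documents",
    2: "data retrieval deals with records",
    3: "an index speeds up retrieval",
}
index = build_index(docs)
print(index["retrieval"])
print(index["index"])
```

Looking up a word is then a single dictionary access instead of a scan over every document, which is the speed-up the slide refers to.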
  • 7.  Two different views of the IR problem: a computer-centered one and a human-centered one.  In the computer-centered view, the IR problem consists mainly of building up efficient indexes, processing user queries with high performance, and developing ranking algorithms which improve the `quality' of the answer set.  In the human-centered view, the IR problem consists mainly of studying the behavior of the user, of understanding his main needs, and of determining how such understanding affects the organization and operation of the retrieval system
  • 8. Information Retrieval in the Library  Libraries were among the first institutions to adopt IR systems for retrieving information.  In the first generation, the systems allowed searches based on author name and title.  In the second generation, increased search functionality was added which allowed searching by subject headings and by keywords.  In the third generation, which is currently being deployed, the focus is on improved graphical interfaces, electronic forms, hypertext features, and open system architectures.
  • 9. The Web and Digital Libraries  Three dramatic and fundamental changes have occurred due to the advances in modern computer technology and the boom of the Web.  First, it became a lot cheaper to have access to various sources of information.  Second, the advances in all kinds of digital communication provided greater access to networks.  Third, the freedom to post whatever information someone judges useful has greatly contributed to the popularity of the Web.  Fundamentally, low cost, greater access, and publishing freedom have allowed people to use the Web as a highly interactive medium.  Such interactivity allows people to exchange messages, photos, documents, software, videos, and
  • 10. Practical Issues  Security  Privacy  Copyright
  • 11. Basic concepts: The retrieval process  To describe the retrieval process, we use a simple and generic software architecture.  First of all, before the retrieval process can even be initiated, it is necessary to define the text database.  This is usually done by the manager of the database, who specifies the following: (a) The documents to be used (b) The operations to be performed on the text (c) The text model (i.e., the text structure and what elements can be retrieved). The text operations transform the original documents and generate a logical view of them.  Once the logical view of the documents is defined, the database manager (using the DB Manager Module) builds an index of the text.
  • 12.  An index is a critical data structure because it allows fast searching over large volumes of data.  Given that the document database is indexed, the retrieval process can be initiated.  The user first specifies a user need which is then parsed and transformed by the same text operations applied to the text.  Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated.  The query is then processed to obtain the retrieved documents.  Fast query processing is made possible by the index structure previously built.  Before being sent to the user, the retrieved documents are ranked according to a likelihood of relevance.
  • 13.  At this point, the user might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle.  In such a cycle, the system uses the documents selected by the user to change the query formulation.
  • 14. Logical view of documents  Due to historical reasons, documents in a collection are frequently represented through a set of index terms or keywords.  Such keywords might be extracted directly from the text of the document or might be specified by a human subject.  No matter whether these representative keywords are derived automatically or generated by a specialist, they provide a logical view of the document.  Modern computers are making it possible to represent a document by its full set of words. In this case, we say that the retrieval system adopts a full text logical view of the documents.
  • 15.  With very large collections, however, even modern computers might have to reduce the set of representative keywords.  This can be accomplished through the elimination of stopwords (such as articles and connectives), the use of stemming (which reduces distinct words to their common grammatical root), and the identification of noun groups (which eliminates adjectives, adverbs, and verbs).  These operations are called text operations (or transformations).  Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to that of a set of index terms.
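The text operations named above (stopword elimination and stemming) can be sketched as follows. This is a toy illustration under stated assumptions: the stopword set is a tiny hand-picked sample, and `crude_stem` is a naive suffix stripper invented for the example, not a real stemming algorithm such as Porter's.

```python
# Assumed, deliberately tiny stopword list (articles and connectives).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is", "are"}
# Assumed suffix list for the toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce full text to a sorted set of stemmed, non-stopword index terms."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    return sorted({crude_stem(w) for w in words if w and w not in STOPWORDS})


terms = index_terms("The indexing of documents and the ranking of indexed documents.")
print(terms)
```

Ten words of full text collapse to three index terms here, which is the reduction in representation size that the slide describes.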
  • 17. Modeling: A Taxonomy of IR models  Three models:  Classic models  Structured models  Browsing models  Three classic models in IR are :  Boolean: documents and queries are represented as sets of index terms. Also known as set theoretic.  Vector: documents and queries are represented as vectors in a t-dimensional space. Also known as algebraic.  Probabilistic: the framework for modeling document and query representations is based on probability theory.  Set theoretic  Fuzzy  Extended boolean
  • 18.  Algebraic  Generalized vector  Latent semantic indexing  Neural networks  Probabilistic  Inference network  Belief network  Two structured models in IR are  Non-overlapping lists model  Proximal nodes  Three models for browsing are:  Flat  Structure guided  Hypertext
  • 20. Ad-hoc retrieval and filtering  Ad hoc retrieval  Standard retrieval task in which the user specifies his information need through a query which initiates a search (executed by the information system) for documents which are likely to be relevant to the user.  The documents in the collection remain relatively static while new queries are submitted to the system  The most common form of user task  Filtering  The queries remain relatively static while new documents come into the system (and leave)  User profile  Describes the user's preferences  Routing (a variation of filtering that ranks the filtered documents)
  • 23.  The user profile is compared to the incoming documents to determine which ones might interest the user.  E.g., selecting a news article among the thousands of articles which are broadcast each day  The task of determining which ones are really relevant is fully reserved to the user.  The filtered documents can also be ranked, so that the user needs to examine a smaller number of documents. This variation of filtering is called routing.  To rank the documents, the vector model is preferred.  In filtering, the crucial task is not ranking but the construction of the user profile.  One approach to constructing a user profile is to describe the profile through a set of keywords and to require the user to provide the necessary keywords.
  • 24.  The process is to collect information from the user about his preferences and use this information to build the user profile.  In the beginning, the user provides a set of keywords which describe an initial profile of his preferences.  As new documents arrive, the system uses this profile to select documents, and the user indicates which of them are relevant and which are not relevant.  The system uses this information to adjust the user profile description so that it reflects the new preferences.
  • 25. Formal characterization of IR models  An IR model is a quadruple [D, Q, F, R(q_i, d_j)] where 1. D is a set of logical views for the documents in the collection 2. Q is a set of logical views for the user queries 3. F is a framework for modeling documents and queries 4. R(q_i, d_j) is a ranking function that associates a real number with a query q_i in Q and a document d_j in D
  • 26. Classic IR models  Basic concepts: Each document is described by a set of representative keywords called index terms  An index term is a word that represents a document's main theme.  It is used to index and summarize the document contents.  Numerical weights are assigned to index terms to capture their differing relevance for describing a document  Three classic models: Boolean, vector, probabilistic  ki : A generic index term  K : The set of all index terms {k1,…,kt}  wi,j : A weight associated with index term ki of a document dj. For an index term that does not appear in the document, wi,j=0  gi: A function that returns the weight associated with index term ki in a t-dimensional vector, i.e., gi(dj) = wi,j
  • 27. Boolean model  Simple retrieval model based on set theory and Boolean algebra  Binary decision criterion  Either relevant or not relevant (no partial match)  Data retrieval model  Advantage  Simplicity  Disadvantage  It is not simple to translate an information need into a Boolean expression  Exact matching may lead to retrieval of too few or too many documents
  • 28.  Formal definition  For the Boolean model, the index term weights are all binary, i.e. $w_{i,j} \in \{0,1\}$  A query q is composed of index terms linked by three connectives: not, and, or.  A query is a conventional Boolean expression, which can be represented as a disjunction of conjunctive vectors (in DNF)  $q = k_a \wedge (k_b \vee \neg k_c)$, so $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$, where each of the components is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$  The similarity of the document $d_j$ to the query q is defined as
      $sim(d_j, q) = 1$ if $\exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ w_{i,j} = g_i(\vec{q}_{cc}))$, and $sim(d_j, q) = 0$ otherwise.
      [Figure: the three conjunctive components (1,1,1), (1,1,0), (1,0,0) of the query over the sets Ka, Kb, Kc]
  • 29.  If $sim(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is relevant to the query q  Otherwise the prediction is that the document is not relevant.
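The Boolean matching rule can be sketched for the example query whose DNF is (1,1,1), (1,1,0), (1,0,0) over the tuple (ka, kb, kc). This is a minimal illustration that assumes the index vocabulary consists of only those three terms; the names `binary_vector` and `sim` and the four sample documents are hypothetical.

```python
# Index terms, in the order used by the query tuple (ka, kb, kc).
TERMS = ("ka", "kb", "kc")
# Conjunctive components of the query ka AND (kb OR NOT kc) in DNF.
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}


def binary_vector(doc_terms):
    """Binary weights: 1 if the index term occurs in the document, else 0."""
    return tuple(1 if term in doc_terms else 0 for term in TERMS)


def sim(doc_terms):
    """Return 1 if the document matches some conjunctive component, else 0."""
    return 1 if binary_vector(doc_terms) in Q_DNF else 0


# Hypothetical documents, each given as its set of index terms.
docs = {
    "d1": {"ka", "kb"},
    "d2": {"ka", "kc"},
    "d3": {"kb", "kc"},
    "d4": {"ka"},
}
results = {name: sim(terms) for name, terms in docs.items()}
print(results)
```

Note that the outcome is strictly 0 or 1: d2 contains ka but is rejected outright, which illustrates the lack of partial matching listed as a disadvantage of the model.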
  • 30. Vector model  Assign non-binary weights to index terms in queries and in documents  Compute the similarity between documents and query, sim(d_j, q)  More precise than the Boolean model  The weight $w_{i,j}$ associated with the pair $(k_i, d_j)$ is positive and non-binary.  The index terms in the query are also weighted  The vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$  $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ is the query vector, where $w_{i,q} \ge 0$ is the weight associated with the pair $(k_i, q)$  Term weights are used to compute the degree of similarity between documents and the user query. It varies from 0 to +1.
  • 31.  A document might be retrieved even if it matches the query only partially.  Degree of similarity: Figure 2.4, the cosine of $\theta$ (the angle between $\vec{d}_j$ and $\vec{q}$) is adopted as sim(d_j, q)
      $sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\ \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$
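The cosine measure can be sketched directly from the weight vectors. This is a minimal illustration: the weights below are made-up numbers rather than tf-idf values, and returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math


def cosine_sim(d, q):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_d * norm_q)


# Hypothetical weights over three index terms (k1, k2, k3).
q = (1.0, 1.0, 0.0)
docs = {
    "d1": (2.0, 2.0, 0.0),  # same direction as the query
    "d2": (1.0, 0.0, 1.0),  # shares only k1 with the query
    "d3": (0.0, 0.0, 3.0),  # shares no term with the query
}
ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)
print(ranking)
```

Unlike the Boolean model, d2 receives an intermediate score for its partial match, so the documents can be sorted by degree of similarity.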
  • 34.  Advantages  Its term-weighting scheme improves retrieval performance  Its partial matching strategy allows retrieval of documents that approximate the query conditions  Its cosine ranking formula sorts the documents according to their degree of similarity to the query  Disadvantage  The assumption of mutual independence between index terms
  • 35. Probabilistic model  Introduced by Robertson and Sparck Jones, 1976  Binary independence retrieval (BIR) model  Idea: Given a user query q, and the ideal answer set R of the relevant documents, the problem is to specify the properties for this set  Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; ideal answer set R should maximize the overall probability of relevance  The probabilistic model tries to estimate the probability that the user will find the document dj relevant, using the ratio P(dj relevant to q)/P(dj nonrelevant to q)
  • 36.  Definition  All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$  Let R be the set of documents known to be relevant to query q  Let $\bar{R}$ be the complement of R  Let $P(R|d_j)$ be the probability that the document $d_j$ is relevant to the query q  Let $P(\bar{R}|d_j)$ be the probability that the document $d_j$ is non-relevant to query q
  • 39.  $\Pr(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set R  $\Pr(\bar{k}_i|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set R
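The relevance ratio from the model's definition can be expanded into a term-by-term score. The block below is a sketch of the standard binary independence derivation as given in IR textbooks; it is not taken from the slides themselves. It assumes that index terms are independent, that $P(R)$ and $P(\bar{R})$ are the same for every document, and that factors constant across documents can be dropped.

```latex
\begin{align*}
sim(d_j, q) &= \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}
  = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})}
  \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \\
&\sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j}
  \left( \log \frac{\Pr(k_i \mid R)}{1 - \Pr(k_i \mid R)}
       + \log \frac{1 - \Pr(k_i \mid \bar{R})}{\Pr(k_i \mid \bar{R})} \right)
\end{align*}
```

Here $1 - \Pr(k_i \mid R)$ equals $\Pr(\bar{k}_i \mid R)$, the probability that $k_i$ is absent from a document drawn from R, so only terms that occur in both the query and the document contribute to the sum.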