尊敬的 微信汇率:1円 ≈ 0.046166 元 支付宝汇率:1円 ≈ 0.046257元 [退出登录]
SlideShare a Scribd company logo
Special Topics in Computer ScienceSpecial Topics in Computer Science
Advanced Topics in Information RetrievalAdvanced Topics in Information Retrieval
Lecture 7Lecture 7 (book chapter 9)(book chapter 9)::
Parallel and Distributed IRParallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: ConclusionsPrevious Chapter: Conclusions
 How to accelerate search? Same results as sequential
 Ideas:
 Quick-and-dirty rejection of bad objects, 100% recall
 Fast data structure for search (based on clustering)
 Careful check of all found candidates
 Solution: mapping into fewer-D feature space
 Condition: lower-bounding of the distance
 Assumption: skewed spectrum distribution
 Few coefficients concentrate energy, rest are less important
Previous Chapter: Research topicsPrevious Chapter: Research topics
 Object detection (pattern and image recognition)
 Automatic feature selection
 Spatial indexing data structures (more than 1D)
 New types of data.
 What features to select? How to determine them?
 Mixed-type data (e.g., webpages, or images with
sound and description)
 What clustering/IR methods are better suited for
what features? (What features for what methods?)
 Similar methods in data mining, ...
The problemThe problem
 Very large document collections
 Google: 4,000,000,000 pages
 Slow response?
 Solution: parallel computing
 Google: 10,000 computers
Parallel architecturesParallel architectures
Data stream
Single Multiple
Instructionstream
Single
SISD
classical
SIMD
simple
Multiple
MISD
(rare)
MIMD
many SISD
MIMD architectureMIMD architecture
 The most common
 Can be
 tightly coupled
 loosely coupled
 Distributed
 Many computers interacting via network
 PC Clusters
 Similar to MIMD computers, but greater cost of
communication
 very loosely coupled
 More coarse-grained programs
Performance improvementPerformance improvement
Time: speedup S
 Ideally, N times (number of processors)
 In practice impossible
 The problem does not decompose into N equal parts
 Communication and control overhead
 < 1 / f, where f is the largest separable fraction of the
problem
Cost
 Per processor: S / N
Two approaches to parallelismTwo approaches to parallelism
 Build new algorithms
 E.g., neural nets
 Naturally parallel
 Problem: to define the retrieval task
 Adapt the existing techniques to parallelism
 Allows relying on well-studied approaches
 We will consider this option
Ways to use parallelismWays to use parallelism
 Multitasking
 N search engines
 Good for processing many queries
Problems:
 A single query is not speeded up
 Bottleneck: disk access (index)
 Possible solution: replicating (part of) data. RAIDs
 Parallel algorithms
 IR = data. Main question: how to partition the data
 Document / index term matrix
(terms can be LSI dimensions, signature bits, etc)
Possible partitioningsPossible partitionings
 Horizontal: document partitioning. Union of results
 Vertical: term partitioning. Basically, intersect results
Inverted files: Logical partitioningInverted files: Logical partitioning
 Logical vs. physical document partitioning
 Logical: for each term, use pointers into inverted file data for
each processor, to indicate its portion
Inverted files: Logical partitioningInverted files: Logical partitioning
Construction and updatingConstruction and updating
 Also parallel
Construction
 Assign docs to processors
 Order docs such that each processor has an interval
 Process in parallel
 Merge. Each piece is ordered already
Inverted files:Inverted files:
Physical document partitioningPhysical document partitioning
 Several separate collections, one per processor
 Separate indices
 Then the lists are merged (they are already ordered)
 Priority queue is used
 The result is not sorted; Insertion is quick
 The maximal element can be found quickly
 First k elements can be found rather quickly
 Details in the book
 Consistent scores are needed
 Global statistics is needed. Can be computed at index
time
Logical or physical partitioning?Logical or physical partitioning?
 Logical requires less communication
 Faster
 Physical is more flexible. Simpler implementation
 Simpler conversion of existing systems
Inverted files:Inverted files: Term partitioningTerm partitioning
 Each processor processes a part of the inverted file
 The results are intersected (for AND)
 (or as appropriate for Boolean operations, OR and NOT)
 When term distribution in user queries is skewed,
then document partitioning is better
 When uniform, term partitioning is better.
 Twice for long queries, 5 – 10 times for short (Web-like)
Suffix arraysSuffix arrays
 Array construction can be parallelized
 merges are parallel
 Document partitioning is applied straightforwardly
 Each processor maintains its own suffix array
 Term partitioning can be applied
 Each processor owns a branch of the tree (lexicographic
interval)
 Bottleneck: all processors need access to the entire text
Signature filesSignature files
 Document partitioning: straightforward
 Create query signature, distribute to each processor
 Merge results (using Boolean operations if needed)
 Term partitioning: shorter signatures
 Merging and eliminating false drops is slow
 This method is not recommended
SIMD computersSIMD computers
 Single Instruction, Multiple data
 Uncommon
 Good for simple operations
 Bit operations in signature files
 Details in the book
 Ranking is supported in hardware in some computers
 If signature file does not fit into memory, can be
processed in batches
 I/O overhead
 Use multiple queries with the same batch
 This improves throughput, but not response time
…… SIMD computersSIMD computers
 Inverted files are difficult to adapt to SIMD
 The inverted file is restructured
 Details in the book
Distributed IRDistributed IR
 MIMD with
 Slow communication
 Not all nodes are used for a given query
 Encryption issues
 Document partitioning is usually used
 Term partitioning imposes greater communication
overhead
 Document clustering can be useful (to distribute docs
by processors)
 Index clusters and then search only the best ones
 Another approach: use training queries, then similarity of
the user query to these
Research topicsResearch topics
 How to evaluate the speedup
 New algorithms
 Adaptation of existing algorithms
 Merging the results is a bottleneck
 Meta search engines
 Creating large collections with judgements
 Is recall important?
ConclusionsConclusions
 Parallel computing can improve
 response time for each query and/or
 throughput: number of queries processed with same speed
 Document partitioning is simple
 good for distributed computing
 Term partitioning is good for some data structures
 Distributed computing is MIMD computing with slow
communication
 SIMD machines are good for Signature files
 Both are out of favor now
Thank you!
Till May 17? 18?, 6 pm

More Related Content

What's hot

4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
Krish_ver2
 
Signature files
Signature filesSignature files
Signature files
Deepali Raikar
 
Web content mining
Web content miningWeb content mining
Web content mining
Akanksha Dombe
 
Information retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of irInformation retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of ir
Vaibhav Khanna
 
Matching techniques
Matching techniquesMatching techniques
Matching techniques
Nagpalkirti
 
System models in distributed system
System models in distributed systemSystem models in distributed system
System models in distributed system
ishapadhy
 
GOOGLE FILE SYSTEM
GOOGLE FILE SYSTEMGOOGLE FILE SYSTEM
GOOGLE FILE SYSTEM
JYoTHiSH o.s
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
advanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelismadvanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelism
Pankaj Kumar Jain
 
World wide web architecture presentation
World wide web architecture presentationWorld wide web architecture presentation
World wide web architecture presentation
ImMe Khan
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
Pier Luca Lanzi
 
Web crawler
Web crawlerWeb crawler
Web crawler
poonamkenkre
 
Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
Kavya Barnadhya Hazarika
 
Database , 4 Data Integration
Database , 4 Data IntegrationDatabase , 4 Data Integration
Database , 4 Data Integration
Ali Usman
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
ssbd6985
 
message passing vs shared memory
message passing vs shared memorymessage passing vs shared memory
message passing vs shared memory
Hamza Zahid
 
Implementation levels of virtualization
Implementation levels of virtualizationImplementation levels of virtualization
Implementation levels of virtualization
Gokulnath S
 
Communications is distributed systems
Communications is distributed systemsCommunications is distributed systems
Communications is distributed systems
SHATHAN
 
Using prior knowledge to initialize the hypothesis,kbann
Using prior knowledge to initialize the hypothesis,kbannUsing prior knowledge to initialize the hypothesis,kbann
Using prior knowledge to initialize the hypothesis,kbann
swapnac12
 
Semantic Web
Semantic WebSemantic Web
Semantic Web
Adarsh Kumar Yadav
 

What's hot (20)

4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Signature files
Signature filesSignature files
Signature files
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Information retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of irInformation retrieval 14 fuzzy set models of ir
Information retrieval 14 fuzzy set models of ir
 
Matching techniques
Matching techniquesMatching techniques
Matching techniques
 
System models in distributed system
System models in distributed systemSystem models in distributed system
System models in distributed system
 
GOOGLE FILE SYSTEM
GOOGLE FILE SYSTEMGOOGLE FILE SYSTEM
GOOGLE FILE SYSTEM
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
advanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelismadvanced computer architesture-conditions of parallelism
advanced computer architesture-conditions of parallelism
 
World wide web architecture presentation
World wide web architecture presentationWorld wide web architecture presentation
World wide web architecture presentation
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
 
Database , 4 Data Integration
Database , 4 Data IntegrationDatabase , 4 Data Integration
Database , 4 Data Integration
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
message passing vs shared memory
message passing vs shared memorymessage passing vs shared memory
message passing vs shared memory
 
Implementation levels of virtualization
Implementation levels of virtualizationImplementation levels of virtualization
Implementation levels of virtualization
 
Communications is distributed systems
Communications is distributed systemsCommunications is distributed systems
Communications is distributed systems
 
Using prior knowledge to initialize the hypothesis,kbann
Using prior knowledge to initialize the hypothesis,kbannUsing prior knowledge to initialize the hypothesis,kbann
Using prior knowledge to initialize the hypothesis,kbann
 
Semantic Web
Semantic WebSemantic Web
Semantic Web
 

Viewers also liked

Presentation parallelsystem
Presentation parallelsystemPresentation parallelsystem
Presentation parallelsystem
cegonsoft1999
 
Centralized vs distrbution system
Centralized vs distrbution systemCentralized vs distrbution system
Centralized vs distrbution system
zirram
 
Centralised and distributed databases
Centralised and distributed databasesCentralised and distributed databases
Centralised and distributed databases
Forrester High School
 
Cab booking system india
Cab booking system indiaCab booking system india
Cab booking system india
Custom Soft
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
Sudarsun Santhiappan
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Jewel Refran
 
Parallel and Distributed System IEEE 2014 Projects
Parallel and Distributed System IEEE 2014 ProjectsParallel and Distributed System IEEE 2014 Projects
Parallel and Distributed System IEEE 2014 Projects
Vijay Karan
 
Parallel Database
Parallel DatabaseParallel Database
Parallel Database
VESIT/University of Mumbai
 

Viewers also liked (8)

Presentation parallelsystem
Presentation parallelsystemPresentation parallelsystem
Presentation parallelsystem
 
Centralized vs distrbution system
Centralized vs distrbution systemCentralized vs distrbution system
Centralized vs distrbution system
 
Centralised and distributed databases
Centralised and distributed databasesCentralised and distributed databases
Centralised and distributed databases
 
Cab booking system india
Cab booking system indiaCab booking system india
Cab booking system india
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Parallel and Distributed System IEEE 2014 Projects
Parallel and Distributed System IEEE 2014 ProjectsParallel and Distributed System IEEE 2014 Projects
Parallel and Distributed System IEEE 2014 Projects
 
Parallel Database
Parallel DatabaseParallel Database
Parallel Database
 

Similar to Parallel and Distributed Information Retrieval System

SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
San Diego Supercomputer Center
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
butest
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
Yahoo Developer Network
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
ivan provalov
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
Implementing sorting in database systems
Implementing sorting in database systemsImplementing sorting in database systems
Implementing sorting in database systems
unyil96
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
J Singh
 
Chapter 1( intro &amp; overview)
Chapter 1( intro &amp; overview)Chapter 1( intro &amp; overview)
Chapter 1( intro &amp; overview)
MUHAMMAD AAMIR
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
AlgoAnalytics Financial Consultancy Pvt. Ltd.
 
Experimenting With Big Data
Experimenting With Big DataExperimenting With Big Data
Experimenting With Big Data
Nick Boucart
 
Data Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvementsData Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvements
Umair Amjad
 
Grid1
Grid1Grid1
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
Chelle Gentemann
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
IJERA Editor
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
Introduction
IntroductionIntroduction
Introduction
sarojbhavaraju5
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
Renato Lucindo
 
PNUTS
PNUTSPNUTS
Pnuts
PnutsPnuts
Pnuts Review
Pnuts ReviewPnuts Review
Pnuts Review
Ruchika Mehresh
 

Similar to Parallel and Distributed Information Retrieval System (20)

SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Implementing sorting in database systems
Implementing sorting in database systemsImplementing sorting in database systems
Implementing sorting in database systems
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
Chapter 1( intro &amp; overview)
Chapter 1( intro &amp; overview)Chapter 1( intro &amp; overview)
Chapter 1( intro &amp; overview)
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
Experimenting With Big Data
Experimenting With Big DataExperimenting With Big Data
Experimenting With Big Data
 
Data Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvementsData Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvements
 
Grid1
Grid1Grid1
Grid1
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Introduction
IntroductionIntroduction
Introduction
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 
PNUTS
PNUTSPNUTS
PNUTS
 
Pnuts
PnutsPnuts
Pnuts
 
Pnuts Review
Pnuts ReviewPnuts Review
Pnuts Review
 

Recently uploaded

Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...
Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...
Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...
simrangupta87541
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
drshikhapandey2022
 
BBOC407 Module 1.pptx Biology for Engineers
BBOC407  Module 1.pptx Biology for EngineersBBOC407  Module 1.pptx Biology for Engineers
BBOC407 Module 1.pptx Biology for Engineers
sathishkumars808912
 
Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Call Girls Madurai 8824825030 Escort In Madurai service 24X7Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Poonam Singh
 
❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...
❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...
❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...
hotchicksescort
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
gapboxn
 
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptxMODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
NaveenNaveen726446
 
Better Builder Magazine, Issue 49 / Spring 2024
Better Builder Magazine, Issue 49 / Spring 2024Better Builder Magazine, Issue 49 / Spring 2024
Better Builder Magazine, Issue 49 / Spring 2024
Better Builder Magazine
 
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
nonods
 
Data Communication and Computer Networks Management System Project Report.pdf
Data Communication and Computer Networks Management System Project Report.pdfData Communication and Computer Networks Management System Project Report.pdf
Data Communication and Computer Networks Management System Project Report.pdf
Kamal Acharya
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
sapna sharmap11
 
CSP_Study - Notes (Paul McNeill) 2017.pdf
CSP_Study - Notes (Paul McNeill) 2017.pdfCSP_Study - Notes (Paul McNeill) 2017.pdf
CSP_Study - Notes (Paul McNeill) 2017.pdf
Ismail Sultan
 
Technological Innovation Management And Entrepreneurship-1.pdf
Technological Innovation Management And Entrepreneurship-1.pdfTechnological Innovation Management And Entrepreneurship-1.pdf
Technological Innovation Management And Entrepreneurship-1.pdf
tanujaharish2
 
Call Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 Minutes
Call Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 MinutesCall Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 Minutes
Call Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 Minutes
kamka4105
 
🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...
🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...
🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...
sonamrawat5631
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
ShurooqTaib
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
paraasingh12 #V08
 
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Banerescorts
 
Online train ticket booking system project.pdf
Online train ticket booking system project.pdfOnline train ticket booking system project.pdf
Online train ticket booking system project.pdf
Kamal Acharya
 
🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...
🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...
🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...
AK47
 

Recently uploaded (20)

Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...
Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...
Mahipalpur Call Girls Delhi 🔥 9711199012 ❄- Pick Your Dream Call Girls with 1...
 
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUESAN INTRODUCTION OF AI & SEARCHING TECHIQUES
AN INTRODUCTION OF AI & SEARCHING TECHIQUES
 
BBOC407 Module 1.pptx Biology for Engineers
BBOC407  Module 1.pptx Biology for EngineersBBOC407  Module 1.pptx Biology for Engineers
BBOC407 Module 1.pptx Biology for Engineers
 
Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Call Girls Madurai 8824825030 Escort In Madurai service 24X7Call Girls Madurai 8824825030 Escort In Madurai service 24X7
Call Girls Madurai 8824825030 Escort In Madurai service 24X7
 
❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...
❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...
❣Unsatisfied Bhabhi Call Girls Surat 💯Call Us 🔝 7014168258 🔝💃Independent Sura...
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptxMODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
MODULE 5 BIOLOGY FOR ENGINEERS TRENDS IN BIO ENGINEERING.pptx
 
Better Builder Magazine, Issue 49 / Spring 2024
Better Builder Magazine, Issue 49 / Spring 2024Better Builder Magazine, Issue 49 / Spring 2024
Better Builder Magazine, Issue 49 / Spring 2024
 
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
一比一原版(psu学位证书)美国匹兹堡州立大学毕业证如何办理
 
Data Communication and Computer Networks Management System Project Report.pdf
Data Communication and Computer Networks Management System Project Report.pdfData Communication and Computer Networks Management System Project Report.pdf
Data Communication and Computer Networks Management System Project Report.pdf
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
 
CSP_Study - Notes (Paul McNeill) 2017.pdf
CSP_Study - Notes (Paul McNeill) 2017.pdfCSP_Study - Notes (Paul McNeill) 2017.pdf
CSP_Study - Notes (Paul McNeill) 2017.pdf
 
Technological Innovation Management And Entrepreneurship-1.pdf
Technological Innovation Management And Entrepreneurship-1.pdfTechnological Innovation Management And Entrepreneurship-1.pdf
Technological Innovation Management And Entrepreneurship-1.pdf
 
Call Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 Minutes
Call Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 MinutesCall Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 Minutes
Call Girls In Tiruppur 👯‍♀️ 7339748667 🔥 Free Home Delivery Within 30 Minutes
 
🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...
🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...
🔥Young College Call Girls Chandigarh 💯Call Us 🔝 7737669865 🔝💃Independent Chan...
 
paper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdfpaper relate Chozhavendhan et al. 2020.pdf
paper relate Chozhavendhan et al. 2020.pdf
 
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls ChennaiCall Girls Chennai +91-8824825030 Vip Call Girls Chennai
Call Girls Chennai +91-8824825030 Vip Call Girls Chennai
 
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
Hot Call Girls In Bangalore ✔ 9079923931 ✔ Hi I Am Divya Vip Call Girl Servic...
 
Online train ticket booking system project.pdf
Online train ticket booking system project.pdfOnline train ticket booking system project.pdf
Online train ticket booking system project.pdf
 
🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...
🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...
🔥Independent Call Girls In Pune 💯Call Us 🔝 7014168258 🔝💃Independent Pune Esco...
 

Parallel and Distributed Information Retrieval System

  • 1. Special Topics in Computer ScienceSpecial Topics in Computer Science Advanced Topics in Information RetrievalAdvanced Topics in Information Retrieval Lecture 7Lecture 7 (book chapter 9)(book chapter 9):: Parallel and Distributed IRParallel and Distributed IR Alexander Gelbukh www.Gelbukh.com
  • 2. Previous Chapter: ConclusionsPrevious Chapter: Conclusions  How to accelerate search? Same results as sequential  Ideas:  Quick-and-dirty rejection of bad objects, 100% recall  Fast data structure for search (based on clustering)  Careful check of all found candidates  Solution: mapping into fewer-D feature space  Condition: lower-bounding of the distance  Assumption: skewed spectrum distribution  Few coefficients concentrate energy, rest are less important
  • 3. Previous Chapter: Research topicsPrevious Chapter: Research topics  Object detection (pattern and image recognition)  Automatic feature selection  Spatial indexing data structures (more than 1D)  New types of data.  What features to select? How to determine them?  Mixed-type data (e.g., webpages, or images with sound and description)  What clustering/IR methods are better suited for what features? (What features for what methods?)  Similar methods in data mining, ...
  • 4. The problemThe problem  Very large document collections  Google: 4,000,000,000 pages  Slow response?  Solution: parallel computing  Google: 10,000 computers
  • 5. Parallel architecturesParallel architectures Data stream Single Multiple Instructionstream Single SISD classical SIMD simple Multiple MISD (rare) MIMD many SISD
  • 6. MIMD architectureMIMD architecture  The most common  Can be  tightly coupled  loosely coupled  Distributed  Many computers interacting via network  PC Clusters  Similar to MIMD computers, but greater cost of communication  very loosely coupled  More coarse-grained programs
  • 7. Performance improvementPerformance improvement Time: speedup S  Ideally, N times (number of processors)  In practice impossible  The problem does not decompose into N equal parts  Communication and control overhead  < 1 / f, where f is the largest separable fraction of the problem Cost  Per processor: S / N
  • 8. Two approaches to parallelismTwo approaches to parallelism  Build new algorithms  E.g., neural nets  Naturally parallel  Problem: to define the retrieval task  Adapt the existing techniques to parallelism  Allows relying on well-studied approaches  We will consider this option
  • 9. Ways to use parallelismWays to use parallelism  Multitasking  N search engines  Good for processing many queries Problems:  A single query is not speeded up  Bottleneck: disk access (index)  Possible solution: replicating (part of) data. RAIDs  Parallel algorithms  IR = data. Main question: how to partition the data  Document / index term matrix (terms can be LSI dimensions, signature bits, etc)
  • 10. Possible partitioningsPossible partitionings  Horizontal: document partitioning. Union of results  Vertical: term partitioning. Basically, intersect results
  • 11. Inverted files: Logical partitioningInverted files: Logical partitioning  Logical vs. physical document partitioning  Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion
  • 12. Inverted files: Logical partitioningInverted files: Logical partitioning Construction and updatingConstruction and updating  Also parallel Construction  Assign docs to processors  Order docs such that each processor has an interval  Process in parallel  Merge. Each piece is ordered already
  • 13. Inverted files:Inverted files: Physical document partitioningPhysical document partitioning  Several separate collections, one per processor  Separate indices  Then the lists are merged (they are already ordered)  Priority queue is used  The result is not sorted; Insertion is quick  The maximal element can be found quickly  First k elements can be found rather quickly  Details in the book  Consistent scores are needed  Global statistics is needed. Can be computed at index time
  • 14. Logical or physical partitioning?Logical or physical partitioning?  Logical requires less communication  Faster  Physical is more flexible. Simpler implementation  Simpler conversion of existing systems
  • 15. Inverted files:Inverted files: Term partitioningTerm partitioning  Each processor processes a part of the inverted file  The results are intersected (for AND)  (or as appropriate for Boolean operations, OR and NOT)  When term distribution in user queries is skewed, then document partitioning is better  When uniform, term partitioning is better.  Twice for long queries, 5 – 10 times for short (Web-like)
  • 16. Suffix arraysSuffix arrays  Array construction can be parallelized  merges are parallel  Document partitioning is applied straightforwardly  Each processor maintains its own suffix array  Term partitioning can be applied  Each processor owns a branch of the tree (lexicographic interval)  Bottleneck: all processors need access to the entire text
  • 17.
  • 18. Signature filesSignature files  Document partitioning: straightforward  Create query signature, distribute to each processor  Merge results (using Boolean operations if needed)  Term partitioning: shorter signatures  Merging and eliminating false drops is slow  This method is not recommended
  • 19. SIMD computersSIMD computers  Single Instruction, Multiple data  Uncommon  Good for simple operations  Bit operations in signature files  Details in the book  Ranking is supported in hardware in some computers  If signature file does not fit into memory, can be processed in batches  I/O overhead  Use multiple queries with the same batch  This improves throughput, but not response time
  • 20. …… SIMD computersSIMD computers  Inverted files are difficult to adapt to SIMD  The inverted file is restructured  Details in the book
  • 21. Distributed IRDistributed IR  MIMD with  Slow communication  Not all nodes are used for a given query  Encryption issues  Document partitioning is usually used  Term partitioning imposes greater communication overhead  Document clustering can be useful (to distribute docs by processors)  Index clusters and then search only the best ones  Another approach: use training queries, then similarity of the user query to these
  • 22. Research topicsResearch topics  How to evaluate the speedup  New algorithms  Adaptation of existing algorithms  Merging the results is a bottleneck  Meta search engines  Creating large collections with judgements  Is recall important?
  • 23. ConclusionsConclusions  Parallel computing can improve  response time for each query and/or  throughput: number of queries processed with same speed  Document partitioning is simple  good for distributed computing  Term partitioning is good for some data structures  Distributed computing is MIMD computing with slow communication  SIMD machines are good for Signature files  Both are out of favor now
  • 24. Thank you! Till May 17? 18?, 6 pm
  翻译: