ANATOMY OF FILE READ IN HADOOP
RAJESH ANANDA KUMAR
RAJESH_1290K@YAHOO.COM
A SAMPLE HADOOP CLUSTER
[Diagram: data center D1 with one name node and two racks. Rack R1: R1N1, R1N2, R1N3, R1N4. Rack R2: R2N1, R2N2, R2N3, R2N4.]

1. This is a Hadoop cluster with one name node and two racks, R1 and R2, in data center D1.
Each rack has 4 nodes, uniquely identified as R1N1, R1N2 and so on.
2. The replication factor is 3.
3. The HDFS block size is 64 MB.
4. This cluster is used as the running example to explain the concepts.
FACTS TO KNOW
1. The name node persists part of the HDFS metadata, such as file locations and permissions,
in files called the namespace image and the edit log. Files are stored in HDFS as blocks, but
the locations of those blocks are not saved in any file. Instead, they are gathered from the
data nodes every time the cluster is started and are held in the name node’s memory.
2. Replica placement: assuming a replication factor of 3, when a file is written from
a data node (say R1N1), Hadoop attempts to save the first replica on that same data
node (R1N1). The second replica is written to a node (R2N2) on a different rack
(R2). The third replica is written to another node (R2N1) on the same rack (R2) as
the second replica.
3. Hadoop takes a simple approach in which the network is represented as a tree and
the distance between two nodes is the sum of their distances to their closest
common ancestor. The levels can be, for example, “Data Center” > “Rack” > “Node”.
For example, ‘/d1/r1/n1’ represents a node named n1 on rack r1 in data
center d1. The distance calculation has 4 possible scenarios:

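1. distance(/d1/r1/n1, /d1/r1/n1) = 0 [processes on the same node]
2. distance(/d1/r1/n1, /d1/r1/n2) = 2 [different nodes on the same rack]
3. distance(/d1/r1/n1, /d1/r2/n3) = 4 [nodes on different racks in the same data center]
4. distance(/d1/r1/n1, /d2/r3/n4) = 6 [nodes in different data centers]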
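
The distance rule above can be sketched in a few lines of Python. This is only an illustration of "sum of the distances to the closest common ancestor", not Hadoop's own code; the function name `distance` and the path format are taken from the examples on this slide.

```python
def distance(a: str, b: str) -> int:
    """Tree distance between two nodes given as '/datacenter/rack/node' paths."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    # The length of the shared prefix is the depth of the closest common ancestor.
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Hops from each node up to that ancestor, added together.
    return (len(pa) - common) + (len(pb) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6
```

The last call follows from the same rule when the two nodes share no ancestor below the root.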
ANATOMY OF FILE READ – HAPPY PATH
[Diagram: the file sfo_crimes.csv is split into blocks B1, B2 and B3. Name node metadata: B1 on R1N1, R2N1, R2N2; B2 on R1N1, R2N3, R2N4; B3 on R1N1, R2N1, R2N3. Rack R1: R1N1 holds B1, B2 and B3. Rack R2: the remaining replicas are spread over R2N1, R2N2, R2N3 and R2N4.]

• Let’s assume a file named “sfo_crimes.csv” of size 192 MB is saved in this cluster.
• Also assume that the file was written from node R1N1.
• The metadata is written to the name node.
• The file is split into 3 blocks of 64 MB each, and each block is stored 3 times in the cluster.
• Along with the data, a checksum is saved with each block. It is used to verify that the data
read from the block was read without error.
• When the cluster is started, the metadata looks as shown in the top right corner.
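
The split and the replica placement for this example can be sketched as follows. This is a toy model under the assumptions of the sample cluster (two racks, 64 MB blocks, 3 replicas); names such as `place_replicas` and `write_file` are hypothetical, and the remote nodes are picked at random, so the exact nodes will differ from the ones drawn in the slides.

```python
import math
import random

BLOCK_SIZE_MB = 64
RACKS = {
    "R1": ["R1N1", "R1N2", "R1N3", "R1N4"],
    "R2": ["R2N1", "R2N2", "R2N3", "R2N4"],
}


def rack_of(node):
    """Return the rack name that contains the given node."""
    return next(rack for rack, nodes in RACKS.items() if node in nodes)


def place_replicas(writer, rng):
    """First replica on the writer, second and third on two nodes of another rack."""
    remote_rack = rng.choice([rack for rack in RACKS if rack != rack_of(writer)])
    second, third = rng.sample(RACKS[remote_rack], 2)
    return [writer, second, third]


def write_file(size_mb, writer, seed=0):
    """Split the file into blocks and choose three nodes for every block."""
    rng = random.Random(seed)
    n_blocks = math.ceil(size_mb / BLOCK_SIZE_MB)
    return {f"B{i + 1}": place_replicas(writer, rng) for i in range(n_blocks)}


metadata = write_file(192, "R1N1")
for block, nodes in metadata.items():
    print(block, nodes)
```

A 192 MB file gives ceil(192 / 64) = 3 blocks, and every block has its first replica on R1N1 and the other two on distinct nodes of rack R2.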
[Diagram: the HDFS client on the R1N2 JVM calls open() on DistributedFileSystem, which makes an RPC call to the name node to get the first few blocks of the file. The name node returns B1 (R1N1, R2N1, R2N2) and B2 (R1N1, R2N3, R2N4), which are held in the DFSInputStream wrapped by FSDataInputStream. Name node metadata: B1 on R2N2, R2N1, R1N1; B2 on R2N4, R2N3, R1N1; B3 on R2N1, R2N3, R1N1.]
• When the cluster is up and running, the name node’s metadata looks as shown here
(right-hand side).
• Let’s say we are trying to read the “sfo_crimes.csv” file from R1N2.
• So an HDFS client program will run on R1N2’s JVM.
• First, the HDFS client program calls the method open() on the Java class
DistributedFileSystem (a subclass of FileSystem).
• DFS makes an RPC call to the name node (NN), which returns the locations of the first few
blocks of the file. For each block, NN returns the addresses of the data nodes (DN) holding a
replica, ORDERED by their distance from the node where the read is performed.
• The block information is saved in a DFSInputStream, which is wrapped in an
FSDataInputStream.
• In response to ‘FileSystem.open()’, the HDFS client receives this FSDataInputStream.
[Diagram: the HDFS client on the R1N2 JVM calls read() on FSDataInputStream. DFSInputStream holds B1 (R1N1, R2N1, R2N2) and B2 (R1N1, R2N3, R2N4). DFSIS connects to R1N1 to read block B1. Data is streamed to the client directly from the data node.]

• From now on, the HDFS client deals with the FSDataInputStream (FSDIS).
• The HDFS client invokes read() on the stream.
• Blocks are read in order. DFSIS connects to the closest node (R1N1) to read block B1.
• DFSIS streams data from the data node to the client, which calls read() repeatedly
on the stream. DFSIS verifies checksums for the data transferred to the client.
• When the block has been read completely, DFSIS closes the connection.
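
The checksum verification mentioned above can be illustrated with a small sketch. It assumes one CRC-32 per 512-byte chunk purely for illustration; the constant `BYTES_PER_CHECKSUM` and the helpers `write_block` and `read_block` are hypothetical and do not mirror the on-disk format used by the data nodes.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def write_block(data):
    """Store the data as chunks, each paired with its CRC-32 checksum."""
    chunks = [
        data[i:i + BYTES_PER_CHECKSUM]
        for i in range(0, len(data), BYTES_PER_CHECKSUM)
    ]
    return [(chunk, zlib.crc32(chunk)) for chunk in chunks]


def read_block(stored):
    """Yield the chunks in order, verifying each one against its stored checksum."""
    for index, (chunk, expected) in enumerate(stored):
        if zlib.crc32(chunk) != expected:
            raise IOError(f"checksum mismatch in chunk {index}")
        yield chunk


block = write_block(b"x" * 1300)
data = b"".join(read_block(block))
print(len(block), len(data))  # 3 1300
```

The client keeps calling the equivalent of `read_block` until the block is exhausted; a mismatch stops the read of that replica.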
[Diagram: the HDFS client on the R1N2 JVM calls read() on FSDataInputStream. DFSInputStream holds B1 (R1N1, R2N1, R2N2) and B2 (R1N1, R2N3, R2N4). DFSIS connects to R1N1 to read block B2. Data is streamed to the client directly from the data node.]

• Next DFSIS attempts to read block B2. As mentioned earlier, the previous connection is
closed and a fresh connection is made to the closest node (R1N1) of block B2.
[Diagram: the HDFS client on the R1N2 JVM calls read() and finally close() on FSDataInputStream. DFSInputStream gets B3 (R1N1, R2N1, R2N3) from the name node. DFSIS connects to R1N1 to read block B3. Data is streamed to the client directly from the data node.]

• DFSIS has now read all the blocks returned by the first RPC call (B1 and B2), but the file
has not been read completely. In our case there is one more block to read.
• DFSIS calls the name node to get the data node locations for the next batch of blocks as needed.
• After the complete file has been read, the HDFS client calls close().
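
The batching of block locations across the happy path can be sketched as below. The batch size of 2 matches the slides (B1 and B2 first, then B3); the `NameNode` class, `read_file` and the event strings are hypothetical and only trace the order of calls.

```python
BLOCKS = [
    ("B1", ["R1N1", "R2N1", "R2N2"]),
    ("B2", ["R1N1", "R2N3", "R2N4"]),
    ("B3", ["R1N1", "R2N1", "R2N3"]),
]
BATCH = 2  # blocks returned per RPC call in this sketch


class NameNode:
    def __init__(self):
        self.rpc_calls = 0

    def get_block_locations(self, start, count):
        """Return the total block count and one batch of (block, replicas)."""
        self.rpc_calls += 1
        return len(BLOCKS), BLOCKS[start:start + count]


def read_file(name_node):
    """Read all blocks in order, asking for more locations only when needed."""
    events = []
    total, batch = name_node.get_block_locations(0, BATCH)  # done during open()
    position = 0
    while position < total:
        if not batch:  # current batch used up, fetch the next one
            total, batch = name_node.get_block_locations(position, BATCH)
        block, replicas = batch.pop(0)
        events.append(f"read {block} from {replicas[0]}")
        position += 1
    events.append("close()")
    return events


name_node = NameNode()
for event in read_file(name_node):
    print(event)
print("RPC calls:", name_node.rpc_calls)  # RPC calls: 2
```

Three blocks with a batch size of 2 need two calls to the name node: one during open() and one before B3.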
ANATOMY OF FILE READ – DATA NODE CONNECTION ERROR
[Diagram: the HDFS client on the R1N2 JVM calls read() on FSDataInputStream. DFSInputStream holds B1 (R1N1, R2N1, R2N2) and B2 (R1N1, R2N3, R2N4). DFSIS connects to R1N1 to read block B2. The data nodes shown are R1N1 and R2N3. Data is streamed to the client directly from the data node.]

• Let’s say there is some error while connecting to R1N1.
• DFSIS remembers this info, so it won’t try to read from R1N1 for future blocks. Then it
tries to connect to next closest node (R2N3).
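
The failover described here can be sketched as a loop over the ordered replicas with a set of nodes to avoid. The names `connect`, `read_blocks` and `dead_nodes` are hypothetical, and the `down` argument simply simulates which data nodes refuse connections.

```python
class ConnectionFailed(Exception):
    pass


def connect(node, down):
    """Simulated connection: fails for every node listed in `down`."""
    if node in down:
        raise ConnectionFailed(node)
    return node


def read_blocks(blocks, down):
    """Read each block from its closest reachable replica, skipping failed nodes."""
    dead_nodes = set()  # remembered for the rest of the read
    served_by = {}
    for block, replicas in blocks:
        for node in replicas:
            if node in dead_nodes:
                continue  # do not retry a node that already failed
            try:
                served_by[block] = connect(node, down)
                break
            except ConnectionFailed:
                dead_nodes.add(node)
        else:
            raise IOError(f"no reachable replica for {block}")
    return served_by, dead_nodes


blocks = [
    ("B2", ["R1N1", "R2N3", "R2N4"]),
    ("B3", ["R1N1", "R2N1", "R2N3"]),
]
served_by, dead_nodes = read_blocks(blocks, down={"R1N1"})
print(served_by)   # {'B2': 'R2N3', 'B3': 'R2N1'}
print(dead_nodes)  # {'R1N1'}
```

B2 falls back to R2N3 after the failure, and B3 skips R1N1 without trying it again.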
ANATOMY OF FILE READ – DATA NODE CHECKSUM ERROR
[Diagram: the HDFS client on the R1N2 JVM calls read() on FSDataInputStream. DFSInputStream holds B1 (R1N1, R2N1, R2N2) and B2 (R1N1, R2N3, R2N4). DFSIS connects to R1N1 to read block B2 and informs the name node that the block on R1N1 is corrupt. The data nodes shown are R1N1 and R2N3. Data is streamed to the client directly from the data node.]

• Let’s say there is a checksum error. This means the block is corrupt.
• Information about this corrupt block is sent to name node. Then DFSIS tries to connect
to next closest node (R2N3).
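
The checksum-error path can be sketched in the same style. This is a simplified model: each replica is stored as a (data, checksum) pair, and `report_bad_block` is a hypothetical method standing in for the message that tells the name node about the corrupt replica.

```python
import zlib


class NameNode:
    def __init__(self):
        self.corrupt = []

    def report_bad_block(self, block, node):
        """Record that the replica of `block` on `node` is corrupt."""
        self.corrupt.append((block, node))


def read_block(block, replicas, storage, name_node):
    """Try replicas in order; report corrupt ones and move to the next node."""
    for node in replicas:
        data, stored_crc = storage[node][block]
        if zlib.crc32(data) == stored_crc:
            return node, data
        name_node.report_bad_block(block, node)
    raise IOError(f"all replicas of {block} are corrupt")


good = b"crime records"
crc = zlib.crc32(good)
storage = {
    "R1N1": {"B2": (b"crime recorfs", crc)},  # one byte damaged on disk
    "R2N3": {"B2": (good, crc)},
    "R2N4": {"B2": (good, crc)},
}
name_node = NameNode()
node, data = read_block("B2", ["R1N1", "R2N3", "R2N4"], storage, name_node)
print(node, name_node.corrupt)  # R2N3 [('B2', 'R1N1')]
```

The read succeeds from R2N3, and the name node now knows that the copy of B2 on R1N1 is bad.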
THE END

SORRY FOR MY POOR ENGLISH. 
PLEASE SEND YOUR VALUABLE FEEDBACK TO
RAJESH_1290K@YAHOO.COM
