Presentation on
Role of Data Cleaning in Data Warehouse
Ramakant Soni
Assistant Professor, BKBIET, Pilani
ramakant.soni@bkbiet.ac.in
❖ Introduction
What is a Data Warehouse?
A data warehouse is an information delivery system that integrates and
transforms data into information used largely for strategic decision making.
Historical data from the enterprise's various operational systems is collected
and combined with other relevant data from outside sources to form the
integrated content of the data warehouse.
What is Data Cleaning?
Data cleaning, also called data cleansing or scrubbing, deals with detecting and
removing errors and inconsistencies from data in order to improve the quality
of data.
RAMAKANT SONI, BKBIET
❖ Steps to Build a Data Warehouse: ETL Process
Figure 1. ETL Process
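As a rough illustration of the extract, transform, load sequence named in Figure 1, the following Python sketch uses only the standard library. The source rows, the `customers` table, and the column names are assumptions made up for this example; they do not come from the slides.

```python
import sqlite3

# Hypothetical operational-source rows (assumed for illustration only).
SOURCE_ROWS = [
    {"name": " alice smith ", "city": "pilani"},
    {"name": "Bob Jones", "city": "JAIPUR "},
]

def extract():
    """Extract: read raw rows from the (assumed) operational source."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: trim whitespace and normalize capitalization."""
    return [
        {"name": r["name"].strip().title(), "city": r["city"].strip().title()}
        for r in rows
    ]

def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:name, :city)", rows)
    conn.commit()

def run_etl(conn):
    """Run the three steps in order and return the loaded rows."""
    load(transform(extract()), conn)
    return conn.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs the whole sequence against an in-memory database, so the cleaning happens in the transform step before anything reaches the warehouse table.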
❖ Need for Data Cleaning
• Data warehouses require and provide extensive support for data cleaning.
• They load and continuously refresh huge amounts of data from a variety of
sources, so the probability of "dirty data" is high.
• Data warehouses are used for decision making, so the correctness of data
is vital to avoid wrong conclusions.
❖ Requirements
A data cleaning approach should satisfy several requirements:
• Detect and remove all major errors and inconsistencies both in individual
data sources and when integrating multiple sources. The approach should
be supported by tools to limit manual inspection and programming effort.
• Data cleaning should not be performed in isolation but together with
schema-related data transformations based on comprehensive metadata.
• Mapping functions should be specified in a declarative way for data
cleaning and be reusable for other data sources as well as for query
processing.
• A workflow infrastructure should be supported to execute all data
transformation steps for multiple sources and large data sets in a reliable
and efficient way.
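The requirement for declarative, reusable mapping functions can be sketched as a rule table that is kept separate from the code applying it. This is only one possible reading of that requirement; the rule format, the field names, and the sample record below are hypothetical.

```python
# Declarative mapping: target field -> (source field, cleaning function).
# The rule table is data, so it can be reused for any source with these fields.
MAPPING_RULES = {
    "name":  ("cust_name", lambda v: " ".join(v.split()).title()),
    "phone": ("tel",       lambda v: "".join(ch for ch in v if ch.isdigit())),
    "city":  ("town",      lambda v: v.strip().title()),
}

def apply_mapping(record, rules=MAPPING_RULES):
    """Apply every rule to one source record and return the target record."""
    return {
        target: clean(record[source])
        for target, (source, clean) in rules.items()
    }

# Hypothetical dirty record from an operational source.
raw = {"cust_name": "  ramesh   kumar ", "tel": "(0159) 624-2001", "town": " pilani"}
cleaned = apply_mapping(raw)
```

Because the rules are plain data, a second source with different field names only needs its own rule table, not new cleaning code.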
❖ Data Quality Problems
❖ Single-source problems
The data quality of a source largely depends on the degree to which it is governed by
a schema and integrity constraints controlling permissible data values.
• Sources without a schema, such as files, have few restrictions on what data can be
entered and stored, giving rise to a high probability of errors and inconsistencies.
• Database systems enforce the restrictions of a specific data model (e.g., the relational
approach requires simple attribute values, referential integrity, etc.) as well as
application-specific integrity constraints.
Schema-level problems occur because of the lack of appropriate model-specific or
application-specific integrity constraints.
Instance-level problems relate to errors and inconsistencies that cannot be prevented
at the schema level (e.g., misspellings).
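The difference between the two problem classes can be shown with a small sketch using Python's built-in `sqlite3` module. The `employee` table, its constraints, and the sample rows are assumptions chosen for illustration: the constraints stop an illegal age and a duplicate key, but a misspelled city passes because no schema rule can detect it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema-level protection: integrity constraints reject impermissible values.
conn.execute(
    """
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        age    INTEGER CHECK (age BETWEEN 18 AND 70),
        city   TEXT NOT NULL
    )
    """
)

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted_bad_age = try_insert((1, "Asha Verma", 230, "Pilani"))       # violates CHECK
accepted_first = try_insert((2, "Ravi Rao", 35, "Jaipur"))            # valid row
accepted_duplicate_id = try_insert((2, "Ravi Rao", 35, "Jaipur"))     # violates PRIMARY KEY
accepted_misspelling = try_insert((3, "Meena Shah", 41, "Pilanii"))   # instance-level error
```

Only the misspelled row needs later instance-level cleaning; the other two bad rows never enter the table.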
❖ Example: Single Source Problem
❖ Multi-source problems
The problems in single sources are aggravated when multiple sources are integrated.
Each source may contain dirty data, and because the sources are developed and
maintained independently, their data may be represented differently, overlap, or
contradict.
Result: a large degree of heterogeneity.
Problem in cleaning: identifying overlapping data, in particular matching records
that refer to the same real-world entity. This problem is also referred to as the object
identity problem or the duplicate elimination problem.
Frequently, the information is only partially redundant and the sources may
complement each other by providing additional information about an entity.
Solution: duplicate information should be purged and complementary information
should be consolidated and merged in order to achieve a consistent view of real-world
entities.
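The purge-and-merge idea can be sketched in a few lines of Python. This is a deliberately simplified sketch: it matches records on a normalized name and city, whereas practical duplicate elimination usually needs approximate (fuzzy) matching. The two sources, their fields, and the sample values are hypothetical.

```python
def match_key(record):
    """Normalize name and city so trivially different spellings compare equal."""
    name = " ".join(record["name"].lower().replace(".", " ").split())
    return (name, record["city"].strip().lower())

def merge(a, b):
    """Consolidate two matching records: keep the first non-empty value per field."""
    return {field: a.get(field) or b.get(field) for field in sorted(set(a) | set(b))}

def consolidate(*sources):
    """Purge duplicates across sources and merge complementary information."""
    entities = {}
    for source in sources:
        for record in source:
            key = match_key(record)
            if key in entities:
                entities[key] = merge(entities[key], record)
            else:
                entities[key] = dict(record)
    return list(entities.values())

# Two hypothetical sources describing overlapping customers.
source_a = [{"name": "Kristen Smith", "city": "Pilani", "phone": "555-0101"}]
source_b = [
    {"name": "kristen  smith", "city": "pilani ", "email": "ks@example.com"},
    {"name": "Anil Gupta", "city": "Jaipur", "email": "ag@example.com"},
]
customers = consolidate(source_a, source_b)
```

The two records for Kristen Smith collapse into one entity that carries both the phone number from the first source and the email address from the second.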
❖ Example: Multi-Source Problem
Figure 2. Multi-Source problem example
❖ Data cleaning Phases
In general, data cleaning involves several phases:
• Data analysis
• Definition of transformation workflow and mapping rules
• Verification
• Transformation
• Backflow of cleaned data
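The first phase, data analysis, can be illustrated with a minimal column-profiling sketch. The `profile` helper, the `city` column, and the sample rows are assumptions for this example; real profiling tools compute many more statistics.

```python
from collections import Counter

def profile(rows, column):
    """Summarize one column: missing values, distinct values, and rare values."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Values seen only once are candidates for misspellings.
        "rare": sorted(v for v, n in counts.items() if n == 1),
    }

# Hypothetical source rows with one misspelling and two missing values.
rows = [
    {"city": "Pilani"}, {"city": "Pilani"}, {"city": "Jaipur"},
    {"city": "Jaipur"}, {"city": "Pilanii"}, {"city": ""}, {},
]
report = profile(rows, "city")
```

A report like this feeds the next phase: the rare value suggests a mapping rule that corrects the spelling, and the missing count shows how many records need a default or a lookup.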
❖ Data cleaning process
Stages shown: (1) Data analysis & defining transformation workflow, mapping rules; (2) Verification & transformation; (3) Backflow of cleaned data
Figure 3. Data Cleaning Process
❖ Data cleaning Tool support
A large variety of tools is available to support data transformation and data cleaning:
• Data analysis tools
1. Data profiling tools, e.g., MigrationArchitect (Evoke Software)
2. Data mining tools, e.g., WizRule (WizSoft)
• Data reengineering tools use discovered patterns and rules for cleaning,
e.g., Integrity (Vality Software)
• Specialized cleaning tools deal with a particular domain
1. Special domain cleaning, e.g., IDCentric (FirstLogic)
2. Duplicate elimination, e.g., MatchIt (HelpItSystems)
• ETL tools use a repository built on a DBMS to manage all metadata about data sources,
target schema, mapping scripts, etc. in a uniform way,
e.g., Extract (ETI), CopyManager (InformationBuilders)
Thank You
RAMAKANT SONI, BKBIET

More Related Content

What's hot

Web mining
Web mining Web mining
Web mining
TeklayBirhane
ย 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
DataminingTools Inc
ย 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
King Julian
ย 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
AbdullahAbbasi55
ย 
Database Management System ppt
Database Management System pptDatabase Management System ppt
Database Management System ppt
OECLIB Odisha Electronics Control Library
ย 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
Acad
ย 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
Sunita Sahu
ย 
Components of a Data-Warehouse
Components of a Data-WarehouseComponents of a Data-Warehouse
Components of a Data-Warehouse
Abdul Aslam
ย 
Data mining
Data miningData mining
Data mining
Birju Tank
ย 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
ย 
Logical database design and the relational model(database)
Logical database design and the relational model(database)Logical database design and the relational model(database)
Logical database design and the relational model(database)
welcometofacebook
ย 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
Qubole
ย 
Types of Database Models
Types of Database ModelsTypes of Database Models
Types of Database Models
Murassa Gillani
ย 
Data mining
Data miningData mining
Data mining
Samir Sabry
ย 
Data warehouse 21 snowflake schema
Data warehouse 21 snowflake schemaData warehouse 21 snowflake schema
Data warehouse 21 snowflake schema
Vaibhav Khanna
ย 
data mining
data miningdata mining
data mining
manasa polu
ย 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)
Harish Chand
ย 
All data models in dbms
All data models in dbmsAll data models in dbms
All data models in dbms
Naresh Kumar
ย 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
ย 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
vivekjv
ย 

What's hot (20)

Web mining
Web mining Web mining
Web mining
ย 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
ย 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
ย 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
ย 
Database Management System ppt
Database Management System pptDatabase Management System ppt
Database Management System ppt
ย 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
ย 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
ย 
Components of a Data-Warehouse
Components of a Data-WarehouseComponents of a Data-Warehouse
Components of a Data-Warehouse
ย 
Data mining
Data miningData mining
Data mining
ย 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ย 
Logical database design and the relational model(database)
Logical database design and the relational model(database)Logical database design and the relational model(database)
Logical database design and the relational model(database)
ย 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
ย 
Types of Database Models
Types of Database ModelsTypes of Database Models
Types of Database Models
ย 
Data mining
Data miningData mining
Data mining
ย 
Data warehouse 21 snowflake schema
Data warehouse 21 snowflake schemaData warehouse 21 snowflake schema
Data warehouse 21 snowflake schema
ย 
data mining
data miningdata mining
data mining
ย 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)
ย 
All data models in dbms
All data models in dbmsAll data models in dbms
All data models in dbms
ย 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
ย 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
ย 

Viewers also liked

Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
Amir Masoud Sefidian
ย 
Data cleansing
Data cleansingData cleansing
Data cleansing
kunaljain1701
ย 
Data Cleaning Process
Data Cleaning ProcessData Cleaning Process
Data Cleaning Process
InfoCheckPoint
ย 
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data CleaningBrief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
Jennifer Morrow
ย 
Ontology-driven KDD Process Composition
Ontology-driven KDD Process CompositionOntology-driven KDD Process Composition
Ontology-driven KDD Process Composition
Emanuele Storti
ย 
14.machine learning
14.machine learning14.machine learning
14.machine learning
Abhijeet Kadam
ย 
26.docking
26.docking26.docking
26.docking
Abhijeet Kadam
ย 
7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene
CloudFixer
ย 
WEKA - A Data Mining Tool - by Shareek Ahamed
WEKA - A Data Mining Tool - by Shareek AhamedWEKA - A Data Mining Tool - by Shareek Ahamed
WEKA - A Data Mining Tool - by Shareek Ahamed
Shareek Ahamed
ย 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screening
Hassan Hussein
ย 
Data Quality - The Cleansing Process
Data Quality - The Cleansing ProcessData Quality - The Cleansing Process
Data Quality - The Cleansing Process
InfoCheckPoint
ย 
PTC Live: Integrating PTC Windchill with Cadence PCB Design
PTC Live: Integrating PTC Windchill with Cadence PCB DesignPTC Live: Integrating PTC Windchill with Cadence PCB Design
PTC Live: Integrating PTC Windchill with Cadence PCB Design
EMA Design Automation
ย 
An Introduction To Weka
An Introduction To WekaAn Introduction To Weka
An Introduction To Weka
weka Content
ย 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration tool
Alex Rayรณn Jerez
ย 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
Yanchang Zhao
ย 
Weka presentation
Weka presentationWeka presentation
Weka presentation
Saeed Iqbal
ย 
Presentation on Data Cleansing
Presentation on Data CleansingPresentation on Data Cleansing
Presentation on Data Cleansing
ng8
ย 
Datacube
DatacubeDatacube
Datacube
man2sandsce17
ย 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)
Revolution Analytics
ย 
Multidimentional data model
Multidimentional data modelMultidimentional data model
Multidimentional data model
jagdish_93
ย 

Viewers also liked (20)

Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
ย 
Data cleansing
Data cleansingData cleansing
Data cleansing
ย 
Data Cleaning Process
Data Cleaning ProcessData Cleaning Process
Data Cleaning Process
ย 
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data CleaningBrief Introduction to the 12 Steps of Evaluation Data Cleaning
Brief Introduction to the 12 Steps of Evaluation Data Cleaning
ย 
Ontology-driven KDD Process Composition
Ontology-driven KDD Process CompositionOntology-driven KDD Process Composition
Ontology-driven KDD Process Composition
ย 
14.machine learning
14.machine learning14.machine learning
14.machine learning
ย 
26.docking
26.docking26.docking
26.docking
ย 
7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene
ย 
WEKA - A Data Mining Tool - by Shareek Ahamed
WEKA - A Data Mining Tool - by Shareek AhamedWEKA - A Data Mining Tool - by Shareek Ahamed
WEKA - A Data Mining Tool - by Shareek Ahamed
ย 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screening
ย 
Data Quality - The Cleansing Process
Data Quality - The Cleansing ProcessData Quality - The Cleansing Process
Data Quality - The Cleansing Process
ย 
PTC Live: Integrating PTC Windchill with Cadence PCB Design
PTC Live: Integrating PTC Windchill with Cadence PCB DesignPTC Live: Integrating PTC Windchill with Cadence PCB Design
PTC Live: Integrating PTC Windchill with Cadence PCB Design
ย 
An Introduction To Weka
An Introduction To WekaAn Introduction To Weka
An Introduction To Weka
ย 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration tool
ย 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
ย 
Weka presentation
Weka presentationWeka presentation
Weka presentation
ย 
Presentation on Data Cleansing
Presentation on Data CleansingPresentation on Data Cleansing
Presentation on Data Cleansing
ย 
Datacube
DatacubeDatacube
Datacube
ย 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)
ย 
Multidimentional data model
Multidimentional data modelMultidimentional data model
Multidimentional data model
ย 

Similar to Role of Data Cleaning in Data Warehouse

Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
Dhilsath Fathima
ย 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing Terminology
Dev EngineersSaathi
ย 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
Nathan Bijnens
ย 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
sumit621
ย 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
Y Parandama Reddy
ย 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guide
thomasmary607
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
ย 
ijcatr04081001
ijcatr04081001ijcatr04081001
ijcatr04081001
reagan muriithi
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
ย 
The Shifting Landscape of Data Integration
The Shifting Landscape of Data IntegrationThe Shifting Landscape of Data Integration
The Shifting Landscape of Data Integration
DATAVERSITY
ย 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
Neo4j
ย 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
Costa Pissaris
ย 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
SpringPeople
ย 
Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...
Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...
Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...
Denodo
ย 
Data Mesh
Data MeshData Mesh
Data Mesh
Piethein Strengholt
ย 
Data warehouse
Data warehouseData warehouse
Data warehouse
amna alhabib
ย 
Database :Introduction to Database System
Database :Introduction to Database SystemDatabase :Introduction to Database System
Database :Introduction to Database System
ZakriyaMalik2
ย 
Intro.pptx
Intro.pptxIntro.pptx
Intro.pptx
NithyasriA2
ย 

Similar to Role of Data Cleaning in Data Warehouse (20)

Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
ย 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing Terminology
ย 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
ย 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
ย 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
ย 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guide
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
ย 
ijcatr04081001
ijcatr04081001ijcatr04081001
ijcatr04081001
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
ย 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
ย 
The Shifting Landscape of Data Integration
The Shifting Landscape of Data IntegrationThe Shifting Landscape of Data Integration
The Shifting Landscape of Data Integration
ย 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
ย 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
ย 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
ย 
Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...
Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...
Myth Busters: Iโ€™m Building a Data Lake, So I Donโ€™t Need Data Virtualization (...
ย 
Data Mesh
Data MeshData Mesh
Data Mesh
ย 
Data warehouse
Data warehouseData warehouse
Data warehouse
ย 
Database :Introduction to Database System
Database :Introduction to Database SystemDatabase :Introduction to Database System
Database :Introduction to Database System
ย 
Intro.pptx
Intro.pptxIntro.pptx
Intro.pptx
ย 

More from Ramakant Soni

GATE 2021 Exam Information
GATE 2021 Exam InformationGATE 2021 Exam Information
GATE 2021 Exam Information
Ramakant Soni
ย 
What is Algorithm - An Overview
What is Algorithm - An OverviewWhat is Algorithm - An Overview
What is Algorithm - An Overview
Ramakant Soni
ย 
Internet of things
Internet of thingsInternet of things
Internet of things
Ramakant Soni
ย 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
Ramakant Soni
ย 
Huffman and Arithmetic coding - Performance analysis
Huffman and Arithmetic coding - Performance analysisHuffman and Arithmetic coding - Performance analysis
Huffman and Arithmetic coding - Performance analysis
Ramakant Soni
ย 
UML daigrams for Bank ATM system
UML daigrams for Bank ATM systemUML daigrams for Bank ATM system
UML daigrams for Bank ATM system
Ramakant Soni
ย 
Collaboration diagram- UML diagram
Collaboration diagram- UML diagram Collaboration diagram- UML diagram
Collaboration diagram- UML diagram
Ramakant Soni
ย 
Activity diagram-UML diagram
Activity diagram-UML diagramActivity diagram-UML diagram
Activity diagram-UML diagram
Ramakant Soni
ย 
Sequence diagram- UML diagram
Sequence diagram- UML diagramSequence diagram- UML diagram
Sequence diagram- UML diagram
Ramakant Soni
ย 
Class diagram- UML diagram
Class diagram- UML diagramClass diagram- UML diagram
Class diagram- UML diagram
Ramakant Soni
ย 
Use Case diagram-UML diagram-2
Use Case diagram-UML diagram-2Use Case diagram-UML diagram-2
Use Case diagram-UML diagram-2
Ramakant Soni
ย 
Use Case diagram-UML diagram-1
Use Case diagram-UML diagram-1Use Case diagram-UML diagram-1
Use Case diagram-UML diagram-1
Ramakant Soni
ย 
UML Diagrams- Unified Modeling Language Introduction
UML Diagrams- Unified Modeling Language IntroductionUML Diagrams- Unified Modeling Language Introduction
UML Diagrams- Unified Modeling Language Introduction
Ramakant Soni
ย 

More from Ramakant Soni (13)

GATE 2021 Exam Information
GATE 2021 Exam InformationGATE 2021 Exam Information
GATE 2021 Exam Information
ย 
What is Algorithm - An Overview
What is Algorithm - An OverviewWhat is Algorithm - An Overview
What is Algorithm - An Overview
ย 
Internet of things
Internet of thingsInternet of things
Internet of things
ย 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
ย 
Huffman and Arithmetic coding - Performance analysis
Huffman and Arithmetic coding - Performance analysisHuffman and Arithmetic coding - Performance analysis
Huffman and Arithmetic coding - Performance analysis
ย 
UML daigrams for Bank ATM system
UML daigrams for Bank ATM systemUML daigrams for Bank ATM system
UML daigrams for Bank ATM system
ย 
Collaboration diagram- UML diagram
Collaboration diagram- UML diagram Collaboration diagram- UML diagram
Collaboration diagram- UML diagram
ย 
Activity diagram-UML diagram
Activity diagram-UML diagramActivity diagram-UML diagram
Activity diagram-UML diagram
ย 
Sequence diagram- UML diagram
Sequence diagram- UML diagramSequence diagram- UML diagram
Sequence diagram- UML diagram
ย 
Class diagram- UML diagram
Class diagram- UML diagramClass diagram- UML diagram
Class diagram- UML diagram
ย 
Use Case diagram-UML diagram-2
Use Case diagram-UML diagram-2Use Case diagram-UML diagram-2
Use Case diagram-UML diagram-2
ย 
Use Case diagram-UML diagram-1
Use Case diagram-UML diagram-1Use Case diagram-UML diagram-1
Use Case diagram-UML diagram-1
ย 
UML Diagrams- Unified Modeling Language Introduction
UML Diagrams- Unified Modeling Language IntroductionUML Diagrams- Unified Modeling Language Introduction
UML Diagrams- Unified Modeling Language Introduction
ย 

Recently uploaded

(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"
MJDuyan
ย 
How to Create User Notification in Odoo 17
How to Create User Notification in Odoo 17How to Create User Notification in Odoo 17
How to Create User Notification in Odoo 17
Celine George
ย 
Creating Images and Videos through AI.pptx
Creating Images and Videos through AI.pptxCreating Images and Videos through AI.pptx
Creating Images and Videos through AI.pptx
Forum of Blended Learning
ย 
220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology
Kalna College
ย 
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
220711130100 udita Chakraborty  Aims and objectives of national policy on inf...220711130100 udita Chakraborty  Aims and objectives of national policy on inf...
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
Kalna College
ย 
Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024
khabri85
ย 
Library news letter Kitengesa Uganda June 2024
Library news letter Kitengesa Uganda June 2024Library news letter Kitengesa Uganda June 2024
Library news letter Kitengesa Uganda June 2024
Friends of African Village Libraries
ย 
What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17
Celine George
ย 
The basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptxThe basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptx
heathfieldcps1
ย 
220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx
Kalna College
ย 
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
yarusun
ย 
Information and Communication Technology in Education
Information and Communication Technology in EducationInformation and Communication Technology in Education
Information and Communication Technology in Education
MJDuyan
ย 
How to Create a Stage or a Pipeline in Odoo 17 CRM
How to Create a Stage or a Pipeline in Odoo 17 CRMHow to Create a Stage or a Pipeline in Odoo 17 CRM
How to Create a Stage or a Pipeline in Odoo 17 CRM
Celine George
ย 
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
biruktesfaye27
ย 
Keynote given on June 24 for MASSP at Grand Traverse City
Keynote given on June 24 for MASSP at Grand Traverse CityKeynote given on June 24 for MASSP at Grand Traverse City
Keynote given on June 24 for MASSP at Grand Traverse City
PJ Caposey
ย 
220711130095 Tanu Pandey message currency, communication speed & control EPC ...
220711130095 Tanu Pandey message currency, communication speed & control EPC ...220711130095 Tanu Pandey message currency, communication speed & control EPC ...
220711130095 Tanu Pandey message currency, communication speed & control EPC ...
Kalna College
ย 
pol sci Election and Representation Class 11 Notes.pdf
pol sci Election and Representation Class 11 Notes.pdfpol sci Election and Representation Class 11 Notes.pdf
pol sci Election and Representation Class 11 Notes.pdf
BiplabHalder13
ย 
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptxCapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapitolTechU
ย 
Diversity Quiz Finals by Quiz Club, IIT Kanpur
Diversity Quiz Finals by Quiz Club, IIT KanpurDiversity Quiz Finals by Quiz Club, IIT Kanpur
Diversity Quiz Finals by Quiz Club, IIT Kanpur
Quiz Club IIT Kanpur
ย 
220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science
Kalna College
ย 

Role of Data Cleaning in Data Warehouse

  • 1. Presentation on the Role of Data Cleaning in Data Warehouse. Ramakant Soni, Assistant Professor, BKBIET, Pilani (ramakant.soni@bkbiet.ac.in)
  • 2. Introduction
      What is a data warehouse? A data warehouse is an information delivery system in which data is integrated and transformed into information used largely for strategic decision making. Historical data from the enterprise's various operational systems is collected and combined with other relevant data from outside sources to form the integrated content of the data warehouse.
      What is data cleaning? Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality.
  • 3. Steps to build a Data Warehouse: ETL Process. Figure 1. ETL Process
  • 4. Need for Data Cleaning
      • Data warehouses require and provide extensive support for data cleaning.
      • They load and continuously refresh huge amounts of data from a variety of sources, so the probability of "dirty data" is high.
      • Data warehouses are used for decision making, so the correctness of data is vital to avoid wrong conclusions.
  • 5. Requirements
      A data cleaning approach should satisfy several requirements:
      • Detect and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources. The approach should be supported by tools to limit manual inspection and programming effort.
      • Data cleaning should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata.
      • Mapping functions for data cleaning should be specified in a declarative way and be reusable for other data sources as well as for query processing.
      • A workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.
  • 6. Data Quality Problems
  • 7. Single-Source Problems
      The data quality of a source largely depends on the degree to which it is governed by a schema and integrity constraints that control permissible data values.
      • Sources without a schema, such as files, place few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies.
      • Database systems enforce the restrictions of a specific data model (e.g., the relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific integrity constraints.
      Schema-level problems occur because of the lack of appropriate model-specific or application-specific integrity constraints.
      Instance-level problems relate to errors and inconsistencies that cannot be prevented at the schema level (e.g., misspellings).
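The distinction between schema-level and instance-level problems can be illustrated with a minimal Python sketch. The records, field names (`id`, `name`, `age`, `dept_id`) and the age range are hypothetical assumptions made for illustration and do not come from the slides. The first function checks constraints that a schema could have enforced; the second repairs a formatting problem that no schema constraint would catch.

```python
# Hypothetical employee records; all field names and values are illustrative.
records = [
    {"id": 1, "name": "Alice Smith", "age": 34, "dept_id": 10},
    {"id": 2, "name": "Bob Jones", "age": -5, "dept_id": 10},      # illegal value
    {"id": 2, "name": "Carol White", "age": 41, "dept_id": 99},    # duplicate key, dangling reference
    {"id": 3, "name": "  dave  brown", "age": 29, "dept_id": 20},  # instance-level formatting problem
]
departments = {10, 20}  # assumed set of valid department keys


def schema_level_violations(rows, valid_depts):
    """Report violations that integrity constraints could have prevented."""
    violations = []
    seen_ids = set()
    for row in rows:
        if not 0 <= row["age"] <= 150:        # domain (range) constraint
            violations.append((row["id"], "age out of range"))
        if row["id"] in seen_ids:             # uniqueness constraint
            violations.append((row["id"], "duplicate key"))
        seen_ids.add(row["id"])
        if row["dept_id"] not in valid_depts:  # referential integrity
            violations.append((row["id"], "unknown dept_id"))
    return violations


def normalize_name(name):
    """Repair an instance-level problem: stray whitespace and inconsistent case."""
    return " ".join(part.capitalize() for part in name.split())


violations = schema_level_violations(records, departments)
cleaned_names = [normalize_name(row["name"]) for row in records]
```

In a real database, the first group of checks would normally be declared as constraints (range check, primary key, foreign key) rather than coded by hand; the sketch only shows what goes unchecked when a source lacks them.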
  • 8. Example: Single-Source Problem
  • 9. Multi-Source Problems
      The problems present in single sources are aggravated when multiple sources are integrated. Each source may contain dirty data, and because the sources are independent, their data may be represented differently, overlap, or contradict each other.
      Result: a large degree of heterogeneity.
      Problem in cleaning: identifying overlapping data, in particular matching records that refer to the same real-world entity. This is also referred to as the object identity problem or the duplicate elimination problem. Frequently, the information is only partially redundant, and the sources may complement each other by providing additional information about an entity.
      Solution: duplicate information should be purged, and complementary information should be consolidated and merged to achieve a consistent view of real-world entities.
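As a rough sketch of the purge-and-merge idea on slide 9, the Python below matches records from two sources by name similarity and merges complementary fields. It is not the method of any particular tool: the record layout, the helper names, and the 0.85 threshold are assumptions chosen for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop simple punctuation, and collapse whitespace before comparing."""
    return " ".join(text.lower().replace(".", " ").replace(",", " ").split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def merge(a, b):
    """Keep the values of a and fill its empty fields from b."""
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) in (None, ""):
            merged[key] = value
    return merged


def consolidate(source_a, source_b, threshold=0.85):
    """Merge records judged to describe the same entity; keep all others as they are."""
    result = []
    unmatched_b = list(source_b)
    for rec_a in source_a:
        match = None
        for rec_b in unmatched_b:
            if similarity(rec_a["name"], rec_b["name"]) >= threshold:
                match = rec_b
                break
        if match is not None:
            unmatched_b.remove(match)          # the duplicate is purged
            result.append(merge(rec_a, match))  # complementary data is merged
        else:
            result.append(dict(rec_a))
    result.extend(dict(rec) for rec in unmatched_b)
    return result


# Hypothetical sources describing overlapping customers.
source_a = [{"name": "Kristen Smith", "phone": "", "city": "Leipzig"}]
source_b = [
    {"name": "kristen  smith.", "phone": "555-0100", "city": ""},
    {"name": "Peter Miller", "phone": "555-0199", "city": "Pilani"},
]
customers = consolidate(source_a, source_b)
```

Matching on a single attribute with a fixed threshold is a deliberate simplification; practical duplicate elimination usually compares several attributes and needs manual review of borderline matches.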
  • 10. Example: Multi-Source Problem. Figure 2. Multi-Source problem example
  • 11. Data Cleaning Phases
      In general, data cleaning involves several phases:
      • Data analysis
      • Definition of transformation workflow and mapping rules
      • Verification
      • Transformation
      • Backflow of cleaned data
  • 12. Data Cleaning Process: data analysis and definition of the transformation workflow and mapping rules, followed by verification and transformation, followed by backflow of cleaned data. Figure 3. Data Cleaning Process
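The phases listed in the deck can be sketched end to end on a single column. This is a toy illustration under stated assumptions: the `gender` column, the sample rows, and the rule table are hypothetical, and the final backflow step (writing cleaned values back to the operational sources) is only noted in a comment, not implemented.

```python
from collections import Counter

# Hypothetical sample rows with inconsistent encodings of one attribute.
rows = [{"gender": "M"}, {"gender": "male"}, {"gender": "F"}, {"gender": "f"}, {"gender": "?"}]


def profile(rows, column):
    """Data analysis: count the distinct values that occur in a column."""
    return Counter(row[column] for row in rows)


# Mapping rules, written declaratively as a lookup table (assumed values).
GENDER_RULES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def apply_rules(value, rules, default=None):
    """Map a raw value to its canonical form, or to default if no rule applies."""
    return rules.get(value.strip().lower(), default)


def verify(rows, column, rules):
    """Verification: list the raw values that the rules do not cover."""
    return [row[column] for row in rows if apply_rules(row[column], rules) is None]


def transform(rows, column, rules):
    """Transformation: return new rows with the column rewritten by the rules."""
    return [{**row, column: apply_rules(row[column], rules)} for row in rows]


value_counts = profile(rows, "gender")
uncovered = verify(rows, "gender", GENDER_RULES)
cleaned = transform(rows, "gender", GENDER_RULES)
# Backflow: the cleaned values would then replace the dirty ones in the
# original sources; that step depends on the source systems and is omitted.
```

Keeping the rules in a table separate from the code that applies them is one simple way to make a mapping reusable across sources, in the spirit of the declarative-mapping requirement on slide 5.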
  • 13. Data Cleaning Tool Support
      A large variety of tools is available to support data transformation and data cleaning:
      • Data analysis tools
          1. Data profiling tools, e.g. MigrationArchitect (Evoke Software)
          2. Data mining tools, e.g. WizRule (WizSoft)
      • Data reengineering tools use discovered patterns and rules for cleaning, e.g. Integrity (Vality Software)
      • Specialized cleaning tools deal with a particular domain
          1. Special domain cleaning, e.g. IDCentric (FirstLogic)
          2. Duplicate elimination, e.g. MatchIt (HelpItSystems)
      • ETL tools use a repository built on a DBMS to manage all metadata about data sources, target schemas, mapping scripts, etc. in a uniform way, e.g. Extract (ETI), CopyManager (InformationBuilders)
  • 14. References
      1. Erhard Rahm, Hong Hai Do (University of Leipzig): Data Cleaning: Problems and Current Approaches
      2. Shridhar B. Dandin (BKBIET Pilani): Data cleaning, a problem that is redolent of Data Integration in Data Warehousing
      3. Arthur D. Chapman: Principles and methods of data cleaning