尊敬的 微信汇率:1円 ≈ 0.046166 元 支付宝汇率:1円 ≈ 0.046257元 [退出登录]
SlideShare a Scribd company logo
Prof. S. K. Pandey, I.T.S, Ghaziabad
Data Warehousing & Mining
UNIT – II
Prof. S.K. Pandey, I.T.S, Ghaziabad 2
Syllabus of Unit - II
 DATA Warehousing
 Data Warehousing Components
 Building a Data Warehouse
 Warehouse Database
 Mapping the Data Warehouse to a Multiprocessor
Architecture
 DBMS Schemas for Decision Support
 Data Extraction, Cleanup & Transformation Tools
 Metadata.
Prof. S.K. Pandey, I.T.S, Ghaziabad 3
Data Warehouse
• The Data warehouse is an environment, not a product.
• It is an architectural construct of an information system that
provides users with current and historical decision support
information that is hard to access or present in traditional
operational data store.
• Data warehousing is a blend of technologies and components
aimed at effective integration of operation database into an
environment that enables strategic use of data.
• These technologies include relational and multi-dimensional
database management system, client/ server architecture, meta-
data modeling and repositories, graphical user interface etc.
Prof. S.K. Pandey, I.T.S, Ghaziabad 4
Data Warehousing Components
Data Warehousing Components
 The data warehouse architecture is based on a relational
database management system server that functions as
the central repository for informational data. Operational
data and processing is completely separated from data
warehouse processing. This central information
repository is surrounded by a number of key
components designed to make the entire environment
functional, manageable and accessible by both the
operational systems that source data into the warehouse
and by end-user query and analysis tools.
Prof. S.K. Pandey, I.T.S, Ghaziabad 5
Components of Data Warehouse continued…
 There are following seven components of a Data
Warehouse:
– Data Warehouse Database
– Sourcing, Acquisition, Cleanup and Transformation Tools
– Meta Data
– Access (Query) Tools
The query tool allows executives and other users real-time access to the
Data Warehouse database for query generation, result displays, reports and
data exports
– Data Marts
– Data Warehouse Administration and Management
– Information Delivery System
Prof. S.K. Pandey, I.T.S, Ghaziabad 6
Prof. S.K. Pandey, I.T.S, Ghaziabad 7
Components & Framework
1. Data Warehouse Database
The central data warehouse database is the cornerstone of the data warehousing
environment. Certain data warehouse attributes, such as very large database
size, ad hoc query processing and the need for flexible user view creation
including aggregates, multi-table joins and drill-downs, have become drivers for
different technological approaches to the data warehouse database. These
approaches include:
– Parallel relational database designs for scalability that include shared-memory,
shared disk, or shared-nothing models implemented on various multiprocessor
configurations (symmetric multiprocessors or SMP, massively parallel processors or
MPP, and/or clusters of uni- or multiprocessors).
– An innovative approach to speed up a traditional RDBMS by using new index
structures to bypass relational table scans.
– Multidimensional databases (MDDBs) that are based on proprietary database
technology. Multi-dimensional databases are designed to overcome any limitations
placed on the warehouse by the nature of the relational data model. MDDBs enable
on-line analytical processing (OLAP) tools that architecturally belong to a group of
data warehousing components jointly categorized as the data query, reporting,
analysis and mining tools.
Prof. S.K. Pandey, I.T.S, Ghaziabad 8
2. Sourcing, Acquisition, Cleanup and
Transformation Tools
The data sourcing, cleanup, transformation and migration tools
perform all of the conversions, summarizations, key changes,
structural changes and condensations needed to transform disparate
data into information that can be used by the decision support tool.
They produce the programs and control statements, including the
COBOL programs, MVS job-control language (JCL), UNIX
scripts, and SQL data definition language (DDL) needed to move
data into the data warehouse for multiple operational systems.
These tools also maintain the meta data. The functionality includes:
– Removing unwanted data from operational databases
– Converting to common data names and definitions
– Establishing defaults for missing data
– Accommodating source data definition changes
Prof. S.K. Pandey, I.T.S, Ghaziabad 9
Prof. S.K. Pandey, I.T.S, Ghaziabad 10
ETL Tools
 ETL tools are the equivalent of schema mappings in virtual
integration, but are more powerful
 Some of the Well Known ETL Tools
The most well known commercial tools are Ab Initio, IBM InfoSphere
DataStage, Informatica, Oracle Data Integrator and SAP Data Integrator.
There are several open source ETL tools, among others:
Apatar, CloverETL, Pentaho and Talend.
 Arbitrary pieces of code to take data from a source, convert it into
data for the warehouse:
– Import filters – read and convert from data sources
– Data Transformations – join, aggregate, filter, convert data
– De-duplication – finds multiple records referring to the same entity,
merges them
– Profiling – builds tables, histograms, etc. to summarize data
– Quality management – test against master values, known business rules,
constraints, etc.
3. Meta Data
Meta data is data about data that describes the data
warehouse. It is used for building, maintaining,
managing and using the data warehouse. Meta data
can be classified into:
– Technical meta data, which contains information about
warehouse data for use by warehouse designers and
administrators when carrying out warehouse development
and management tasks.
– Business meta data, which contains information that gives
users an easy-to-understand perspective of the information
stored in the data warehouse.
Prof. S.K. Pandey, I.T.S, Ghaziabad 11
4. Access (Query) Tools
Query and Reporting tools can be divided into two groups:
– Reporting Tools and Managed Query Tools
– Reporting tools can be further divided into production
reporting tools and report writers.
 Production reporting tools let companies generate regular
operational reports or support high-volume batch jobs such as
calculating and printing paychecks.
 Report writers, on the other hand, are inexpensive desktop tools
designed for end-users.
– Managed query tools shield end users from the complexities
of SQL and database structures by inserting a meta-layer
between users and the database. These tools are designed for
easy-to-use, point-and-click operations that either accept SQL
or generate SQL database queries.
Prof. S.K. Pandey, I.T.S, Ghaziabad 12
5. Data Mart
 the term data mart means different things to
different people. A rigorous definition of this term
is a data store that is subsidiary to a data warehouse
of integrated data. The data mart is directed at a
partition of data (often called a subject area) that is
created for the use of a dedicated group of users.
These could be classified in two categories:
– Dependent Data Marts
– Independent Data Marts
Prof. S.K. Pandey, I.T.S, Ghaziabad 13
Prof. S.K. Pandey, I.T.S, Ghaziabad 14
Dependent Data Marts: These types of data marts, data is
sourced from the data warehouse, have a high value because no
matter how they are deployed and how many different enabling
technologies are used, different users are all accessing the
information views derived from the single integrated version of
the data.
Independent Data Marts: Unfortunately, the misleading
statements about the simplicity and low cost of data marts
sometimes result in organizations or vendors incorrectly
positioning them as an alternative to the data warehouse. This
viewpoint defines independent data marts that in fact, represent
fragmented point solutions to a range of business problems in the
enterprise. This type of implementation should be rarely deployed
in the context of an overall technology or applications
architecture. Indeed, it is missing the ingredient that is at the
heart of the data warehousing concept -- that of data integration.
Prof. S.K. Pandey, I.T.S, Ghaziabad 15
6. Data Warehouse Administration and Management
Managing data warehouses includes:
1. Security and priority management
2. Monitoring updates from the multiple sources
3. Data quality checks
4. Managing and updating meta data
5. Auditing and reporting data warehouse usage
and status
6. Purging data
7. Replicating, sub-setting and distributing data
8. Backup and Recovery and
9. Data warehouse storage management.
15
Prof. S.K. Pandey, I.T.S, Ghaziabad 16Prof. S.K. Pandey, I.T.S, Ghaziabad 16
7. Information Delivery System
• The information delivery component is used to enable the process of
subscribing for data warehouse information and having it delivered to one or
more destinations according to some user-specified scheduling algorithm.
• In other words, the information delivery system distributes warehouse-stored
data and other information objects to other data warehouses and end-user
products such as spreadsheets and local databases.
•Delivery of information may be based on time of day or on the completion of
an external event.
•The rationale for the delivery systems component is based on the fact that
once the data warehouse is installed and operational, its users don't have to be
aware of its location and maintenance.
16
Prof. S.K. Pandey, I.T.S, Ghaziabad 17
Building a Data Warehouse
Why a Data Warehouse Application – Business Perspectives
There are several reasons why organizations consider Data
Warehousing a critical need. From a business prospective, to
strive and succeed in today’s highly competetive global
environment, business users demand business answers mainly
because:
• Decisions need to be made quickly and correctly, using all available
data
• Users are business domain experts, not computer professionals
• The amount of data increasing in the data stores, which affects
response time and the sheer ability to comprehend its content.
• Competitions is heating up in the areas of business intelligence and
added information value.
Prof. S.K. Pandey, I.T.S, Ghaziabad 18Prof. S.K. Pandey, I.T.S, Ghaziabad 18
Building a Data Warehouse
Why a Data Warehouse Application – Technology Perspectives
• There are several technology reasons also for existence of Data
Warehousing.
• First, the Data Warehouse is designed to address the incompatibility of
informational and operational transactional systems. These two classes of
information systems are designed to satisfy different , often incompatible,
requirements.
• Secondly, the IT infrastructure is changing rapidly, and its capabilities are
increasing, as evidenced by the following:
• The prices of MIPS continues to decline, while the power of processors
doubles every 2 years
• The prices of digital storage is rapidly dropping
• Network bandwidth is increasing, while the price of high bandwidth is
decreasing
• The workplace is increasingly heterogeneous with respect to both the
hardware and software
• Legacy systems need to, and can, be integrated with new applications
Prof. S.K. Pandey, I.T.S, Ghaziabad 19Prof. S.K. Pandey, I.T.S, Ghaziabad 19Prof. S.K. Pandey, I.T.S, Ghaziabad 19
Building a Data Warehouse
1. Business Considerations (Return on Investment)
2. Design Considerations
3. Technical Considerations
4. Implementation Considerations
5. Integrated Solutions
6. Benefits of Data Warehousing
Prof. S.K. Pandey, I.T.S, Ghaziabad 20Prof. S.K. Pandey, I.T.S, Ghaziabad 20Prof. S.K. Pandey, I.T.S, Ghaziabad 20Prof. S.K. Pandey, I.T.S, Ghaziabad 20
Building a Data Warehouse Contd..
1. Business Considerations (Return on Investment)
1. Approach
• The Top-down Approach, meaning that an organization has
developed an enterprise data model, collected enterprise-wide business
requirements, and decided to build an enterprise data warehouse with
subset data marts.
• The Bottom-up Approach, implying that the business priorities
resulted in developing individual data marts, which are then integrated
into enterprise data warehouse.
2. Organizational Issues
A Data Warehouse, in general, is not truly a technological issue, rather, it
should be more concerned with identifying and establishing information
requirements, the data sources to fulfill these requirements, and timeliness.
Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21
Building a Data Warehouse Contd..
2. Design Consideration
To be a successful, a data warehouse designer must take a
holistic approach – consider all data warehouse components as
parts of a single complex system and take into the account all
possible data stores and all known usage requirements. Failing
to do so may easily result in a data warehouse design that is
skewed toward a particular business requirement, a particular
data sources, or a selected access tool. This is also one of the
reasons why a data warehouse is rather difficult to build. The
main factors include:
• Heterogeneity of Data sources, which affects data conversion,
quality, timeliness
• Use of historical data, while implies that data may be “old”.
• Tendency of databases to grow very large
Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22
Building a Data Warehouse Contd..
2. Design Consideration - In addition to the general considerations,
there are several specific points relevant to the data warehouse
design:
• Data Content
• Metadata
• Data Distribution
One of the biggest challenge when designing a data warehouse is the data
placement and distribution strategy.
• Tools
These tools provide facilities for defining the transformation and cleanup
rules, data movement (from operational sources to the warehouses, end-user
query, reporting, and data analysis.
• Performance consideration
Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23
Building a Data Warehouse Contd..
3. Technical Considerations
A number of technical issues are to be considered when
designing and implementing a Data Warehouse environment.
1. The Hardware Platform that would house the Data Warehouse for
parallel query scalability. (Uni-Processor, Multi-processor, etc)
2. The DBMS that supports the warehouse database
3. The communication infrastructure that connects the warehouse, data
marts, operational systems, and end users
4. The hardware platform and software to support the metadata
repository
5. The systems management framework that enables centralized
management and administration to the entire environment.
Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24
Building a Data Warehouse Contd..
4. Implementation Considerations
i. Access Tools
Currently no single tool in the market can handle all possible data warehouse
access needs. Therefore, most implementations rely on a suite of tools.
Examples of Access types include:
a. Simple Tabular for reporting
b. Ranking
c. Multi-variable Analysis
d. Time Series Analysis
e. Data Visualization, Graphing, Charting and pivoting
f. Complex Textual Search
g. Statistical Analysis
h. AI Techniques for testing of hypothesis, trends discovery, definition,
validation of Data Clusters and segments
i. Information Mapping (i.e. mapping of Spatial Data in geographic information systems)
j. Ad-hoc User Specified Queries
k. Pre-defined repeatable queries
l. Interactive drill-down reporting and analysis
m. Complex queries with multiple joins, multi-level subquesries, and sophisticated
search criteria.
Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 2525
Building a Data Warehouse Contd..
4. Implementation Considerations
ii. Data Extraction, Cleanup, Transformation, and Migration
As a components of the Data Warehouse architecture, proper attention must be given to
Data Extraction, which represents a critical success factor for a data warehouse
architecture.
1. The ability to identify data in the data source environments that can be read by
conversion tool is important. This additional step may affect the timeliness of data
delivery to the warehouse.
2. Support for the flat files. (VSAM, ISM, IDMS) is critical, since bulk of the corporate
data is still maintained in this type of data storage.
3. The capability to merge data from multiple data stores is required in many
installations.
4. The specification interface to indicate the data to extracted and the conversion criteria
is important.
5. The ability to read information from data dictionaries or import information from
repository product is desired.
6. The ability to perform data-type and character-set translation is a requirement when
moving data moving between incompatible systems.
7. The capability to create summarization, aggregation, and derivation records and fields
is very important.
26Prof. S.K. Pandey, I.T.S, Ghaziabad 2626262626262626
Building a Data Warehouse Contd..
4. Implementation Considerations
iii. Data Placement Strategies
As Data Warehouse grows, there are at least two options for Data
Placement. One is to put some of the data in the data warehouse
into another storage media (WORM, RAID). Second option is to
distribute data in data warehouse across multiple servers. Some
criteria must be established for dividing it over the servers – by
geography, organization unit, time, function, etc. However, the
data is divided, a single source of meta data across the entire
organization is required. Hence this configuration requires both
corporation-wide and the meta data managed for any given server.
Prof. S.K. Pandey, I.T.S, Ghaziabad 27272727272727272727
Building a Data Warehouse Contd..
4. Implementation Considerations
iv. Metadata
A frequently occurring problem in Data Warehouse is the
problem of communicating to the end user what
information resides in the data warehouse and how it can be
accessed. The key to providing users and applications with
a roadmap to the information stored in the warehouse is the
metadata. It can define all data elements and their
attributes, data sources and timing, and the rules that govern
data use and data transformations. Meta data needs to be
collected as the warehouse is designed and built.
Prof. S.K. Pandey, I.T.S, Ghaziabad 2828Prof. S.K. Pandey, I.T.S, Ghaziabad 2828282828282828
Building a Data Warehouse Contd..
4. Implementation Considerations
v. User Sophistication Levels
Data Warehousing is relatively new phenomenon, and a certain
degree of sophistication is required on the end user’s part to
effectively use the warehouse. The users can be classified on the
basis of their skill level in accessing the warehouse:
1. Casual Users: These users are most comfortable retrieving information
from the warehouse in pre-defined formats, and running preexisting queries and
reports.
2. Power Users: In their delay activities, these users typically combine
predefined queries with some relatively simple and ad-hoc queries that they
create themselves. These users need access tools that combine the simplicity of
pre-defined queries and reports with a certain degree of flexibility.
3. Experts: These users tend to create their own queries and perform
sophisticated analysis on the information they retrieve from the warehouse.
These users know the data, tools and database well enough to demand tools that
allow for maximum flexibility and adaptability.
29292929292929
Benefits of Data Warehouse
Successfully implemented data warehousing can realize some significance
benefits which can be categorized in two categories:
1. Tangible Benefits:
1. Product inventory turnover is improved
2. Costs of product introduction are decreased with improved selection of
target markets.
3. More cost effective decision making is enabled by separating (ad-hoc) query
processing from running against operational database.
4. Better business intelligence is enabled by increased quality and flexibility of
market analysis available through multi-level data structures, which may range
from detailed to highly summarized.
2. Intangible Benefits:
1. Improved productivity
2. Reduced redundant processing, support, and software to support
overlapping decision support applications
3. Enhanced Customer relations through improved knowledge of individual
requirements and trends, through customization, improved communications,
and tailored product offerings.
4. Enabling business process reengineering – data warehousing can provide
useful insights into work process themselves,
Prof. S.K. Pandey, I.T.S, Ghaziabad
Prof. S.K. Pandey, I.T.S, Ghaziabad 30
Warehouse Database
 The organizations that embarked on data warehousing
development deal with ever increasing amounts of data. Generally
speaking, the size of a data warehouse rapidly approaches the point
where the search for better performance and scalability becomes a
real necessity. This search aims to pursue two goals:
– Speed-up: the ability to execute the same request on the same
amount data in less time
– Scale-up: the ability to obtain the same performance on the
same request as the database size increases.
An additional and important goal is to achieve linear speed-up and scale-up,
doubling the number of processors cuts the response time in half (linear
speed-up) or provides the same performance on twice as much data (linear
scale-up).
Prof. S.K. Pandey, I.T.S, Ghaziabad 31
Mapping the Data Warehouse to a
Multiprocessor Architecture
 The goals of linear performance and scalability (discussed in
previous slide) can be satisfied by parallel hardware
architectures, parallel operating systems, and parallel DBMSs.
Parallel hardware architectures are based on Multi-processor
systems designed as a Shared-memory model (symmetric
multiprocessors), Shared-disk model or distributed-memory
model (MPP and Clusters of SMPs). Parallelism can be achieved
in two different ways:
– Horizontal Parallelism (Database is partitioned across different disks)
– Vertical Parallelism (occurs among different tasks – all components query
operations i.e. scans, join, sort)
– Data Partitioning
Prof. S.K. Pandey, I.T.S, Ghaziabad 32Prof. S.K. Pandey, I.T.S, Ghaziabad 32
Database Architectures for Parallel Processing
 Shared-memory Architecture
 Shared Disk Architecture
 Shared-nothing Architecture
 Combined Architecture
Prof. S.K. Pandey, I.T.S, Ghaziabad 33Prof. S.K. Pandey, I.T.S, Ghaziabad 33Prof. S.K. Pandey, I.T.S, Ghaziabad 33
Parallel RDBMS Features
 Data Warehouse development requires a good understanding of all
architectural components, including the data warehouse DBMS
Platform. Understanding the basic architecture of Warehouse
database is the first step in evaluating and selecting a product.
 State of the art parallel features the developers and users of the
Warehouse should demand from the DBMS vendor:
 Scope and techniques of Parallel DBMS
 Queries (Insert/ Update/Delete)
 DBMS that supports parallel database load, backup, reorganization
and recovery is much better positioned for VLDBs.
 Optimizer Implementation
 Application Transparency
 The Parallel environment
 DBMS Management Tools
 Price/ Performance
Prof. S.K. Pandey, I.T.S, Ghaziabad 34Prof. S.K. Pandey, I.T.S, Ghaziabad 34Prof. S.K. Pandey, I.T.S, Ghaziabad 34Prof. S.K. Pandey, I.T.S, Ghaziabad 34
Parallel DBMS Vendors
 ORACLE – Oracle supports Parallel Database processing with its add-on
Oracle Parallel Server Option (OPS) and Parallel Query Option (PQO) with
Query Coordinator.
 Informix – Informix developed its Dynamic Scalable Architecture (DSA) to
support Shared-Memory, Shared-Disk, and Shared-Nothing Models. Informix
OnLine release 8, also known as XPS (eXtended Parallel Server), supports MPP
Hardware platforms that include IBM, SP, AT & T, Sun, HP, ICL Goldrush, with
sequent, Siemens, Pyramid etc.
 IBM – DB2 Parallel Edition (DB2 PE), a Database based on DB2/6000 Server
Architecture; latest version is DB2 Universal Database.
 Sybase – Sybase implemented its parallel DBMS functionality in a product
called SYBASE MPP (formerly Navigational Server). It was jointly developed by
Sybase and NCR (formerly AT&T GIS), and its first release was targeted for the
AT&T 3400, 3500 (both SMP) and 3600 (MPP) Platforms.
 Other RDBMS Products i. NCR Teradata ii. Tandem NonStop SQL/MP
 Specialized Database Products - i. Red Brick Systems
ii. White Cross Systems Inc.
Prof. S.K. Pandey, I.T.S, Ghaziabad 35
DBMS Schemas for Decision
Support
 Data Warehousing projects were forced to choose
between a data model and a corresponding database
schema that is intuitive for analysis but performs poorly
and a model-schema that performs better but is not well
suited for analysis.
 As Data Warehousing continued to mature, new
approaches to schema design resulted in schemas better
suited to business analysis that is so crucial to
successful data warehousing.
 The schema methodology that is gaining widespread
acceptance for Data Warehousing is the Star Schema.
Prof. S.K. Pandey, I.T.S, Ghaziabad 36Prof. S.K. Pandey, I.T.S, Ghaziabad 36
Data Layout for best Access
 The original objective in developing an abstract model known as
Relational Model were to address a number of shortcomings of
non-relational DBMS and application development.
 The typical requirements for the RDBMS supporting operational
systems are based on the need to effectively support a large
number of small but simultaneous read and write requests.
 The demand placed on the RDBMS by a Data Warehouse are very
different. A data warehouse RDBMS typically needs to process
queries that are large, complex, ad-hoc and data intensive.
 Solving modern business problems such as market analysis and
financial forecasting requires query-centric database schemas that
are array-oriented and multi-dimensional in nature.
Prof. S.K. Pandey, I.T.S, Ghaziabad 37Prof. S.K. Pandey, I.T.S, Ghaziabad 37
Multi-dimensional Data Model
 The Multi-dimensional nature of business questions
is reflected in the fact that, for example, marketing
managers are no longer satisfied by asking simple
one-dimensional questions such as “How much
revenue did the new product generate by month, in
northeastern division, broken down by user
demographic, by sales office, relative to the previous
version of the product, compared with the plan?” – a
six dimensional question.
Prof. S.K. Pandey, I.T.S, Ghaziabad 38Prof. S.K. Pandey, I.T.S, Ghaziabad 38Prof. S.K. Pandey, I.T.S, Ghaziabad 38
STAR SCHEMA
 The Multi-dimensional view of Data that is expressed using
relational database semantics is provided by the database
schema design called Star Schema.
 The basic premise of Star Schema is that information can be
classified into two groups: facts and dimensions.
 Facts are the core Data element being analyzed. For example,
units of individual items sold are facts.
 Dimensions are attributes about the facts. For example,
dimensions are the product types purchased and date of
purchase.
Prof. S.K. Pandey, I.T.S, Ghaziabad 39
Data Extraction, Cleanup &
Transformation Tools
 The task of capturing data from a source data system,
cleaning and transforming it and then loading the results into
a target data system can be carried out either by separate
products, or by a single integrated solution. More
contemporary integrated solutions can fall into one of the
categories described below:
– Code Generators
– Database data Replications
– Rule-driven Dynamic Transformation Engines (Data Mart
Builders)
Prof. S.K. Pandey, I.T.S, Ghaziabad 40Prof. S.K. Pandey, I.T.S, Ghaziabad 40
Code Generator
– It creates 3GL/4GL transformation programs based on source
and target data definitions, and data transformation and
enhancement rules defined by the developer.
– This approach reduces the need for an organization to write
its own data capture, transformation, and load programs.
These products employ DML Statements to capture a set of
the data from source system.
– These are used for data conversion projects, and for building
an enterprise-wide data warehouse, when there is a significant
amount of data transformation to be done involving a variety
of different flat files, non-relational, and relational data
sources.
Prof. S.K. Pandey, I.T.S, Ghaziabad 41Prof. S.K. Pandey, I.T.S, Ghaziabad 41Prof. S.K. Pandey, I.T.S, Ghaziabad 41
Database Data Replication Tools
– These tools employ database triggers or a recovery log to
capture changes to a single data source on one system and
apply the changes to a copy of the data source data located on
a different system.
– Most replication products do not support the capture of
changes to non-relational files and databases, and often do not
provide facilities for significant data transformation and
enhancement.
– These point-to-point tools are used for disaster recovery and
to build an operational data store, a data warehouse, or a data
mart when the number of data sources involved are small and
a limited amount of data transformation and enhancement is
required.
Prof. S.K. Pandey, I.T.S, Ghaziabad 42Prof. S.K. Pandey, I.T.S, Ghaziabad 42Prof. S.K. Pandey, I.T.S, Ghaziabad 42Prof. S.K. Pandey, I.T.S, Ghaziabad 42
Rule-driven Dynamic Transformation
Engines
– They are also known as Data Mart Builders and capture data from a
source system at User-defined intervals, transform data, and then send
and load the results into a target environment, typically a data mart.
– To date most of the products of this category support only relational
data sources, though now this trend have started changing.
– Data to be captured from source system is usually defined using query
language statements, and data transformation and enhancement is
done on a script or a function logic defined to the tool.
– With most tools in this category, data flows from source systems to
target systems through one or more servers, which perform the data
transformation and enhancement. These transformation servers can
usually be controlled from a single location, making the job of such
environment much easier.

More Related Content

What's hot

Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
janani thirupathi
 
Data warehouse
Data warehouse Data warehouse
Data warehouse
Yogendra Uikey
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
Seerat Malik
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
Motaz Saad
 
Data Mining
Data MiningData Mining
Data Mining
SHIKHA GAUTAM
 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guide
thomasmary607
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
Lovely Professional University
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
Dr. Abdul Ahad Abro
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
mekuanint sefi
 
Data analytics vs. Data analysis
Data analytics vs. Data analysisData analytics vs. Data analysis
Data analytics vs. Data analysis
Dr. C.V. Suresh Babu
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 
Data warehousing
Data warehousingData warehousing
Data warehousing
Anshika Nigam
 
1.4 data warehouse
1.4 data warehouse1.4 data warehouse
1.4 data warehouse
Krish_ver2
 
Business Analytics
 Business Analytics  Business Analytics
Business Analytics
ICFAI Business School
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
idnats
 
Distributed database
Distributed databaseDistributed database
Distributed database
ReachLocal Services India
 
Advanced DBMS presentation
Advanced DBMS presentationAdvanced DBMS presentation
Advanced DBMS presentation
Hindustan Petroleum
 
Ppt
PptPpt
Data mining notes
Data mining notesData mining notes
Data mining notes
AVC College of Engineering
 

What's hot (20)

Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
 
Data warehouse
Data warehouse Data warehouse
Data warehouse
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining
Data MiningData Mining
Data Mining
 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guide
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 
Data analytics vs. Data analysis
Data analytics vs. Data analysisData analytics vs. Data analysis
Data analytics vs. Data analysis
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
1.4 data warehouse
1.4 data warehouse1.4 data warehouse
1.4 data warehouse
 
Business Analytics
 Business Analytics  Business Analytics
Business Analytics
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Advanced DBMS presentation
Advanced DBMS presentationAdvanced DBMS presentation
Advanced DBMS presentation
 
Ppt
PptPpt
Ppt
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 

Viewers also liked

Components of a Data-Warehouse
Components of a Data-WarehouseComponents of a Data-Warehouse
Components of a Data-Warehouse
Abdul Aslam
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
Jason S
 
Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)
Bikramjit Sarkar, Ph.D.
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
kiran14360
 
Data warehousing
Data warehousingData warehousing
Data warehousing
Varun Jain
 
Improve the Health of Your Data
Improve the Health of Your DataImprove the Health of Your Data
Improve the Health of Your Data
RTTS
 
Query Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programmingQuery Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programming
RTTS
 
introduction to data warehousing and mining
 introduction to data warehousing and mining introduction to data warehousing and mining
introduction to data warehousing and mining
Rajesh Chandra
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16
Dhilsath Fathima
 
An introduction to data warehousing
An introduction to data warehousingAn introduction to data warehousing
An introduction to data warehousing
Shahed Khalili
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
Gurpreet Singh Sachdeva
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
Eyad Manna
 
QuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solutionQuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solution
RTTS
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
Samir Sabry
 
Leveraging HPE ALM & QuerySurge to test HPE Vertica
Leveraging HPE ALM & QuerySurge to test HPE VerticaLeveraging HPE ALM & QuerySurge to test HPE Vertica
Leveraging HPE ALM & QuerySurge to test HPE Vertica
RTTS
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
RTTS
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
RTTS
 
Difference between star schema and snowflake schema
Difference between star schema and snowflake schemaDifference between star schema and snowflake schema
Difference between star schema and snowflake schema
Umar Ali
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
RTTS
 

Viewers also liked (20)

Components of a Data-Warehouse
Components of a Data-WarehouseComponents of a Data-Warehouse
Components of a Data-Warehouse
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)Data Mining and Data Warehousing (MAKAUT)
Data Mining and Data Warehousing (MAKAUT)
 
introduction to datawarehouse
introduction to datawarehouseintroduction to datawarehouse
introduction to datawarehouse
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Improve the Health of Your Data
Improve the Health of Your DataImprove the Health of Your Data
Improve the Health of Your Data
 
Query Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programmingQuery Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programming
 
introduction to data warehousing and mining
 introduction to data warehousing and mining introduction to data warehousing and mining
introduction to data warehousing and mining
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16
 
An introduction to data warehousing
An introduction to data warehousingAn introduction to data warehousing
An introduction to data warehousing
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
QuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solutionQuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solution
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Leveraging HPE ALM & QuerySurge to test HPE Vertica
Leveraging HPE ALM & QuerySurge to test HPE VerticaLeveraging HPE ALM & QuerySurge to test HPE Vertica
Leveraging HPE ALM & QuerySurge to test HPE Vertica
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
 
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
Data Warehousing in Pharma: How to Find Bad Data while Meeting Regulatory Req...
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
Difference between star schema and snowflake schema
Difference between star schema and snowflake schemaDifference between star schema and snowflake schema
Difference between star schema and snowflake schema
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
 

Similar to Data Warehousing & Basic Architectural Framework

Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
Dhilsath Fathima
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
Dr. Sunil Kr. Pandey
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
SOMASUNDARAM T
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdf
Datavalley.ai
 
Unit 5
Unit 5 Unit 5
Unit 5
sasisanjeev
 
Data warehouse
Data warehouseData warehouse
Data warehouse
RajThakuri
 
Advanced Database System
Advanced Database SystemAdvanced Database System
Advanced Database System
sushmita rathour
 
Dwbasics
DwbasicsDwbasics
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
Dr. Sunil Kr. Pandey
 
Data Mining
Data MiningData Mining
Data Mining
ksanthosh
 
DMDW 1st module.pdf
DMDW 1st module.pdfDMDW 1st module.pdf
DMDW 1st module.pdf
ShreyaBharadwaj7
 
Unit Ii
Unit IiUnit Ii
Data Warehouse
Data Warehouse Data Warehouse
Data Warehouse
MadhuriNigam1
 
Unit III Introduction to DWH.pdf
Unit III Introduction to DWH.pdfUnit III Introduction to DWH.pdf
Unit III Introduction to DWH.pdf
ShivarkarSandip
 
UNIT 1 DWDM.pdf
UNIT 1 DWDM.pdfUNIT 1 DWDM.pdf
UNIT 1 DWDM.pdf
Pabitha Chidambaram
 
DW 101
DW 101DW 101
DW 101
jeffd00
 
20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt
PalaniKumarR2
 
20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt
SumathiG8
 
J0212065068
J0212065068J0212065068
J0212065068
researchinventy
 
Unit 1
Unit 1Unit 1
Unit 1
DrPrabu M
 

Similar to Data Warehousing & Basic Architectural Framework (20)

Dwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousingDwdm unit 1-2016-Data ingarehousing
Dwdm unit 1-2016-Data ingarehousing
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdf
 
Unit 5
Unit 5 Unit 5
Unit 5
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Advanced Database System
Advanced Database SystemAdvanced Database System
Advanced Database System
 
Dwbasics
DwbasicsDwbasics
Dwbasics
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Data Mining
Data MiningData Mining
Data Mining
 
DMDW 1st module.pdf
DMDW 1st module.pdfDMDW 1st module.pdf
DMDW 1st module.pdf
 
Unit Ii
Unit IiUnit Ii
Unit Ii
 
Data Warehouse
Data Warehouse Data Warehouse
Data Warehouse
 
Unit III Introduction to DWH.pdf
Unit III Introduction to DWH.pdfUnit III Introduction to DWH.pdf
Unit III Introduction to DWH.pdf
 
UNIT 1 DWDM.pdf
UNIT 1 DWDM.pdfUNIT 1 DWDM.pdf
UNIT 1 DWDM.pdf
 
DW 101
DW 101DW 101
DW 101
 
20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt
 
20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt20IT501_DWDM_PPT_Unit_I.ppt
20IT501_DWDM_PPT_Unit_I.ppt
 
J0212065068
J0212065068J0212065068
J0212065068
 
Unit 1
Unit 1Unit 1
Unit 1
 

More from Dr. Sunil Kr. Pandey

Cloud Security, Standards and Applications
Cloud Security, Standards and ApplicationsCloud Security, Standards and Applications
Cloud Security, Standards and Applications
Dr. Sunil Kr. Pandey
 
Virtualization for Cloud Environment
Virtualization for Cloud EnvironmentVirtualization for Cloud Environment
Virtualization for Cloud Environment
Dr. Sunil Kr. Pandey
 
Collaborating Using Cloud Services
Collaborating Using Cloud ServicesCollaborating Using Cloud Services
Collaborating Using Cloud Services
Dr. Sunil Kr. Pandey
 
Cloud Services: Types of Cloud
Cloud Services: Types of CloudCloud Services: Types of Cloud
Cloud Services: Types of Cloud
Dr. Sunil Kr. Pandey
 
Cloud Computing - Introduction
Cloud Computing - IntroductionCloud Computing - Introduction
Cloud Computing - Introduction
Dr. Sunil Kr. Pandey
 
Future Skills & Career Opportunities in POST COVID-19
Future Skills & Career Opportunities in POST COVID-19Future Skills & Career Opportunities in POST COVID-19
Future Skills & Career Opportunities in POST COVID-19
Dr. Sunil Kr. Pandey
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Dr. Sunil Kr. Pandey
 
Digital India: Use of Technology For Transforming Society
Digital India: Use of Technology For Transforming SocietyDigital India: Use of Technology For Transforming Society
Digital India: Use of Technology For Transforming Society
Dr. Sunil Kr. Pandey
 
Mobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future DirectionsMobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future Directions
Dr. Sunil Kr. Pandey
 
Mobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future DirectionsMobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future Directions
Dr. Sunil Kr. Pandey
 
Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...
Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...
Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...
Dr. Sunil Kr. Pandey
 
Digital India MIssion - An oveview
Digital India MIssion - An oveviewDigital India MIssion - An oveview
Digital India MIssion - An oveview
Dr. Sunil Kr. Pandey
 
Business Analysis, Query Tools, Dm unit-3
Business Analysis, Query Tools, Dm unit-3Business Analysis, Query Tools, Dm unit-3
Business Analysis, Query Tools, Dm unit-3
Dr. Sunil Kr. Pandey
 

More from Dr. Sunil Kr. Pandey (13)

Cloud Security, Standards and Applications
Cloud Security, Standards and ApplicationsCloud Security, Standards and Applications
Cloud Security, Standards and Applications
 
Virtualization for Cloud Environment
Virtualization for Cloud EnvironmentVirtualization for Cloud Environment
Virtualization for Cloud Environment
 
Collaborating Using Cloud Services
Collaborating Using Cloud ServicesCollaborating Using Cloud Services
Collaborating Using Cloud Services
 
Cloud Services: Types of Cloud
Cloud Services: Types of CloudCloud Services: Types of Cloud
Cloud Services: Types of Cloud
 
Cloud Computing - Introduction
Cloud Computing - IntroductionCloud Computing - Introduction
Cloud Computing - Introduction
 
Future Skills & Career Opportunities in POST COVID-19
Future Skills & Career Opportunities in POST COVID-19Future Skills & Career Opportunities in POST COVID-19
Future Skills & Career Opportunities in POST COVID-19
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Digital India: Use of Technology For Transforming Society
Digital India: Use of Technology For Transforming SocietyDigital India: Use of Technology For Transforming Society
Digital India: Use of Technology For Transforming Society
 
Mobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future DirectionsMobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future Directions
 
Mobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future DirectionsMobile Technology – Historical Evolution, Present Status & Future Directions
Mobile Technology – Historical Evolution, Present Status & Future Directions
 
Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...
Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...
Green Commputing - Paradigm Shift in Computing Technology, ICT & its Applicat...
 
Digital India MIssion - An oveview
Digital India MIssion - An oveviewDigital India MIssion - An oveview
Digital India MIssion - An oveview
 
Business Analysis, Query Tools, Dm unit-3
Business Analysis, Query Tools, Dm unit-3Business Analysis, Query Tools, Dm unit-3
Business Analysis, Query Tools, Dm unit-3
 

Recently uploaded

Slides Peluncuran Amalan Pemakanan Sihat.pptx
Slides Peluncuran Amalan Pemakanan Sihat.pptxSlides Peluncuran Amalan Pemakanan Sihat.pptx
Slides Peluncuran Amalan Pemakanan Sihat.pptx
shabeluno
 
What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17
Celine George
 
220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology
Kalna College
 
Accounting for Restricted Grants When and How To Record Properly
Accounting for Restricted Grants  When and How To Record ProperlyAccounting for Restricted Grants  When and How To Record Properly
Accounting for Restricted Grants When and How To Record Properly
TechSoup
 
The basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptxThe basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptx
heathfieldcps1
 
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
yarusun
 
Talking Tech through Compelling Visual Aids
Talking Tech through Compelling Visual AidsTalking Tech through Compelling Visual Aids
Talking Tech through Compelling Visual Aids
MattVassar1
 
220711130083 SUBHASHREE RAKSHIT Internet resources for social science
220711130083 SUBHASHREE RAKSHIT  Internet resources for social science220711130083 SUBHASHREE RAKSHIT  Internet resources for social science
220711130083 SUBHASHREE RAKSHIT Internet resources for social science
Kalna College
 
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
biruktesfaye27
 
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
220711130100 udita Chakraborty  Aims and objectives of national policy on inf...220711130100 udita Chakraborty  Aims and objectives of national policy on inf...
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
Kalna College
 
Decolonizing Universal Design for Learning
Decolonizing Universal Design for LearningDecolonizing Universal Design for Learning
Decolonizing Universal Design for Learning
Frederic Fovet
 
(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"
MJDuyan
 
Observational Learning
Observational Learning Observational Learning
Observational Learning
sanamushtaq922
 
A Quiz on Drug Abuse Awareness by Quizzito
A Quiz on Drug Abuse Awareness by QuizzitoA Quiz on Drug Abuse Awareness by Quizzito
A Quiz on Drug Abuse Awareness by Quizzito
Quizzito The Quiz Society of Gargi College
 
Creating Images and Videos through AI.pptx
Creating Images and Videos through AI.pptxCreating Images and Videos through AI.pptx
Creating Images and Videos through AI.pptx
Forum of Blended Learning
 
bryophytes.pptx bsc botany honours second semester
bryophytes.pptx bsc botany honours  second semesterbryophytes.pptx bsc botany honours  second semester
bryophytes.pptx bsc botany honours second semester
Sarojini38
 
220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx
Kalna College
 
Creativity for Innovation and Speechmaking
Creativity for Innovation and SpeechmakingCreativity for Innovation and Speechmaking
Creativity for Innovation and Speechmaking
MattVassar1
 
Information and Communication Technology in Education
Information and Communication Technology in EducationInformation and Communication Technology in Education
Information and Communication Technology in Education
MJDuyan
 
Cross-Cultural Leadership and Communication
Cross-Cultural Leadership and CommunicationCross-Cultural Leadership and Communication
Cross-Cultural Leadership and Communication
MattVassar1
 

Recently uploaded (20)

Slides Peluncuran Amalan Pemakanan Sihat.pptx
Slides Peluncuran Amalan Pemakanan Sihat.pptxSlides Peluncuran Amalan Pemakanan Sihat.pptx
Slides Peluncuran Amalan Pemakanan Sihat.pptx
 
What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17
 
220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology
 
Accounting for Restricted Grants When and How To Record Properly
Accounting for Restricted Grants  When and How To Record ProperlyAccounting for Restricted Grants  When and How To Record Properly
Accounting for Restricted Grants When and How To Record Properly
 
The basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptxThe basics of sentences session 8pptx.pptx
The basics of sentences session 8pptx.pptx
 
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
Get Success with the Latest UiPath UIPATH-ADPV1 Exam Dumps (V11.02) 2024
 
Talking Tech through Compelling Visual Aids
Talking Tech through Compelling Visual AidsTalking Tech through Compelling Visual Aids
Talking Tech through Compelling Visual Aids
 
220711130083 SUBHASHREE RAKSHIT Internet resources for social science
220711130083 SUBHASHREE RAKSHIT  Internet resources for social science220711130083 SUBHASHREE RAKSHIT  Internet resources for social science
220711130083 SUBHASHREE RAKSHIT Internet resources for social science
 
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
Ethiopia and Eritrea Eritrea's journey has been marked by resilience and dete...
 
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
220711130100 udita Chakraborty  Aims and objectives of national policy on inf...220711130100 udita Chakraborty  Aims and objectives of national policy on inf...
220711130100 udita Chakraborty Aims and objectives of national policy on inf...
 
Decolonizing Universal Design for Learning
Decolonizing Universal Design for LearningDecolonizing Universal Design for Learning
Decolonizing Universal Design for Learning
 
(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"
 
Observational Learning
Observational Learning Observational Learning
Observational Learning
 
A Quiz on Drug Abuse Awareness by Quizzito
A Quiz on Drug Abuse Awareness by QuizzitoA Quiz on Drug Abuse Awareness by Quizzito
A Quiz on Drug Abuse Awareness by Quizzito
 
Creating Images and Videos through AI.pptx
Creating Images and Videos through AI.pptxCreating Images and Videos through AI.pptx
Creating Images and Videos through AI.pptx
 
bryophytes.pptx bsc botany honours second semester
bryophytes.pptx bsc botany honours  second semesterbryophytes.pptx bsc botany honours  second semester
bryophytes.pptx bsc botany honours second semester
 
220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx
 
Creativity for Innovation and Speechmaking
Creativity for Innovation and SpeechmakingCreativity for Innovation and Speechmaking
Creativity for Innovation and Speechmaking
 
Information and Communication Technology in Education
Information and Communication Technology in EducationInformation and Communication Technology in Education
Information and Communication Technology in Education
 
Cross-Cultural Leadership and Communication
Cross-Cultural Leadership and CommunicationCross-Cultural Leadership and Communication
Cross-Cultural Leadership and Communication
 

Data Warehousing & Basic Architectural Framework

  • 1. Prof. S. K. Pandey, I.T.S, Ghaziabad Data Warehousing & Mining UNIT – II
  • 2. Prof. S.K. Pandey, I.T.S, Ghaziabad 2 Syllabus of Unit - II  DATA Warehousing  Data Warehousing Components  Building a Data Warehouse  Warehouse Database  Mapping the Data Warehouse to a Multiprocessor Architecture  DBMS Schemas for Decision Support  Data Extraction, Cleanup & Transformation Tools  Metadata.
  • 3. Prof. S.K. Pandey, I.T.S, Ghaziabad 3 Data Warehouse • The Data warehouse is an environment, not a product. • It is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational data store. • Data warehousing is a blend of technologies and components aimed at effective integration of operation database into an environment that enables strategic use of data. • These technologies include relational and multi-dimensional database management system, client/ server architecture, meta- data modeling and repositories, graphical user interface etc.
  • 4. Prof. S.K. Pandey, I.T.S, Ghaziabad 4 Data Warehousing Components
  • 5. Data Warehousing Components  The data warehouse architecture is based on a relational database management system server that functions as the central repository for informational data. Operational data and processing is completely separated from data warehouse processing. This central information repository is surrounded by a number of key components designed to make the entire environment functional, manageable and accessible by both the operational systems that source data into the warehouse and by end-user query and analysis tools. Prof. S.K. Pandey, I.T.S, Ghaziabad 5
  • 6. Components of Data Warehouse continued…  There are following seven components of a Data Warehouse: – Data Warehouse Database – Sourcing, Acquisition, Cleanup and Transformation Tools – Meta Data – Access (Query) Tools The query tool allows executives and other users real-time access to the Data Warehouse database for query generation, result displays, reports and data exports – Data Marts – Data Warehouse Administration and Management – Information Delivery System Prof. S.K. Pandey, I.T.S, Ghaziabad 6
  • 7. Prof. S.K. Pandey, I.T.S, Ghaziabad 7 Components & Framework
  • 8. 1. Data Warehouse Database The central data warehouse database is the cornerstone of the data warehousing environment. Certain data warehouse attributes, such as very large database size, ad hoc query processing and the need for flexible user view creation including aggregates, multi-table joins and drill-downs, have become drivers for different technological approaches to the data warehouse database. These approaches include: – Parallel relational database designs for scalability that include shared-memory, shared disk, or shared-nothing models implemented on various multiprocessor configurations (symmetric multiprocessors or SMP, massively parallel processors or MPP, and/or clusters of uni- or multiprocessors). – An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans. – Multidimensional databases (MDDBs) that are based on proprietary database technology. Multi-dimensional databases are designed to overcome any limitations placed on the warehouse by the nature of the relational data model. MDDBs enable on-line analytical processing (OLAP) tools that architecturally belong to a group of data warehousing components jointly categorized as the data query, reporting, analysis and mining tools. Prof. S.K. Pandey, I.T.S, Ghaziabad 8
  • 9. 2. Sourcing, Acquisition, Cleanup and Transformation Tools The data sourcing, cleanup, transformation and migration tools perform all of the conversions, summarizations, key changes, structural changes and condensations needed to transform disparate data into information that can be used by the decision support tool. They produce the programs and control statements, including the COBOL programs, MVS job-control language (JCL), UNIX scripts, and SQL data definition language (DDL) needed to move data into the data warehouse for multiple operational systems. These tools also maintain the meta data. The functionality includes: – Removing unwanted data from operational databases – Converting to common data names and definitions – Establishing defaults for missing data – Accommodating source data definition changes Prof. S.K. Pandey, I.T.S, Ghaziabad 9
  • 10. Prof. S.K. Pandey, I.T.S, Ghaziabad 10 ETL Tools  ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful  Some of the Well Known ETL Tools The most well known commercial tools are Ab Initio, IBM InfoSphere DataStage, Informatica, Oracle Data Integrator and SAP Data Integrator. There are several open source ETL tools, among others: Apatar, CloverETL, Pentaho and Talend.  Arbitrary pieces of code to take data from a source, convert it into data for the warehouse: – Import filters – read and convert from data sources – Data Transformations – join, aggregate, filter, convert data – De-duplication – finds multiple records referring to the same entity, merges them – Profiling – builds tables, histograms, etc. to summarize data – Quality management – test against master values, known business rules, constraints, etc.
  • 11. 3. Meta Data Meta data is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse. Meta data can be classified into: – Technical meta data, which contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks. – Business meta data, which contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse. Prof. S.K. Pandey, I.T.S, Ghaziabad 11
  • 12. 4. Access (Query) Tools Query and Reporting tools can be divided into two groups: – Reporting Tools and Managed Query Tools – Reporting tools can be further divided into production reporting tools and report writers.  Production reporting tools let companies generate regular operational reports or support high-volume batch jobs such as calculating and printing paychecks.  Report writers, on the other hand, are inexpensive desktop tools designed for end-users. – Managed query tools shield end users from the complexities of SQL and database structures by inserting a meta-layer between users and the database. These tools are designed for easy-to-use, point-and-click operations that either accept SQL or generate SQL database queries. Prof. S.K. Pandey, I.T.S, Ghaziabad 12
  • 13. 5. Data Mart  the term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. These could be classified in two categories: – Dependent Data Marts – Independent Data Marts Prof. S.K. Pandey, I.T.S, Ghaziabad 13
  • 14. Prof. S.K. Pandey, I.T.S, Ghaziabad 14 Dependent Data Marts: These types of data marts, data is sourced from the data warehouse, have a high value because no matter how they are deployed and how many different enabling technologies are used, different users are all accessing the information views derived from the single integrated version of the data. Independent Data Marts: Unfortunately, the misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse. This viewpoint defines independent data marts that in fact, represent fragmented point solutions to a range of business problems in the enterprise. This type of implementation should be rarely deployed in the context of an overall technology or applications architecture. Indeed, it is missing the ingredient that is at the heart of the data warehousing concept -- that of data integration.
  • 15. Prof. S.K. Pandey, I.T.S, Ghaziabad 15 6. Data Warehouse Administration and Management Managing data warehouses includes: 1. Security and priority management 2. Monitoring updates from the multiple sources 3. Data quality checks 4. Managing and updating meta data 5. Auditing and reporting data warehouse usage and status 6. Purging data 7. Replicating, sub-setting and distributing data 8. Backup and Recovery and 9. Data warehouse storage management. 15
  • 16. Prof. S.K. Pandey, I.T.S, Ghaziabad 16Prof. S.K. Pandey, I.T.S, Ghaziabad 16 7. Information Delivery System • The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations according to some user-specified scheduling algorithm. • In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases. •Delivery of information may be based on time of day or on the completion of an external event. •The rationale for the delivery systems component is based on the fact that once the data warehouse is installed and operational, its users don't have to be aware of its location and maintenance. 16
  • 17. Prof. S.K. Pandey, I.T.S, Ghaziabad 17 Building a Data Warehouse Why a Data Warehouse Application – Business Perspectives There are several reasons why organizations consider Data Warehousing a critical need. From a business prospective, to strive and succeed in today’s highly competetive global environment, business users demand business answers mainly because: • Decisions need to be made quickly and correctly, using all available data • Users are business domain experts, not computer professionals • The amount of data increasing in the data stores, which affects response time and the sheer ability to comprehend its content. • Competitions is heating up in the areas of business intelligence and added information value.
  • 18. Prof. S.K. Pandey, I.T.S, Ghaziabad 18Prof. S.K. Pandey, I.T.S, Ghaziabad 18 Building a Data Warehouse Why a Data Warehouse Application – Technology Perspectives • There are several technology reasons also for existence of Data Warehousing. • First, the Data Warehouse is designed to address the incompatibility of informational and operational transactional systems. These two classes of information systems are designed to satisfy different , often incompatible, requirements. • Secondly, the IT infrastructure is changing rapidly, and its capabilities are increasing, as evidenced by the following: • The prices of MIPS continues to decline, while the power of processors doubles every 2 years • The prices of digital storage is rapidly dropping • Network bandwidth is increasing, while the price of high bandwidth is decreasing • The workplace is increasingly heterogeneous with respect to both the hardware and software • Legacy systems need to, and can, be integrated with new applications
  • 19. Prof. S.K. Pandey, I.T.S, Ghaziabad 19Prof. S.K. Pandey, I.T.S, Ghaziabad 19Prof. S.K. Pandey, I.T.S, Ghaziabad 19 Building a Data Warehouse 1. Business Considerations (Return on Investment) 2. Design Considerations 3. Technical Considerations 4. Implementation Considerations 5. Integrated Solutions 6. Benefits of Data Warehousing
  • 20. Prof. S.K. Pandey, I.T.S, Ghaziabad 20Prof. S.K. Pandey, I.T.S, Ghaziabad 20Prof. S.K. Pandey, I.T.S, Ghaziabad 20Prof. S.K. Pandey, I.T.S, Ghaziabad 20 Building a Data Warehouse Contd.. 1. Business Considerations (Return on Investment) 1. Approach • The Top-down Approach, meaning that an organization has developed an enterprise data model, collected enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data marts. • The Bottom-up Approach, implying that the business priorities resulted in developing individual data marts, which are then integrated into enterprise data warehouse. 2. Organizational Issues A Data Warehouse, in general, is not truly a technological issue, rather, it should be more concerned with identifying and establishing information requirements, the data sources to fulfill these requirements, and timeliness.
  • 21. Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21Prof. S.K. Pandey, I.T.S, Ghaziabad 21 Building a Data Warehouse Contd.. 2. Design Consideration To be a successful, a data warehouse designer must take a holistic approach – consider all data warehouse components as parts of a single complex system and take into the account all possible data stores and all known usage requirements. Failing to do so may easily result in a data warehouse design that is skewed toward a particular business requirement, a particular data sources, or a selected access tool. This is also one of the reasons why a data warehouse is rather difficult to build. The main factors include: • Heterogeneity of Data sources, which affects data conversion, quality, timeliness • Use of historical data, while implies that data may be “old”. • Tendency of databases to grow very large
  • 22. Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22Prof. S.K. Pandey, I.T.S, Ghaziabad 22 Building a Data Warehouse Contd.. 2. Design Consideration - In addition to the general considerations, there are several specific points relevant to the data warehouse design: • Data Content • Metadata • Data Distribution One of the biggest challenge when designing a data warehouse is the data placement and distribution strategy. • Tools These tools provide facilities for defining the transformation and cleanup rules, data movement (from operational sources to the warehouses, end-user query, reporting, and data analysis. • Performance consideration
  • 23. Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23Prof. S.K. Pandey, I.T.S, Ghaziabad 23 Building a Data Warehouse Contd.. 3. Technical Considerations A number of technical issues are to be considered when designing and implementing a Data Warehouse environment. 1. The Hardware Platform that would house the Data Warehouse for parallel query scalability. (Uni-Processor, Multi-processor, etc) 2. The DBMS that supports the warehouse database 3. The communication infrastructure that connects the warehouse, data marts, operational systems, and end users 4. The hardware platform and software to support the metadata repository 5. The systems management framework that enables centralized management and administration to the entire environment.
  • 24. Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24Prof. S.K. Pandey, I.T.S, Ghaziabad 24 Building a Data Warehouse Contd.. 4. Implementation Considerations i. Access Tools Currently no single tool in the market can handle all possible data warehouse access needs. Therefore, most implementations rely on a suite of tools. Examples of Access types include: a. Simple Tabular for reporting b. Ranking c. Multi-variable Analysis d. Time Series Analysis e. Data Visualization, Graphing, Charting and pivoting f. Complex Textual Search g. Statistical Analysis h. AI Techniques for testing of hypothesis, trends discovery, definition, validation of Data Clusters and segments i. Information Mapping (i.e. mapping of Spatial Data in geographic information systems) j. Ad-hoc User Specified Queries k. Pre-defined repeatable queries l. Interactive drill-down reporting and analysis m. Complex queries with multiple joins, multi-level subquesries, and sophisticated search criteria.
  • 25. Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 25Prof. S.K. Pandey, I.T.S, Ghaziabad 2525 Building a Data Warehouse Contd.. 4. Implementation Considerations ii. Data Extraction, Cleanup, Transformation, and Migration As a components of the Data Warehouse architecture, proper attention must be given to Data Extraction, which represents a critical success factor for a data warehouse architecture. 1. The ability to identify data in the data source environments that can be read by conversion tool is important. This additional step may affect the timeliness of data delivery to the warehouse. 2. Support for the flat files. (VSAM, ISM, IDMS) is critical, since bulk of the corporate data is still maintained in this type of data storage. 3. The capability to merge data from multiple data stores is required in many installations. 4. The specification interface to indicate the data to extracted and the conversion criteria is important. 5. The ability to read information from data dictionaries or import information from repository product is desired. 6. The ability to perform data-type and character-set translation is a requirement when moving data moving between incompatible systems. 7. The capability to create summarization, aggregation, and derivation records and fields is very important.
  • 26. 26Prof. S.K. Pandey, I.T.S, Ghaziabad 2626262626262626 Building a Data Warehouse Contd.. 4. Implementation Considerations iii. Data Placement Strategies As Data Warehouse grows, there are at least two options for Data Placement. One is to put some of the data in the data warehouse into another storage media (WORM, RAID). Second option is to distribute data in data warehouse across multiple servers. Some criteria must be established for dividing it over the servers – by geography, organization unit, time, function, etc. However, the data is divided, a single source of meta data across the entire organization is required. Hence this configuration requires both corporation-wide and the meta data managed for any given server.
  • 27. Prof. S.K. Pandey, I.T.S, Ghaziabad 27272727272727272727 Building a Data Warehouse Contd.. 4. Implementation Considerations iv. Metadata A frequently occurring problem in Data Warehouse is the problem of communicating to the end user what information resides in the data warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Meta data needs to be collected as the warehouse is designed and built.
  • 28. Prof. S.K. Pandey, I.T.S, Ghaziabad 2828Prof. S.K. Pandey, I.T.S, Ghaziabad 2828282828282828 Building a Data Warehouse Contd.. 4. Implementation Considerations v. User Sophistication Levels Data Warehousing is relatively new phenomenon, and a certain degree of sophistication is required on the end user’s part to effectively use the warehouse. The users can be classified on the basis of their skill level in accessing the warehouse: 1. Casual Users: These users are most comfortable retrieving information from the warehouse in pre-defined formats, and running preexisting queries and reports. 2. Power Users: In their delay activities, these users typically combine predefined queries with some relatively simple and ad-hoc queries that they create themselves. These users need access tools that combine the simplicity of pre-defined queries and reports with a certain degree of flexibility. 3. Experts: These users tend to create their own queries and perform sophisticated analysis on the information they retrieve from the warehouse. These users know the data, tools and database well enough to demand tools that allow for maximum flexibility and adaptability.
  • 29. 29292929292929 Benefits of Data Warehouse Successfully implemented data warehousing can realize some significance benefits which can be categorized in two categories: 1. Tangible Benefits: 1. Product inventory turnover is improved 2. Costs of product introduction are decreased with improved selection of target markets. 3. More cost effective decision making is enabled by separating (ad-hoc) query processing from running against operational database. 4. Better business intelligence is enabled by increased quality and flexibility of market analysis available through multi-level data structures, which may range from detailed to highly summarized. 2. Intangible Benefits: 1. Improved productivity 2. Reduced redundant processing, support, and software to support overlapping decision support applications 3. Enhanced Customer relations through improved knowledge of individual requirements and trends, through customization, improved communications, and tailored product offerings. 4. Enabling business process reengineering – data warehousing can provide useful insights into work process themselves, Prof. S.K. Pandey, I.T.S, Ghaziabad
  • 30. Prof. S.K. Pandey, I.T.S, Ghaziabad 30 Warehouse Database  The organizations that embarked on data warehousing development deal with ever increasing amounts of data. Generally speaking, the size of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. This search aims to pursue two goals: – Speed-up: the ability to execute the same request on the same amount data in less time – Scale-up: the ability to obtain the same performance on the same request as the database size increases. An additional and important goal is to achieve linear speed-up and scale-up, doubling the number of processors cuts the response time in half (linear speed-up) or provides the same performance on twice as much data (linear scale-up).
  • 31. Prof. S.K. Pandey, I.T.S, Ghaziabad 31 Mapping the Data Warehouse to a Multiprocessor Architecture  The goals of linear performance and scalability (discussed in previous slide) can be satisfied by parallel hardware architectures, parallel operating systems, and parallel DBMSs. Parallel hardware architectures are based on Multi-processor systems designed as a Shared-memory model (symmetric multiprocessors), Shared-disk model or distributed-memory model (MPP and Clusters of SMPs). Parallelism can be achieved in two different ways: – Horizontal Parallelism (Database is partitioned across different disks) – Vertical Parallelism (occurs among different tasks – all components query operations i.e. scans, join, sort) – Data Partitioning
  • 32. Prof. S.K. Pandey, I.T.S, Ghaziabad 32Prof. S.K. Pandey, I.T.S, Ghaziabad 32 Database Architectures for Parallel Processing  Shared-memory Architecture  Shared Disk Architecture  Shared-nothing Architecture  Combined Architecture
  • 33. Prof. S.K. Pandey, I.T.S, Ghaziabad 33Prof. S.K. Pandey, I.T.S, Ghaziabad 33Prof. S.K. Pandey, I.T.S, Ghaziabad 33 Parallel RDBMS Features  Data Warehouse development requires a good understanding of all architectural components, including the data warehouse DBMS Platform. Understanding the basic architecture of Warehouse database is the first step in evaluating and selecting a product.  State of the art parallel features the developers and users of the Warehouse should demand from the DBMS vendor:  Scope and techniques of Parallel DBMS  Queries (Insert/ Update/Delete)  DBMS that supports parallel database load, backup, reorganization and recovery is much better positioned for VLDBs.  Optimizer Implementation  Application Transparency  The Parallel environment  DBMS Management Tools  Price/ Performance
  • 34. Prof. S.K. Pandey, I.T.S, Ghaziabad 34Prof. S.K. Pandey, I.T.S, Ghaziabad 34Prof. S.K. Pandey, I.T.S, Ghaziabad 34Prof. S.K. Pandey, I.T.S, Ghaziabad 34 Parallel DBMS Vendors  ORACLE – Oracle supports Parallel Database processing with its add-on Oracle Parallel Server Option (OPS) and Parallel Query Option (PQO) with Query Coordinator.  Informix – Informix developed its Dynamic Scalable Architecture (DSA) to support Shared-Memory, Shared-Disk, and Shared-Nothing Models. Informix OnLine release 8, also known as XPS (eXtended Parallel Server), supports MPP Hardware platforms that include IBM, SP, AT & T, Sun, HP, ICL Goldrush, with sequent, Siemens, Pyramid etc.  IBM – DB2 Parallel Edition (DB2 PE), a Database based on DB2/6000 Server Architecture; latest version is DB2 Universal Database.  Sybase – Sybase implemented its parallel DBMS functionality in a product called SYBASE MPP (formerly Navigational Server). It was jointly developed by Sybase and NCR (formerly AT&T GIS), and its first release was targeted for the AT&T 3400, 3500 (both SMP) and 3600 (MPP) Platforms.  Other RDBMS Products i. NCR Teradata ii. Tandem NonStop SQL/MP  Specialized Database Products - i. Red Brick Systems ii. White Cross Systems Inc.
  • 35. Prof. S.K. Pandey, I.T.S, Ghaziabad 35 DBMS Schemas for Decision Support  Data Warehousing projects were forced to choose between a data model and a corresponding database schema that is intuitive for analysis but performs poorly and a model-schema that performs better but is not well suited for analysis.  As Data Warehousing continued to mature, new approaches to schema design resulted in schemas better suited to business analysis that is so crucial to successful data warehousing.  The schema methodology that is gaining widespread acceptance for Data Warehousing is the Star Schema.
  • 36. Prof. S.K. Pandey, I.T.S, Ghaziabad 36Prof. S.K. Pandey, I.T.S, Ghaziabad 36 Data Layout for best Access  The original objective in developing an abstract model known as Relational Model were to address a number of shortcomings of non-relational DBMS and application development.  The typical requirements for the RDBMS supporting operational systems are based on the need to effectively support a large number of small but simultaneous read and write requests.  The demand placed on the RDBMS by a Data Warehouse are very different. A data warehouse RDBMS typically needs to process queries that are large, complex, ad-hoc and data intensive.  Solving modern business problems such as market analysis and financial forecasting requires query-centric database schemas that are array-oriented and multi-dimensional in nature.
  • 37. Prof. S.K. Pandey, I.T.S, Ghaziabad 37Prof. S.K. Pandey, I.T.S, Ghaziabad 37 Multi-dimensional Data Model  The Multi-dimensional nature of business questions is reflected in the fact that, for example, marketing managers are no longer satisfied by asking simple one-dimensional questions such as “How much revenue did the new product generate by month, in northeastern division, broken down by user demographic, by sales office, relative to the previous version of the product, compared with the plan?” – a six dimensional question.
  • 38. Prof. S.K. Pandey, I.T.S, Ghaziabad 38Prof. S.K. Pandey, I.T.S, Ghaziabad 38Prof. S.K. Pandey, I.T.S, Ghaziabad 38 STAR SCHEMA  The Multi-dimensional view of Data that is expressed using relational database semantics is provided by the database schema design called Star Schema.  The basic premise of Star Schema is that information can be classified into two groups: facts and dimensions.  Facts are the core Data element being analyzed. For example, units of individual items sold are facts.  Dimensions are attributes about the facts. For example, dimensions are the product types purchased and date of purchase.
  • 39. Prof. S.K. Pandey, I.T.S, Ghaziabad 39 Data Extraction, Cleanup & Transformation Tools  The task of capturing data from a source data system, cleaning and transforming it and then loading the results into a target data system can be carried out either by separate products, or by a single integrated solution. More contemporary integrated solutions can fall into one of the categories described below: – Code Generators – Database data Replications – Rule-driven Dynamic Transformation Engines (Data Mart Builders)
  • 40. Prof. S.K. Pandey, I.T.S, Ghaziabad 40Prof. S.K. Pandey, I.T.S, Ghaziabad 40 Code Generator – It creates 3GL/4GL transformation programs based on source and target data definitions, and data transformation and enhancement rules defined by the developer. – This approach reduces the need for an organization to write its own data capture, transformation, and load programs. These products employ DML Statements to capture a set of the data from source system. – These are used for data conversion projects, and for building an enterprise-wide data warehouse, when there is a significant amount of data transformation to be done involving a variety of different flat files, non-relational, and relational data sources.
  • 41. Prof. S.K. Pandey, I.T.S, Ghaziabad 41Prof. S.K. Pandey, I.T.S, Ghaziabad 41Prof. S.K. Pandey, I.T.S, Ghaziabad 41 Database Data Replication Tools – These tools employ database triggers or a recovery log to capture changes to a single data source on one system and apply the changes to a copy of the data source data located on a different system. – Most replication products do not support the capture of changes to non-relational files and databases, and often do not provide facilities for significant data transformation and enhancement. – These point-to-point tools are used for disaster recovery and to build an operational data store, a data warehouse, or a data mart when the number of data sources involved are small and a limited amount of data transformation and enhancement is required.
  • 42. Prof. S.K. Pandey, I.T.S, Ghaziabad 42Prof. S.K. Pandey, I.T.S, Ghaziabad 42Prof. S.K. Pandey, I.T.S, Ghaziabad 42Prof. S.K. Pandey, I.T.S, Ghaziabad 42 Rule-driven Dynamic Transformation Engines – They are also known as Data Mart Builders and capture data from a source system at User-defined intervals, transform data, and then send and load the results into a target environment, typically a data mart. – To date most of the products of this category support only relational data sources, though now this trend have started changing. – Data to be captured from source system is usually defined using query language statements, and data transformation and enhancement is done on a script or a function logic defined to the tool. – With most tools in this category, data flows from source systems to target systems through one or more servers, which perform the data transformation and enhancement. These transformation servers can usually be controlled from a single location, making the job of such environment much easier.
  翻译: