尊敬的 微信汇率:1円 ≈ 0.046166 元 支付宝汇率:1円 ≈ 0.046257元 [退出登录]
SlideShare a Scribd company logo
Unit: 2 Data Preprocessing
Outline of the chapter
• Data types and attribute types
• Data pre- processing
• OLAP
• Characteristics of OLAP Systems
• Multidimensional views and data cubes
• Data cube implementations
• Data cube operations
• Guidelines for OLAP Implementation.
July 2, 2019 2Compiled By: Kamal Acharya
2.1. Data types and attribute types
• Outline:
– Attributes and Objects
– Types of Data
July 2, 2019 3Compiled By: Kamal Acharya
What is Data?
• Collection of data objects
and their attributes
• An attribute is a property or
characteristic of an object
– Examples: eye color of a person,
temperature, etc.
– Attribute is also known as
variable, field, characteristic,
dimension, or feature
• A collection of attributes
describe an object
– Object is also known as record,
point, case, sample, entity, or
instance
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Attributes/ dimension
Objects
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
July 2, 2019 5Compiled By: Kamal Acharya
Types of Attributes
• The type of attribute is determined by the set of
possible values the attribute can have.
• There are different types of attributes:
– Nominal attributes
– Binary attributes
– Ordinal attributes
– Numeric attributes
• Interval-scaled attributes
• Ratio-scaled attributes
July 2, 2019 6Compiled By: Kamal Acharya
Contd..
• Nominal Attributes:
– Nominal means “relating to names.”
– The values of a nominal attribute are symbols or names of
things.
– Each value represents some kind of category, code, or state, and
so nominal attributes are also referred to as categorical.
– The values do not have any meaningful order.
– E.g., : Hair_color = { black, brown, grey, red, white, etc}
Marital _status= { single, married, divorced}
July 2, 2019 7Compiled By: Kamal Acharya
Contd..
• Binary Attributes:
– A binary attribute is a nominal attribute with only two categories or states:
0 or 1, where 0 typically means that the attribute is absent, and 1 means
that it is present.
– Symmetric binary: both outcomes equally important
• e.g., gender = {male ,female}
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome(rarest)
(e.g., HIV positive) and other by 0(e.g., HIV negative)
July 2, 2019 8Compiled By: Kamal Acharya
Contd..
• Ordinal Attribute:
– An ordinal attribute is an attribute with possible values that
have a meaningful order or ranking among them, but the
magnitude between successive values is not known.
– E.g.: suppose that drink size corresponds to the size of drinks available at
a fast-food restaurant. This nominal attribute has three possible values:
small, medium, and large. The values have a meaningful sequence (which
corresponds to increasing drink size); however, we cannot tell from the
values how much bigger, say, a medium is than a large.
July 2, 2019 9Compiled By: Kamal Acharya
Contd..
• Numeric Attributes:
– It is a measurable quantity, represented in integer or real values.
– Numeric attributes can be interval-scaled or ratio-scaled.
– Interval-scaled :
• Interval-scaled attributes are measured on a scale of equal-size units.
• E.g.: calendar dates, temperature in Celsius
– Ratio-scaled attributes:
• a value as being a multiple (or ratio) of another value.
• it has a zero point or character of origin
• Ratio are meaningful
• examples are height, weight, money, age
July 2, 2019 10Compiled By: Kamal Acharya
Discrete vs. Continuous Attributes
• There are many ways to organize attribute types.
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., Roll number, the set of words in a collection of
documents
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, Speed, etc
– Continuous attributes are typically represented as floating-point
variables
July 2, 2019 11Compiled By: Kamal Acharya
Properties of Attribute Values
• Distinctness: = ≠
• Order: < >
• Addition: + -
• Multiplication: * /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
July 2, 2019 Compiled By: Kamal Acharya 12
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Generic graph
– Social or information networks
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
July 2, 2019 13Compiled By: Kamal Acharya
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes
Tid Refund Marital
Status
Taxable
Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
July 2, 2019 14Compiled By: Kamal Acharya
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
• Such data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each attribute
1.12.216.226.2512.65
1.22.715.225.2710.23
ThicknessLoadDistanceProjection
of y load
Projection
of x Load
1.12.216.226.2512.65
1.22.715.225.2710.23
ThicknessLoadDistanceProjection
of y load
Projection
of x Load
July 2, 2019 15Compiled By: Kamal Acharya
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.
Document 1
season
timeout
lost
win
game
score
ball
play
coach
team
Document 2
Document 3
3 0 5 0 2 6 0 2 0 2
0
0
7 0 2 1 0 0 3 0 0
1 0 0 1 2 2 0 3 0
July 2, 2019 16Compiled By: Kamal Acharya
Transaction Data
• A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
July 2, 2019 17Compiled By: Kamal Acharya
Graph Data
• Examples:
– Generic graph
– World-wide web
– Social or information networks
5
2
1
2
5
July 2, 2019 18Compiled By: Kamal Acharya
Ordered Data
• Video data: sequence of images
• Temporal data: time-series
• Sequential Data: transaction sequences
July 2, 2019 19Compiled By: Kamal Acharya
2.2. Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
July 2, 2019 20Compiled By: Kamal Acharya
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: missing attribute values, lack of certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1, 2, 3”, now rating “A, B, C”
July 2, 2019 21Compiled By: Kamal Acharya
Why Is Data Preprocessing Important?
• To make data more suitable for data mining.
• To improve the data mining analysis with respect to time, cost
and quality.
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data mining example:
• a classification model for detecting people who are loan risks is built using poor
data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
July 2, 2019 22Compiled By: Kamal Acharya
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar analytical
results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
July 2, 2019 23Compiled By: Kamal Acharya
Contd..
July 2, 2019 24Compiled By: Kamal Acharya
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
July 2, 2019 25Compiled By: Kamal Acharya
Data Cleaning
• If data is dirty(incomplete, noisy, inconsistent) then:
– Users can not trust any results of data mining
– Can cause confusion for the data mining procedure, resulting in unreliable output.
• So, data cleaning is required.
• To clean data the following data cleaning tasks are performed:
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
July 2, 2019 26Compiled By: Kamal Acharya
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to:
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
• Missing data may need to be inferred.
July 2, 2019 27Compiled By: Kamal Acharya
How to Handle Missing Data?
• Ignore the tuple:
– usually done when class label is missing
• Fill in the missing value manually:
– tedious + infeasible?
• Use a global constant to fill in the missing value:
– e.g., “unknown”
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value:
July 2, 2019 28Compiled By: Kamal Acharya
Noisy Data
• Noise:
– random error or variance in a measured variable
• Noise (Incorrect attribute) values may due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
July 2, 2019 29Compiled By: Kamal Acharya
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into bins
– then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect doubtful values and check by human
• Regression
– smooth by fitting the data into regression functions
July 2, 2019 30Compiled By: Kamal Acharya
Binning
• Three step process:
– Sort the data
– Make the bins by partitioning
– Smooth the data in each bins
July 2, 2019 31Compiled By: Kamal Acharya
Contd…
• Partitioning techniques to make bins:
– Equal-width (distance) partitioning:
– Equal-depth (frequency) partitioning
July 2, 2019 32Compiled By: Kamal Acharya
Contd…
– Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B-A)/N.
– Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately
same number of samples
July 2, 2019 33Compiled By: Kamal Acharya
Example: Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
July 2, 2019 34Compiled By: Kamal Acharya
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
July 2, 2019 35Compiled By: Kamal Acharya
Data Integration
• Data integration combines data from multiple sources into a
coherent store(e.g., DW).
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent
data mining process.
July 2, 2019 36Compiled By: Kamal Acharya
Contd..
• But, Entity identification problem:
• for the same real world entity, attribute values from different sources
are different.
• For example: how can the data analyst or the computer be sure that
customer id in one database and cust_number in another refer to the
same attribute?
• Solution: The Meta data can be used to help the transformation of data
July 2, 2019 37Compiled By: Kamal Acharya
Contd..
• Handling Redundant Data:
• Redundant data occur often when integration of multiple databases
– The same attribute may have different names in different
databases Careful integration of the data from multiple sources
may help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
July 2, 2019 38Compiled By: Kamal Acharya
Data Transformation
• In this preprocessing step, the data are transformed or consolidated
so that the resulting mining process may be more efficient, and the
patterns found may be easier to understand.
• Data Transformation Strategies:
– Smoothing
– Attribute construction
– Aggregation
– Normalization
– Discretization
– Concept hierarchy generation
July 2, 2019 39Compiled By: Kamal Acharya
Contd..
• Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
• Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of attributes
to help the mining process.
• Aggregation, where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total amounts.
July 2, 2019 40Compiled By: Kamal Acharya
Contd..
• Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as: -1.0 to 1.0, or 0.0 to 1.0.
• Discretization, where the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior).
• Concept hierarchy generation, where attributes such as street can
be generalized to higher-level concepts, like city or country.
July 2, 2019 41Compiled By: Kamal Acharya
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
July 2, 2019 42Compiled By: Kamal Acharya
Data Reduction
• Warehouse may store terabytes of data:
– Complex data mining may take a very long time to run on the
complete data set.
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results.
July 2, 2019 43Compiled By: Kamal Acharya
Data Reduction Strategies
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Histograms
– clustering
– sampling
– Discretization and concept hierarchy generation
July 2, 2019 44Compiled By: Kamal Acharya
Contd..
• Data Cube Aggregation:
– Data for sales per quarter, for the years 2008 to 2010.
– interested in the annual sales (total per year), rather than the total per
quarter.
– Thus, the data can be aggregated so that the resulting data summarize the
total sales per year instead of per quarter.
– The resulting data set is smaller in volume, without loss of information
necessary for the analysis task.
– This aggregation is illustrated in Figure below.
July 2, 2019 45Compiled By: Kamal Acharya
Contd..
– Dimensionality Reduction:
Feature selection (i.e., attribute subset selection):
• Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
• reduce # of patterns in the patterns, easier to understand
July 2, 2019 46Compiled By: Kamal Acharya
Histograms
• A popular data reduction technique
• Divide data into buckets and store average (sum) for each bucket
0
5
10
15
20
25
30
35
40
10000 30000 50000 70000 90000
July 2, 2019 47Compiled By: Kamal Acharya
Clustering
• Partition data set into clusters, and one can store cluster
representation only.
July 2, 2019 48Compiled By: Kamal Acharya
Sampling
• Sampling is the main technique employed for data reduction.
– It is often used for both the preliminary investigation of the data and the
final data analysis.
• Statisticians often sample because obtaining the entire set of data of
interest is too expensive or time consuming.
• Sampling is typically used in data mining because processing the
entire set of data of interest is too expensive or time consuming.
July 2, 2019 49Compiled By: Kamal Acharya
Contd…
• The key principle for effective sampling is the
following:
– Using a sample will work almost as well as using the entire
data set, if the sample is representative
– A sample is representative if it has approximately the same
properties (of interest) as the original set of data
July 2, 2019 50Compiled By: Kamal Acharya
Contd..
• Discretization
– reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
– Interval labels can then be used to replace actual data values.
• Concept hierarchies
– reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
July 2, 2019 51Compiled By: Kamal Acharya
Summary
• Data preparation is a big issue for both warehousing and
mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• A lot of methods have been developed but still an active area
of research
July 2, 2019 52Compiled By: Kamal Acharya
Data Warehouse: A Multi-Tiered Architecture
Data
Warehouse
Extract
Transform
Load
Refresh
OLAP Engine
Analysis
Query
Reports
Data mining
Monitor
&
Integrator
Metadata
Data Sources Front-End Tools
Serve
Data Marts
Operational
DBs
Other
sources
Data Storage
OLAP Server
July 2, 2019 53Compiled By: Kamal Acharya
OLAP
• OLAP is a software technology concerned with fast analysis of enterprise
information.
• Often OLAP systems are data warehouse front end software tools to make
aggregate data available efficiently to an enterprise’s decision makers
(analysts, managers and executives).
• Major OLAP applications are trend analysis over a number of time periods,
slicing, dicing , drill-down and roll-up to look at the data at different levels
of detail and pivoting or rotating to obtain a new multidimensional view.
July 2, 2019 54Compiled By: Kamal Acharya
Characteristics of OLAP Systems
• Comparison between OLTP and OLAP systems:
– This comparison highlights some of the characteristics of OLAP
systems.
– The difference between the two types of systems are as follows:
– Users:
• OLTP systems are designed for office workers while the OLAP
systems are designed for decision makers. Therefore , while an
OLTP system may be accessed by hundreds or even thousands of
users in a large enterprise, an OLAP system is likely to be accessed
only by a selected group of managers and may be used by dozens
of users.
July 2, 2019 55Compiled By: Kamal Acharya
Contd..
• Functions:
– OLTP systems are mission-critical (vital to the functioning of an
organization.). These systems carry out simple repetitive operations.
– OLAP systems on the other hand are management critical to support an
enterprise's decision support functions using analytical investigations.
These are ad-hoc and often much more complex operations.
July 2, 2019 56Compiled By: Kamal Acharya
Contd..
• Nature:
– Nature of queries in OLTP system is simple
– Nature of queries in OLAP system is complex
– Nature of usage of OLTP system is repetitive
– Nature of usage of OLAP system is mostly ad hoc
July 2, 2019 57Compiled By: Kamal Acharya
Contd..
• Design:
– OLTP systems are designed to be application-oriented while OLAP
systems are designed to be subject-oriented.
– OLTP systems view the enterprise data as a collection of tables while
OLAP systems view enterprise information as multidimensional.
July 2, 2019 58Compiled By: Kamal Acharya
Contd..
• Data:
– OLTP systems normally deal only with the current status of
information.
– On the other hand, OLAP systems require historical data over several
years since trend are often important in decision making.
July 2, 2019 59Compiled By: Kamal Acharya
Contd..
• Kinds of use:
– OLTP systems are used for read and write operations while OLAP
systems normally do not update the data but refresh the data.
July 2, 2019 60Compiled By: Kamal Acharya
Contd..
• Other features that distinguish between OLTP and OLAP system are summarized in
the following table:
July 2, 2019 61Compiled By: Kamal Acharya
FASMI Characteristics of OLAP systems
• The FASMI characteristics of OLAP systems, the name
derived from the first letters of the characteristics, are:
– Fast
– Analytic
– Shared
– Multidimensional
– Information
July 2, 2019 62Compiled By: Kamal Acharya
Contd..
• Fast:
– OLAP queries should be answered very quickly, perhaps
within seconds.
– To achieve such performance:
• the data structure must be efficient and the hardware must be powerful.
• Full pre-computation of aggregates
• Pre-compute the most commonly queried aggregates.
July 2, 2019 63Compiled By: Kamal Acharya
Contd..
• Analytic:
– An OLAP system must provide rich analytic functionality and it is
expected that most OLAP queries can be answered without any
programming.
– The system should be able to cope with any relevant queries for the
application and the user.
July 2, 2019 64Compiled By: Kamal Acharya
Contd..
• Shared:
– An OLAP system is a shared resource although it is unlikely to be
shared by hundreds of users.
– An OLAP system is likely to be accessed only by a selected group of
managers and may be use by mere dozens of users.
– Being a shared system, an OLAP system should provide adequate
security for confidentiality as well as integrity.
July 2, 2019 65Compiled By: Kamal Acharya
Contd..
• Multidimensional:
– This is the basic requirement.
– Whatever OLAP software is being used, it must provide a
multidimensional conceptual view of the data.
July 2, 2019 66Compiled By: Kamal Acharya
Contd..
• Information:
– OLAP systems usually obtain information from a data warehouse.
– The system should be able to handle a large amount of input data.
– The capacity of an OLAP system to handle information and its
integration with the data warehouse may be critical.
July 2, 2019 67Compiled By: Kamal Acharya
Codd’s OLAP characteristics
• The most important characteristics of OLAP systems provided by the
Codd are as follows:
– Multidimensional conceptual view
– Accessibility(OLAP as a mediator)
– Batch extraction vs interpretive
– Multi-user support
– Storing OLAP result
– Extraction of missing values
– Treatment of missing values
– Uniform reporting performance
– Generic dimensionality
– Unlimited dimensions and aggregation levels
July 2, 2019 68Compiled By: Kamal Acharya
Contd..
• Multidimensional conceptual view:
– By requiring a multidimensional view, it is possible to carry out
operations like slice and dice.
• Accessibility (OLAP as a mediator):
– The OLAP software should be sitting between data sources(e..g., a data
warehouse) and an OLAP front- end.
July 2, 2019 69Compiled By: Kamal Acharya
Contd..
• Batch extraction versus interpretive:
– An OLAP system should provide multidimensional data staging plus
partial pre-calculation of aggregates in large multidimensional
databases.
• Multi- user support:
– Since the OLAP system is shared, the OLAP software should provide
many normal database operations including retrieval, update,
concurrency control, integrity and security.
July 2, 2019 70Compiled By: Kamal Acharya
Contd..
• Storing OLAP results:
– OLAP results data should be kept separate from source data.
• Extraction of missing values:
– The OLAP system should distinguish missing values form zero values.
– A large data cube may have a large number of zeros as well as some
missing values.
– If a distinction is not made between zero values and missing values, the
aggregates are likely to be computed incorrectly.
July 2, 2019 71Compiled By: Kamal Acharya
Contd..
• Treatment of missing values:
– An OLAP system should ignore all missing values regardless of their
source.
– Correct aggregate values will be computed once the missing values are
ignored.
• Uniform reporting performance:
– Increasing the number of dimensions or database size should not
significantly degrade the reporting performance of the OLAP system.
– This is good objective although it may be difficult to achieve in
practice.
July 2, 2019 72Compiled By: Kamal Acharya
Contd..
• Generic dimensionality:
– An OLAP system should treat each dimension as equivalent in both its
structure and operational capabilities. Additional operational
capabilities may be granted to be selected dimensions but such
additional functions should be grantable to be any dimension
• Unlimited dimensions and aggregation levels:
– An OLAP system should allow unlimited dimensions and aggregations
and aggregation levels.
– but In practice, this is undesirable.
July 2, 2019 73Compiled By: Kamal Acharya
Example OLAPApplications
• Understanding and improving sales:
– OLAP can assist in finding the most popular products and the most
popular channels for selling the products.
• E,g., Findwhich itemsare frequentlysoldover the summerbut not overwinter?
– OLAP can assist in finding most profitable customers.
July 2, 2019 74Compiled By: Kamal Acharya
Contd..
• Understanding and reducing costs of doing business:
– Improving sales is one aspect of improving business, the other aspect is
to analyze costs and to control them as much as possible without
affecting sales.
– OLAP can assist in analyzing the costs associated with sales.
– In some cases, it may also be possible to identify expenditures that
produce a high return on investment.
July 2, 2019 75Compiled By: Kamal Acharya
Contd..
• CreditCard Companies:
• Given a new applicant, does (s)hea credit-worthy?
• Need to check other similarapplicants (age, gender, income, etc…) and
observehow theyperform,then do prediction for newapplicant
July 2, 2019 76Compiled By: Kamal Acharya
Multi-dimensional views and Data cubes
• Data warehouses and OLAP tools are based on a
multidimensional data model. This model views data in the
form of a data cube.
• What is a data cube?
– A data cube allows data to be modeled and viewed
in multiple dimensions. It is defined by dimensions
and facts.
July 2, 2019 77Compiled By: Kamal Acharya
Contd..
• Dimensions:
– dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
– For example, AllElectronics may create a sales data warehouse in
order to keep records of the store’s sales with respect to the dimensions
time, item, branch, and location.
– These dimensions allow the store to keep track of things like monthly
sales of items and the branches and locations at which the items were
sold.
July 2, 2019 78Compiled By: Kamal Acharya
Contd..
• Facts:
– Facts are numeric measures.
• i.e., quantities by which we want to analyze relationships between
dimensions.
– Examples of facts for a sales data warehouse include dollars sold (sales
amount in dollars), units sold (number of units sold), and amount
budgeted.
July 2, 2019 79Compiled By: Kamal Acharya
Contd..
• usually cubes are 3-D geometric structures, but in data
warehousing the data cube is n-dimensional.
• a simple 2-D data cube: a table or spreadsheet
• E.g.,
July 2, 2019 80Compiled By: Kamal Acharya
Contd..
• 3-D data cube: a set of similarly structured 2-D tables stacked on top of one
another.
• E.g.,
July 2, 2019 81Compiled By: Kamal Acharya
Contd..
• The 3-D data in the table are represented as a series of 2-D tables called 3-D data cube,
as in Figure below.
• Fig: A 3-D data cube representation of the data in Table previous slide, according to
time, item, and location.
July 2, 2019 82Compiled By: Kamal Acharya
Contd..
• 4-D cubes: a 4-D cube is a series of 3-D cubes, as shown in Figure below:
• in this way, we may display any n-dimensional data as a series of (n-1)-
Dimensional “cubes.”
• Note: The data cube is a metaphor for multidimensional data storage. The
actual physical storage of such data may differ from its logical representation.
data cubes are n-dimensional and do not confine data to 3-D.
July 2, 2019 83Compiled By: Kamal Acharya
Data Cube implementation
• Efficient data cube computation:
– No Materialization
– Full Materialization
– Partial Materialization
• Access methods: How OLAP data can be indexed(Bit map and join indices)
• Query processing technique
• OLAP server types
– ROLAP
– MOLAP
– HOLAP
July 2, 2019 84Compiled By: Kamal Acharya
Computation of Data Cubes
• Data warehouses contain huge volumes of data.
• OLAP servers demand that decision support queries to answered in the
order of seconds.
• It is crucial for data warehouse systems to support highly efficient cube
computation techniques, access methods and query processing
techniques.
Efficient Computation of Data Cubes
• Cuboids:
– Data at different degree of summarization/ aggregations is often
referred to as a cuboid.
– Given a set of dimensions, we can generate a cuboid for each of the
possible subsets of the given dimensions.
– The result would form a lattice of cuboids, each showing the data at a
different level of summarization(or group-by/aggregation).
– The lattice of cuboids is then referred to as a data cube.
July 2, 2019 86Compiled By: Kamal Acharya
Contd..
• Example:
– Suppose that you would like to create a data cube for ALLElectronics
sales that contains : city, item, year, as the dimensions for the data cube
and sales_in_dollars as the measure.
– The possible group-by’s are the following:
• {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ( )},
• where ( ) means that the group-by is empty (i.e., the dimensions are not grouped).
– These group-by’s form a lattice of cuboids for the data cube, as shown
in Figure below.
July 2, 2019 87Compiled By: Kamal Acharya
Contd..
July 2, 2019 88Compiled By: Kamal Acharya
Contd..
• You would like to be able to analyze the data, with queries
such as following:
• “Compute the sum of sales, grouping by city and item.”
• “Compute the sum of sales, grouping by city.”
• “Compute the sum of sales, grouping by item.”
July 2, 2019 89Compiled By: Kamal Acharya
Contd..
• Special types of Cuboids:
– Base cuboid: The base cuboid contains all three dimensions,
city, item, and year. It can return the total sales for any
combination of the three dimensions. The base cuboid is the
least generalized (most specific) of the cuboids.
– Apex cuboid: The apex cuboid, or 0-D cuboid, refers to the
case where the group-by is empty. It contains the total sum of
all sales. The apex cuboid is the most generalized (least
specific) of the cuboids, and is often denoted as all.
July 2, 2019 90Compiled By: Kamal Acharya
• OLAP may need to access different cuboids for different queries.
• a good idea :
• compute all or at least some of the cuboids in a data cube in advance.
• Pre-computation leads to fast response time and
• avoids some redundant computation.
• But, required storage space may explode (due to pre-computation of all
cuboid, large number of dimensions and large number of concept
hierarchies of dimensins)
• This problem is referred to as the curse of dimensionality
Curse of Dimensionality
There are three choices for data cube materialization(computation of
cuboids) given a base cuboid:
1. No Materialization
2. Full Materialization
3. Partial Materialization
Data cube Materialization
•No Materialization
Do not pre-compute any of the “non-base” cuboid.
This leads to computing expensive multidimensional aggregates on the fly,
which can be extremely slow.
Contd..
•Full Materialization
•Pre-compute all of the cuboids.
•The resulting lattice of computed cuboids is referred to as the full cube.
•This choice typically requires huge amounts of memory space in order to
store all of the pre-computed cuboids.
Contd..
•Partial Materialization
Selectively compute a proper subset of the whole set of possible cuboids.
It represents an interesting trade-off between storage space and response
time.
The partial materialization of cuboids or sub-cubes should consider three
factors:
 Identify the subset of cuboids or sub-cubes to materialize
 Exploit the materialized cuboids or sub-cubes during query
processing
 Efficiently update the materialized cuboid or sub-cubes during load
and refresh.
Contd..
Indexing OLAP DATA
• To facilitate efficient data accessing to further speed up query
processing.
• Two most commonly used methods
• The Bitmap indexing method and
• Join indexing method
• Bitmap Indexing:
• In the bitmap index for a given attribute, there is a distinct bit vector, Bv,
for each value v in the domain of the attribute.
• If the attribute has the value v for a given row in the data table, then the bit
representing that value is set to 1 in the corresponding row of the bitmap
index. All other bits for that row are set to 0.
• Bitmap indexing reduces join, aggregation, and comparison operations to
bit arithmetic.
Contd..
Figure below shows a base (data) table containing the dimensions item and city,
and its mapping to bitmap index tables for each of the dimensions.
Contd..
• Join indexing:
• The join indexing method gained popularity from its use in
relational database query processing.
• Join indexing registers the joinable rows of two relations from a
relational database.
• Hence, the join index records can identify joinable tuples
without performing costly join operations.
Contd..
Example: join index relationship between the sales fact table and the location and
item dimension tables is shown in figure below
Contd..
Here, the “Main Street” value in the location dimension table joins with tuples T57,
T238, and T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and
T459 of the sales fact table.
Contd..
The corresponding join index tables are shown in Figure below.
•The purpose of materializing cuboids and constructing OLAP
index structures is to speed up query processing in data cubes.
Given materialized views, query processing should proceed as
follows:
1. Determine which operations should be performed on the available
cuboids.
2. Determine to which materialized cuboid(s) the relevant operations
should be applied.
Efficient Processing of OLAP Queries
Types of OLAP Servers
• OLAP servers present business users with multidimensional data from data
warehouses, without concerns regarding how or where the data are stored.
• However, the physical architecture and implementation of OLAP servers
must consider data storage issues.
• Implementations of a warehouse server for OLAP processing include the
following:
 Relational OLAP (ROLAP)
 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
July 2, 2019 103Compiled By: Kamal Acharya
Contd..
• Relational OLAP (ROLAP) Server:
– These are the intermediate servers that stand in between a relational back-
end server and client front-end tools.
– They use a relational or extended-relational DBMS to store and manage
warehouse data, and OLAP middleware to support missing pieces.
– ROLAP servers include optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and
services.
– ROLAP technology tends to have greater scalability than MOLAP
technology.
July 2, 2019 104Compiled By: Kamal Acharya
Contd..
• Multidimensional OLAP (MOLAP) Server:
– These servers supports multidimensional views of data through array-based
multidimensional storage engines.
– They map multidimensional views directly to data cube array structures.
– The advantages of using a data cube is that it allows fast indexing to pre-
computed summarized data.
– In multidimensional data stores, the storage utilization may be low if the
data set is sparse.
July 2, 2019 105Compiled By: Kamal Acharya
Contd..
• Hybrid OLAP (HOLAP) Servers:
• The hybrid OLAP approach combines ROLAP and MOLAP technology.
• Benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP.
July 2, 2019 106Compiled By: Kamal Acharya
Contd..
• MOLAP vs. ROLAP:
July 2, 2019 107Compiled By: Kamal Acharya
MOLAP ROLAP
Information retrieval is fast. Information retrieval is comparatively
slow.
Uses sparse array to store data-sets. Uses relational table.
MOLAP is best suited for
inexperienced users, since it is very
easy to use.
ROLAP is best suited for experienced
users.
Maintains a separate database for
data cubes.
It may not require space other than
available in the Data warehouse.
DBMS facility is weak. DBMS facility is strong.
Data Cube operations
• A number of operations may be applied to data cubes for
OLAP.
• Data cube operations are also known as OLAP operations. The
common ones are:
– Slice
– dice
– Roll-up(Drill-up)
– Drill-down(Roll-down)
– Pivot(Rotate)
July 2, 2019 108Compiled By: Kamal Acharya
Contd..
• The cube contains the dimensions location, time,
and item, where location is aggregated with
respect to city values, time is aggregated with
respect to quarters, and item is aggregated with
respect to item types.
– The measure displayed is dollars sold (in
thousands).
– The data examined are for the cities Chicago,
New York, Toronto, and Vancouver.
July 2, 2019 109Compiled By: Kamal Acharya
A data cube for AllElectronics sales to illustrate data cube operation:
Contd..
• Slice:
– Slice operation performs a
selection on one dimension
of the given cube, thus
creates subset a cube.
– Below example depicts how
slice operation works-
July 2, 2019 110Compiled By: Kamal Acharya
Contd..
• Dice:
– Dice operation performs a
selection on two or more
dimension from a given
cube and creates a sub-
cube.
– Below example depicts
how Dice operation works-
July 2, 2019 111Compiled By: Kamal Acharya
Contd..• Roll-up(Drill-up):
– The roll-up operation
performs aggregation
on a data cube, either :
• by climbing up a concept
hierarchy for a dimension
or
• by dimension reduction.
– Below example depicts
how roll-up operation
works-
July 2, 2019 112Compiled By: Kamal Acharya
Contd..• Drill-down(Roll-down):
– Drill-down is the reverse operation of
roll-up. It is performed by either of
the following ways:
• By stepping down a concept hierarchy for
a dimension
• By introducing a new dimension.
– It allows users to navigate among
different levels of data i.e. most
summarized (up) to most details
(down).
– Below example depicts how Drill-
down operation works
July 2, 2019 113Compiled By: Kamal Acharya
Contd..
• Pivot:
– Pivot also known as
rotation changes the
dimensional rotation of the
cube, i.e. rotates the axes to
view the data from
different perspectives. The
below cubes shows 2D
representation of Pivot
July 2, 2019 114Compiled By: Kamal Acharya
Guidelines for OLAP implementation
• A number of Guidelines for successful implementation of
OLAP are as follows:
– Vision
– Senior management support
– Selecting an OLAP tool
– Corporate strategy
– Focus on the users
– Joint management
– Review and adapt
July 2, 2019 115Compiled By: Kamal Acharya
Contd..
• Vision:
– The OLAP team must, in consultation with the users, develop a clear
vision for the OLAP system. This vision including the business
objectives should be clearly defined, understood, and shared by the
stakeholders.
• Senior management support:
– The OLAP project should fully supported by the senior managers,
since a data warehouse may have been developed already this should
not be difficult.
July 2, 2019 116Compiled By: Kamal Acharya
Contd..
• Selecting an OLAP tool:
– The OLAP team should familiarize themselves with the ROLP and
MOLAP tools available in the market. Since tools are quite different,
careful planning may be required in selecting a tool that is appropriate
for the enterprise. In some situations, a combination of ROLAP and
MOLAP may be most effective.
• Corporate strategy:
– The OLAP strategy should fit with the enterprise strategy and business
objectives. A good fit will result in the OLAP tools being used more
widely.
July 2, 2019 117Compiled By: Kamal Acharya
Contd..
• Focus on users:
– The OLAP project should be focused on users. Users should, in
consultation with the technical professionals, decide what tasks will be
done first and what will be done later. Attempts should be made to
provide each user with a tool suitable for that person’s skill level and
information needs. A good GUI user interface should be provided to
non-technical users. The project can only be successful whit the full
support of the users.
July 2, 2019 118Compiled By: Kamal Acharya
Contd..
• Joint Management:
– The OLAP project must be managed by both the IT and business
professional. Many other people should be involved in supplying ideas.
An appropriate committee structure may be necessary to channel these
ideas
• Review and adapt:
– Organizations evolve and so must be OLAP system. Regular reviews of
the project may be required to ensure that the project is meeting the
current needs of the enterprise.
July 2, 2019 119Compiled By: Kamal Acharya
Data Mining vs. OLAP
• OLAP - Online Analytical Processing
– Provides you with a very good view of what is
happening, but can not predict what will happen in
the future or why it is happening.
• Data Mining is a combination of discovering techniques +
prediction techniques
July 2, 2019 120Compiled By: Kamal Acharya
Home Work
• What are dimension, members, measure and fact table?
• List the major difference between OLTP systems and OLAP systems.
• What is OLAP and its purpose? List the characteristics of OLAP systems.
• What is data cube and purpose of data cube? Use an example to illustrate
the use of data cube.
• What are ROLAP and MOLAP ?describe the two approaches and list their
advantages and disadvantages.
• Describe the operations(OLAP/ Cube operations) roll-up, drill-down, and
slice and dice.
• List the implementation guidelines for implementing OLAP.
July 2, 2019 121Compiled By: Kamal Acharya
Thank You !
Compiled By: Kamal Acharya 122July 2, 2019

More Related Content

What's hot

Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
varshakumar21
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
Krish_ver2
 
KNN
KNN KNN
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
Krish_ver2
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
Archana Swaminathan
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
Krish_ver2
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
varshakumar21
 
Data cubes
Data cubesData cubes
Data cubes
Mohammed
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
Salah Amean
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
Chathurangi Shyalika
 
Data cube
Data cubeData cube
Data cube
Hitesh Mohapatra
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Kamalakshi Deshmukh-Samag
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
Azad public school
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Jason Rodrigues
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
Andrew Ferlitsch
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
Krish_ver2
 
Knowledge Representation & Reasoning
Knowledge Representation & ReasoningKnowledge Representation & Reasoning
Knowledge Representation & Reasoning
Sajid Marwat
 
K-Folds Cross Validation Method
K-Folds Cross Validation MethodK-Folds Cross Validation Method
K-Folds Cross Validation Method
SHUBHAM GUPTA
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
Shuvra Ghosh
 

What's hot (20)

Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
KNN
KNN KNN
KNN
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Data cubes
Data cubesData cubes
Data cubes
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
 
Data cube
Data cubeData cube
Data cube
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Machine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
 
Knowledge Representation & Reasoning
Knowledge Representation & ReasoningKnowledge Representation & Reasoning
Knowledge Representation & Reasoning
 
K-Folds Cross Validation Method
K-Folds Cross Validation MethodK-Folds Cross Validation Method
K-Folds Cross Validation Method
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
 

Similar to Data Preprocessing

BIM Data Mining Unit2 by Tekendra Nath Yogi
 BIM Data Mining Unit2 by Tekendra Nath Yogi BIM Data Mining Unit2 by Tekendra Nath Yogi
BIM Data Mining Unit2 by Tekendra Nath Yogi
Tekendra Nath Yogi
 
Wk. 3. Data [12-05-2021] (2).ppt
Wk. 3.  Data [12-05-2021] (2).pptWk. 3.  Data [12-05-2021] (2).ppt
Wk. 3. Data [12-05-2021] (2).ppt
MdZahidHasan55
 
Data types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptxData types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptx
RupaRaj6
 
Lecture2.pdf
Lecture2.pdfLecture2.pdf
Lecture2.pdf
MuhammadAhmedSuhail
 
Sigma Kappa Donor Engagement Case Study
Sigma Kappa Donor Engagement Case StudySigma Kappa Donor Engagement Case Study
Sigma Kappa Donor Engagement Case Study
iMIS
 
Understanding big data and data analytics big data
Understanding big data and data analytics big dataUnderstanding big data and data analytics big data
Understanding big data and data analytics big data
Seta Wicaksana
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
Sulman Ahmed
 
Understanding big data and data analytics - Data Mining
Understanding big data and data analytics - Data MiningUnderstanding big data and data analytics - Data Mining
Understanding big data and data analytics - Data Mining
Seta Wicaksana
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
extraganesh
 
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.pptlghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
JITENDER773791
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
anjanishah774
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
ssuserb26f53
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
DEEPAK948083
 
Data Science Chapter 2.pdf
Data Science Chapter 2.pdfData Science Chapter 2.pdf
Data Science Chapter 2.pdf
Mpumelelo Ndlovu
 
OLAP technology
OLAP technologyOLAP technology
OLAP technology
Dr. Mahendra Srivastava
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
DataPreprocessing.ppt
DataPreprocessing.pptDataPreprocessing.ppt
DataPreprocessing.ppt
TesfahunAsmare1
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
ImXaib
 
Data mining Basics and complete description
Data mining Basics and complete description Data mining Basics and complete description
Data mining Basics and complete description
Sulman Ahmed
 
Advance Data Mining - Machine Learning -
Advance Data Mining - Machine Learning -Advance Data Mining - Machine Learning -
Advance Data Mining - Machine Learning -
MuhammadHaroon20656
 

Similar to Data Preprocessing (20)

BIM Data Mining Unit2 by Tekendra Nath Yogi
 BIM Data Mining Unit2 by Tekendra Nath Yogi BIM Data Mining Unit2 by Tekendra Nath Yogi
BIM Data Mining Unit2 by Tekendra Nath Yogi
 
Wk. 3. Data [12-05-2021] (2).ppt
Wk. 3.  Data [12-05-2021] (2).pptWk. 3.  Data [12-05-2021] (2).ppt
Wk. 3. Data [12-05-2021] (2).ppt
 
Data types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptxData types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptx
 
Lecture2.pdf
Lecture2.pdfLecture2.pdf
Lecture2.pdf
 
Sigma Kappa Donor Engagement Case Study
Sigma Kappa Donor Engagement Case StudySigma Kappa Donor Engagement Case Study
Sigma Kappa Donor Engagement Case Study
 
Understanding big data and data analytics big data
Understanding big data and data analytics big dataUnderstanding big data and data analytics big data
Understanding big data and data analytics big data
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Understanding big data and data analytics - Data Mining
Understanding big data and data analytics - Data MiningUnderstanding big data and data analytics - Data Mining
Understanding big data and data analytics - Data Mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.pptlghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
 
Data Science Chapter 2.pdf
Data Science Chapter 2.pdfData Science Chapter 2.pdf
Data Science Chapter 2.pdf
 
OLAP technology
OLAP technologyOLAP technology
OLAP technology
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
DataPreprocessing.ppt
DataPreprocessing.pptDataPreprocessing.ppt
DataPreprocessing.ppt
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
 
Data mining Basics and complete description
Data mining Basics and complete description Data mining Basics and complete description
Data mining Basics and complete description
 
Advance Data Mining - Machine Learning -
Advance Data Mining - Machine Learning -Advance Data Mining - Machine Learning -
Advance Data Mining - Machine Learning -
 

More from Kamal Acharya

Programming the basic computer
Programming the basic computerProgramming the basic computer
Programming the basic computer
Kamal Acharya
 
Computer Arithmetic
Computer ArithmeticComputer Arithmetic
Computer Arithmetic
Kamal Acharya
 
Introduction to Computer Security
Introduction to Computer SecurityIntroduction to Computer Security
Introduction to Computer Security
Kamal Acharya
 
Session and Cookies
Session and CookiesSession and Cookies
Session and Cookies
Kamal Acharya
 
Functions in php
Functions in phpFunctions in php
Functions in php
Kamal Acharya
 
Web forms in php
Web forms in phpWeb forms in php
Web forms in php
Kamal Acharya
 
Making decision and repeating in PHP
Making decision and repeating  in PHPMaking decision and repeating  in PHP
Making decision and repeating in PHP
Kamal Acharya
 
Working with arrays in php
Working with arrays in phpWorking with arrays in php
Working with arrays in php
Kamal Acharya
 
Text and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHPText and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHP
Kamal Acharya
 
Introduction to PHP
Introduction to PHPIntroduction to PHP
Introduction to PHP
Kamal Acharya
 
Capacity Planning of Data Warehousing
Capacity Planning of Data WarehousingCapacity Planning of Data Warehousing
Capacity Planning of Data Warehousing
Kamal Acharya
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
Kamal Acharya
 
Search Engines
Search EnginesSearch Engines
Search Engines
Kamal Acharya
 
Web Mining
Web MiningWeb Mining
Web Mining
Kamal Acharya
 
Information Privacy and Data Mining
Information Privacy and Data MiningInformation Privacy and Data Mining
Information Privacy and Data Mining
Kamal Acharya
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
Kamal Acharya
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
Kamal Acharya
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
Kamal Acharya
 
Introduction to Data Mining and Data Warehousing
Introduction to Data Mining and Data WarehousingIntroduction to Data Mining and Data Warehousing
Introduction to Data Mining and Data Warehousing
Kamal Acharya
 
Functions in Python
Functions in PythonFunctions in Python
Functions in Python
Kamal Acharya
 

More from Kamal Acharya (20)

Programming the basic computer
Programming the basic computerProgramming the basic computer
Programming the basic computer
 
Computer Arithmetic
Computer ArithmeticComputer Arithmetic
Computer Arithmetic
 
Introduction to Computer Security
Introduction to Computer SecurityIntroduction to Computer Security
Introduction to Computer Security
 
Session and Cookies
Session and CookiesSession and Cookies
Session and Cookies
 
Functions in php
Functions in phpFunctions in php
Functions in php
 
Web forms in php
Web forms in phpWeb forms in php
Web forms in php
 
Making decision and repeating in PHP
Making decision and repeating  in PHPMaking decision and repeating  in PHP
Making decision and repeating in PHP
 
Working with arrays in php
Working with arrays in phpWorking with arrays in php
Working with arrays in php
 
Text and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHPText and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHP
 
Introduction to PHP
Introduction to PHPIntroduction to PHP
Introduction to PHP
 
Capacity Planning of Data Warehousing
Capacity Planning of Data WarehousingCapacity Planning of Data Warehousing
Capacity Planning of Data Warehousing
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Information Privacy and Data Mining
Information Privacy and Data MiningInformation Privacy and Data Mining
Information Privacy and Data Mining
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Introduction to Data Mining and Data Warehousing
Introduction to Data Mining and Data WarehousingIntroduction to Data Mining and Data Warehousing
Introduction to Data Mining and Data Warehousing
 
Functions in Python
Functions in PythonFunctions in Python
Functions in Python
 

Recently uploaded

How to Create a Stage or a Pipeline in Odoo 17 CRM
How to Create a Stage or a Pipeline in Odoo 17 CRMHow to Create a Stage or a Pipeline in Odoo 17 CRM
How to Create a Stage or a Pipeline in Odoo 17 CRM
Celine George
 
Non-Verbal Communication for Tech Professionals
Non-Verbal Communication for Tech ProfessionalsNon-Verbal Communication for Tech Professionals
Non-Verbal Communication for Tech Professionals
MattVassar1
 
8+8+8 Rule Of Time Management For Better Productivity
8+8+8 Rule Of Time Management For Better Productivity8+8+8 Rule Of Time Management For Better Productivity
8+8+8 Rule Of Time Management For Better Productivity
RuchiRathor2
 
Erasmus + DISSEMINATION ACTIVITIES Croatia
Erasmus + DISSEMINATION ACTIVITIES CroatiaErasmus + DISSEMINATION ACTIVITIES Croatia
Erasmus + DISSEMINATION ACTIVITIES Croatia
whatchangedhowreflec
 
Science-9-Lesson-1-The Bohr Model-NLC.pptx pptx
Science-9-Lesson-1-The Bohr Model-NLC.pptx pptxScience-9-Lesson-1-The Bohr Model-NLC.pptx pptx
Science-9-Lesson-1-The Bohr Model-NLC.pptx pptx
Catherine Dela Cruz
 
220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology
Kalna College
 
(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"
MJDuyan
 
Library news letter Kitengesa Uganda June 2024
Library news letter Kitengesa Uganda June 2024Library news letter Kitengesa Uganda June 2024
Library news letter Kitengesa Uganda June 2024
Friends of African Village Libraries
 
How to stay relevant as a cyber professional: Skills, trends and career paths...
How to stay relevant as a cyber professional: Skills, trends and career paths...How to stay relevant as a cyber professional: Skills, trends and career paths...
How to stay relevant as a cyber professional: Skills, trends and career paths...
Infosec
 
78 Microsoft-Publisher - Sirin Sultana Bora.pptx
78 Microsoft-Publisher - Sirin Sultana Bora.pptx78 Microsoft-Publisher - Sirin Sultana Bora.pptx
78 Microsoft-Publisher - Sirin Sultana Bora.pptx
Kalna College
 
What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17
Celine George
 
IoT (Internet of Things) introduction Notes.pdf
IoT (Internet of Things) introduction Notes.pdfIoT (Internet of Things) introduction Notes.pdf
IoT (Internet of Things) introduction Notes.pdf
roshanranjit222
 
Opportunity scholarships and the schools that receive them
Opportunity scholarships and the schools that receive themOpportunity scholarships and the schools that receive them
Opportunity scholarships and the schools that receive them
EducationNC
 
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptxCapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapitolTechU
 
Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024
khabri85
 
220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx
Kalna College
 
Interprofessional Education Platform Introduction.pdf
Interprofessional Education Platform Introduction.pdfInterprofessional Education Platform Introduction.pdf
Interprofessional Education Platform Introduction.pdf
Ben Aldrich
 
Diversity Quiz Finals by Quiz Club, IIT Kanpur
Diversity Quiz Finals by Quiz Club, IIT KanpurDiversity Quiz Finals by Quiz Club, IIT Kanpur
Diversity Quiz Finals by Quiz Club, IIT Kanpur
Quiz Club IIT Kanpur
 
nutrition in plants chapter 1 class 7...
nutrition in plants chapter 1 class 7...nutrition in plants chapter 1 class 7...
nutrition in plants chapter 1 class 7...
chaudharyreet2244
 
220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science
Kalna College
 

Recently uploaded (20)

How to Create a Stage or a Pipeline in Odoo 17 CRM
How to Create a Stage or a Pipeline in Odoo 17 CRMHow to Create a Stage or a Pipeline in Odoo 17 CRM
How to Create a Stage or a Pipeline in Odoo 17 CRM
 
Non-Verbal Communication for Tech Professionals
Non-Verbal Communication for Tech ProfessionalsNon-Verbal Communication for Tech Professionals
Non-Verbal Communication for Tech Professionals
 
8+8+8 Rule Of Time Management For Better Productivity
8+8+8 Rule Of Time Management For Better Productivity8+8+8 Rule Of Time Management For Better Productivity
8+8+8 Rule Of Time Management For Better Productivity
 
Erasmus + DISSEMINATION ACTIVITIES Croatia
Erasmus + DISSEMINATION ACTIVITIES CroatiaErasmus + DISSEMINATION ACTIVITIES Croatia
Erasmus + DISSEMINATION ACTIVITIES Croatia
 
Science-9-Lesson-1-The Bohr Model-NLC.pptx pptx
Science-9-Lesson-1-The Bohr Model-NLC.pptx pptxScience-9-Lesson-1-The Bohr Model-NLC.pptx pptx
Science-9-Lesson-1-The Bohr Model-NLC.pptx pptx
 
220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology220711130097 Tulip Samanta Concept of Information and Communication Technology
220711130097 Tulip Samanta Concept of Information and Communication Technology
 
(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"(T.L.E.) Agriculture: "Ornamental Plants"
(T.L.E.) Agriculture: "Ornamental Plants"
 
Library news letter Kitengesa Uganda June 2024
Library news letter Kitengesa Uganda June 2024Library news letter Kitengesa Uganda June 2024
Library news letter Kitengesa Uganda June 2024
 
How to stay relevant as a cyber professional: Skills, trends and career paths...
How to stay relevant as a cyber professional: Skills, trends and career paths...How to stay relevant as a cyber professional: Skills, trends and career paths...
How to stay relevant as a cyber professional: Skills, trends and career paths...
 
78 Microsoft-Publisher - Sirin Sultana Bora.pptx
78 Microsoft-Publisher - Sirin Sultana Bora.pptx78 Microsoft-Publisher - Sirin Sultana Bora.pptx
78 Microsoft-Publisher - Sirin Sultana Bora.pptx
 
What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17What are the new features in the Fleet Odoo 17
What are the new features in the Fleet Odoo 17
 
IoT (Internet of Things) introduction Notes.pdf
IoT (Internet of Things) introduction Notes.pdfIoT (Internet of Things) introduction Notes.pdf
IoT (Internet of Things) introduction Notes.pdf
 
Opportunity scholarships and the schools that receive them
Opportunity scholarships and the schools that receive themOpportunity scholarships and the schools that receive them
Opportunity scholarships and the schools that receive them
 
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptxCapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
CapTechTalks Webinar Slides June 2024 Donovan Wright.pptx
 
Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024Brand Guideline of Bashundhara A4 Paper - 2024
Brand Guideline of Bashundhara A4 Paper - 2024
 
220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx220711130088 Sumi Basak Virtual University EPC 3.pptx
220711130088 Sumi Basak Virtual University EPC 3.pptx
 
Interprofessional Education Platform Introduction.pdf
Interprofessional Education Platform Introduction.pdfInterprofessional Education Platform Introduction.pdf
Interprofessional Education Platform Introduction.pdf
 
Diversity Quiz Finals by Quiz Club, IIT Kanpur
Diversity Quiz Finals by Quiz Club, IIT KanpurDiversity Quiz Finals by Quiz Club, IIT Kanpur
Diversity Quiz Finals by Quiz Club, IIT Kanpur
 
nutrition in plants chapter 1 class 7...
nutrition in plants chapter 1 class 7...nutrition in plants chapter 1 class 7...
nutrition in plants chapter 1 class 7...
 
220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science220711130082 Srabanti Bag Internet Resources For Natural Science
220711130082 Srabanti Bag Internet Resources For Natural Science
 

Data Preprocessing

  • 1. Unit: 2 Data Preprocessing
  • 2. Outline of the chapter • Data types and attribute types • Data pre- processing • OLAP • Characteristics of OLAP Systems • Multidimensional views and data cubes • Data cube implementations • Data cube operations • Guidelines for OLAP Implementation. July 2, 2019 2Compiled By: Kamal Acharya
  • 3. 2.1. Data types and attribute types • Outline: – Attributes and Objects – Types of Data July 2, 2019 3Compiled By: Kamal Acharya
  • 4. What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object – Examples: eye color of a person, temperature, etc. – Attribute is also known as variable, field, characteristic, dimension, or feature • A collection of attributes describe an object – Object is also known as record, point, case, sample, entity, or instance Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Attributes/ dimension Objects
  • 5. Attribute Values • Attribute values are numbers or symbols assigned to an attribute for a particular object • Distinction between attributes and attribute values – Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters – Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different July 2, 2019 5Compiled By: Kamal Acharya
  • 6. Types of Attributes • The type of attribute is determined by the set of possible values the attribute can have. • There are different types of attributes: – Nominal attributes – Binary attributes – Ordinal attributes – Numeric attributes • Interval-scaled attributes • Ratio-scaled attributes July 2, 2019 6Compiled By: Kamal Acharya
  • 7. Contd.. • Nominal Attributes: – Nominal means “relating to names.” – The values of a nominal attribute are symbols or names of things. – Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. – The values do not have any meaningful order. – E.g., : Hair_color = { black, brown, grey, red, white, etc} Marital _status= { single, married, divorced} July 2, 2019 7Compiled By: Kamal Acharya
  • 8. Contd.. • Binary Attributes: – A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. – Symmetric binary: both outcomes equally important • e.g., gender = {male ,female} – Asymmetric binary: outcomes not equally important. • e.g., medical test (positive vs. negative) • Convention: assign 1 to most important outcome(rarest) (e.g., HIV positive) and other by 0(e.g., HIV negative) July 2, 2019 8Compiled By: Kamal Acharya
  • 9. Contd.. • Ordinal Attribute: – An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known. – E.g.: suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This nominal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a medium is than a large. July 2, 2019 9Compiled By: Kamal Acharya
  • 10. Contd.. • Numeric Attributes: – It is a measurable quantity, represented in integer or real values. – Numeric attributes can be interval-scaled or ratio-scaled. – Interval-scaled : • Interval-scaled attributes are measured on a scale of equal-size units. • E.g.: calendar dates, temperature in Celsius – Ratio-scaled attributes: • a value as being a multiple (or ratio) of another value. • it has a zero point or character of origin • Ratio are meaningful • examples are height, weight, money, age July 2, 2019 10Compiled By: Kamal Acharya
  • 11. Discrete vs. Continuous Attributes • There are many ways to organize attribute types. • Discrete Attribute – Has only a finite or countably infinite set of values • E.g., Roll number, the set of words in a collection of documents – Note: Binary attributes are a special case of discrete attributes • Continuous Attribute – Has real numbers as attribute values • E.g., temperature, Speed, etc – Continuous attributes are typically represented as floating-point variables July 2, 2019 11Compiled By: Kamal Acharya
  • 12. Properties of Attribute Values • Distinctness: = ≠ • Order: < > • Addition: + - • Multiplication: * / – Nominal attribute: distinctness – Ordinal attribute: distinctness & order – Interval attribute: distinctness, order & addition – Ratio attribute: all 4 properties July 2, 2019 Compiled By: Kamal Acharya 12
  • 13. Types of data sets • Record – Data Matrix – Document Data – Transaction Data • Graph – World Wide Web – Generic graph – Social or information networks • Ordered – Spatial Data – Temporal Data – Sequential Data July 2, 2019 13Compiled By: Kamal Acharya
  • 14. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 July 2, 2019 14Compiled By: Kamal Acharya
  • 15. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute 1.12.216.226.2512.65 1.22.715.225.2710.23 ThicknessLoadDistanceProjection of y load Projection of x Load 1.12.216.226.2512.65 1.22.715.225.2710.23 ThicknessLoadDistanceProjection of y load Projection of x Load July 2, 2019 15Compiled By: Kamal Acharya
  • 16. Document Data • Each document becomes a ‘term’ vector – Each term is a component (attribute) of the vector – The value of each component is the number of times the corresponding term occurs in the document. Document 1 season timeout lost win game score ball play coach team Document 2 Document 3 3 0 5 0 2 6 0 2 0 2 0 0 7 0 2 1 0 0 3 0 0 1 0 0 1 2 2 0 3 0 July 2, 2019 16Compiled By: Kamal Acharya
  • 17. Transaction Data • A special type of record data, where – Each record (transaction) involves a set of items. – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items. TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk July 2, 2019 17Compiled By: Kamal Acharya
  • 18. Graph Data • Examples: – Generic graph – World-wide web – Social or information networks 5 2 1 2 5 July 2, 2019 18Compiled By: Kamal Acharya
  • 19. Ordered Data • Video data: sequence of images • Temporal data: time-series • Sequential Data: transaction sequences July 2, 2019 19Compiled By: Kamal Acharya
  • 20. 2.2. Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary July 2, 2019 20Compiled By: Kamal Acharya
  • 21. Why Data Preprocessing? • Data in the real world is dirty – incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data • e.g., occupation=“ ” – noisy: containing errors or outliers • e.g., Salary=“-10” – inconsistent: containing discrepancies in codes or names • e.g., Age=“42” Birthday=“03/07/1997” • e.g., Was rating “1, 2, 3”, now rating “A, B, C” July 2, 2019 21Compiled By: Kamal Acharya
  • 22. Why Is Data Preprocessing Important? • To make data more suitable for data mining. • To improve the data mining analysis with respect to time, cost and quality. • No quality data, no quality mining results! – Quality decisions must be based on quality data – Data mining example: • a classification model for detecting people who are loan risks is built using poor data – Some credit-worthy candidates are denied loans – More loans are given to individuals that default July 2, 2019 22Compiled By: Kamal Acharya
  • 23. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, or files • Data transformation – Normalization and aggregation • Data reduction – Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization – Part of data reduction but with particular importance, especially for numerical data July 2, 2019 23Compiled By: Kamal Acharya
  • 24. Contd.. July 2, 2019 24Compiled By: Kamal Acharya
  • 25. Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary July 2, 2019 25Compiled By: Kamal Acharya
  • 26. Data Cleaning • If data is dirty(incomplete, noisy, inconsistent) then: – Users can not trust any results of data mining – Can cause confusion for the data mining procedure, resulting in unreliable output. • So, data cleaning is required. • To clean data the following data cleaning tasks are performed: – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data July 2, 2019 26Compiled By: Kamal Acharya
  • 27. Missing Data • Data is not always available – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to: – equipment malfunction – inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data may not be considered important at the time of entry • Missing data may need to be inferred. July 2, 2019 27Compiled By: Kamal Acharya
  • 28. How to Handle Missing Data? • Ignore the tuple: – usually done when class label is missing • Fill in the missing value manually: – tedious + infeasible? • Use a global constant to fill in the missing value: – e.g., “unknown” • Use the attribute mean to fill in the missing value • Use the most probable value to fill in the missing value: July 2, 2019 28Compiled By: Kamal Acharya
  • 29. Noisy Data • Noise: – random error or variance in a measured variable • Noise (Incorrect attribute) values may due to – faulty data collection instruments – data entry problems – data transmission problems – technology limitation – inconsistency in naming convention July 2, 2019 29Compiled By: Kamal Acharya
  • 30. How to Handle Noisy Data? • Binning method: – first sort data and partition into bins – then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. • Clustering – detect and remove outliers • Combined computer and human inspection – detect doubtful values and check by human • Regression – smooth by fitting the data into regression functions July 2, 2019 30Compiled By: Kamal Acharya
  • 31. Binning • Three step process: – Sort the data – Make the bins by partitioning – Smooth the data in each bins July 2, 2019 31Compiled By: Kamal Acharya
  • 32. Contd… • Partitioning techniques to make bins: – Equal-width (distance) partitioning: – Equal-depth (frequency) partitioning July 2, 2019 32Compiled By: Kamal Acharya
  • 33. Contd… – Equal-width (distance) partitioning: • It divides the range into N intervals of equal size • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. – Equal-depth (frequency) partitioning: • It divides the range into N intervals, each containing approximately same number of samples July 2, 2019 33Compiled By: Kamal Acharya
  • 34. Example: Binning Methods for Data Smoothing * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 July 2, 2019 34Compiled By: Kamal Acharya
  • 35. Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary July 2, 2019 35Compiled By: Kamal Acharya
  • 36. Data Integration • Data integration combines data from multiple sources into a coherent store(e.g., DW). • Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. • This can help improve the accuracy and speed of the subsequent data mining process. July 2, 2019 36Compiled By: Kamal Acharya
  • 37. Contd.. • But, Entity identification problem: • for the same real world entity, attribute values from different sources are different. • For example: how can the data analyst or the computer be sure that customer id in one database and cust_number in another refer to the same attribute? • Solution: The Meta data can be used to help the transformation of data July 2, 2019 37Compiled By: Kamal Acharya
  • 38. Contd.. • Handling Redundant Data: • Redundant data occur often when integration of multiple databases – The same attribute may have different names in different databases Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality July 2, 2019 38Compiled By: Kamal Acharya
  • 39. Data Transformation • In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand. • Data Transformation Strategies: – Smoothing – Attribute construction – Aggregation – Normalization – Discretization – Concept hierarchy generation July 2, 2019 39Compiled By: Kamal Acharya
  • 40. Contd.. • Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering. • Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process. • Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. July 2, 2019 40Compiled By: Kamal Acharya
  • 41. Contd.. • Normalization, where the attribute data are scaled so as to fall within a smaller range, such as: -1.0 to 1.0, or 0.0 to 1.0. • Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). • Concept hierarchy generation, where attributes such as street can be generalized to higher-level concepts, like city or country. July 2, 2019 41Compiled By: Kamal Acharya
  • 42. Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary July 2, 2019 42Compiled By: Kamal Acharya
  • 43. Data Reduction • Warehouse may store terabytes of data: – Complex data mining may take a very long time to run on the complete data set. • Data reduction – Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results. July 2, 2019 43Compiled By: Kamal Acharya
  • 44. Data Reduction Strategies • Data reduction strategies – Data cube aggregation – Dimensionality reduction – Histograms – clustering – sampling – Discretization and concept hierarchy generation July 2, 2019 44Compiled By: Kamal Acharya
  • 45. Contd.. • Data Cube Aggregation: – Data for sales per quarter, for the years 2008 to 2010. – interested in the annual sales (total per year), rather than the total per quarter. – Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. – The resulting data set is smaller in volume, without loss of information necessary for the analysis task. – This aggregation is illustrated in Figure below. July 2, 2019 45Compiled By: Kamal Acharya
  • 46. Contd.. – Dimensionality Reduction: Feature selection (i.e., attribute subset selection): • Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features • reduce # of patterns in the patterns, easier to understand July 2, 2019 46Compiled By: Kamal Acharya
  • 47. Histograms • A popular data reduction technique • Divide data into buckets and store average (sum) for each bucket 0 5 10 15 20 25 30 35 40 10000 30000 50000 70000 90000 July 2, 2019 47Compiled By: Kamal Acharya
  • 48. Clustering • Partition data set into clusters, and one can store cluster representation only. July 2, 2019 48Compiled By: Kamal Acharya
  • 49. Sampling • Sampling is the main technique employed for data reduction. – It is often used for both the preliminary investigation of the data and the final data analysis. • Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming. • Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming. July 2, 2019 49Compiled By: Kamal Acharya
  • 50. Contd… • The key principle for effective sampling is the following: – Using a sample will work almost as well as using the entire data set, if the sample is representative – A sample is representative if it has approximately the same properties (of interest) as the original set of data July 2, 2019 50Compiled By: Kamal Acharya
  • 51. Contd.. • Discretization – reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. – Interval labels can then be used to replace actual data values. • Concept hierarchies – reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior). July 2, 2019 51Compiled By: Kamal Acharya
  • 52. Summary • Data preparation is a big issue for both warehousing and mining • Data preparation includes – Data cleaning and data integration – Data reduction and feature selection – Discretization • A lot of methods have been developed but still an active area of research July 2, 2019 52Compiled By: Kamal Acharya
  • 53. Data Warehouse: A Multi-Tiered Architecture Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Monitor & Integrator Metadata Data Sources Front-End Tools Serve Data Marts Operational DBs Other sources Data Storage OLAP Server July 2, 2019 53Compiled By: Kamal Acharya
  • 54. OLAP • OLAP is a software technology concerned with fast analysis of enterprise information. • Often OLAP systems are data warehouse front end software tools to make aggregate data available efficiently to an enterprise’s decision makers (analysts, managers and executives). • Major OLAP applications are trend analysis over a number of time periods, slicing, dicing , drill-down and roll-up to look at the data at different levels of detail and pivoting or rotating to obtain a new multidimensional view. July 2, 2019 54Compiled By: Kamal Acharya
  • 55. Characteristics of OLAP Systems • Comparison between OLTP and OLAP systems: – This comparison highlights some of the characteristics of OLAP systems. – The difference between the two types of systems are as follows: – Users: • OLTP systems are designed for office workers while the OLAP systems are designed for decision makers. Therefore , while an OLTP system may be accessed by hundreds or even thousands of users in a large enterprise, an OLAP system is likely to be accessed only by a selected group of managers and may be used by dozens of users. July 2, 2019 55Compiled By: Kamal Acharya
  • 56. Contd.. • Functions: – OLTP systems are mission-critical (vital to the functioning of an organization.). These systems carry out simple repetitive operations. – OLAP systems on the other hand are management critical to support an enterprise's decision support functions using analytical investigations. These are ad-hoc and often much more complex operations. July 2, 2019 56Compiled By: Kamal Acharya
  • 57. Contd.. • Nature: – Nature of queries in OLTP system is simple – Nature of queries in OLAP system is complex – Nature of usage of OLTP system is repetitive – Nature of usage of OLAP system is mostly ad hoc July 2, 2019 57Compiled By: Kamal Acharya
  • 58. Contd.. • Design: – OLTP systems are designed to be application-oriented while OLAP systems are designed to be subject-oriented. – OLTP systems view the enterprise data as a collection of tables while OLAP systems view enterprise information as multidimensional. July 2, 2019 58Compiled By: Kamal Acharya
  • 59. Contd.. • Data: – OLTP systems normally deal only with the current status of information. – On the other hand, OLAP systems require historical data over several years since trend are often important in decision making. July 2, 2019 59Compiled By: Kamal Acharya
  • 60. Contd.. • Kinds of use: – OLTP systems are used for read and write operations while OLAP systems normally do not update the data but refresh the data. July 2, 2019 60Compiled By: Kamal Acharya
  • 61. Contd.. • Other features that distinguish between OLTP and OLAP system are summarized in the following table: July 2, 2019 61Compiled By: Kamal Acharya
  • 62. FASMI Characteristics of OLAP systems • The FASMI characteristics of OLAP systems, the name derived from the first letters of the characteristics, are: – Fast – Analytic – Shared – Multidimensional – Information July 2, 2019 62Compiled By: Kamal Acharya
  • 63. Contd.. • Fast: – OLAP queries should be answered very quickly, perhaps within seconds. – To achieve such performance: • the data structure must be efficient and the hardware must be powerful. • Full pre-computation of aggregates • Pre-compute the most commonly queried aggregates. July 2, 2019 63Compiled By: Kamal Acharya
  • 64. Contd.. • Analytic: – An OLAP system must provide rich analytic functionality and it is expected that most OLAP queries can be answered without any programming. – The system should be able to cope with any relevant queries for the application and the user. July 2, 2019 64Compiled By: Kamal Acharya
  • 65. Contd.. • Shared: – An OLAP system is a shared resource although it is unlikely to be shared by hundreds of users. – An OLAP system is likely to be accessed only by a selected group of managers and may be use by mere dozens of users. – Being a shared system, an OLAP system should provide adequate security for confidentiality as well as integrity. July 2, 2019 65Compiled By: Kamal Acharya
  • 66. Contd.. • Multidimensional: – This is the basic requirement. – Whatever OLAP software is being used, it must provide a multidimensional conceptual view of the data. July 2, 2019 66Compiled By: Kamal Acharya
  • 67. Contd.. • Information: – OLAP systems usually obtain information from a data warehouse. – The system should be able to handle a large amount of input data. – The capacity of an OLAP system to handle information and its integration with the data warehouse may be critical. July 2, 2019 67Compiled By: Kamal Acharya
  • 68. Codd’s OLAP characteristics • The most important characteristics of OLAP systems provided by the Codd are as follows: – Multidimensional conceptual view – Accessibility(OLAP as a mediator) – Batch extraction vs interpretive – Multi-user support – Storing OLAP result – Extraction of missing values – Treatment of missing values – Uniform reporting performance – Generic dimensionality – Unlimited dimensions and aggregation levels July 2, 2019 68Compiled By: Kamal Acharya
  • 69. Contd.. • Multidimensional conceptual view: – By requiring a multidimensional view, it is possible to carry out operations like slice and dice. • Accessibility (OLAP as a mediator): – The OLAP software should be sitting between data sources(e..g., a data warehouse) and an OLAP front- end. July 2, 2019 69Compiled By: Kamal Acharya
  • 70. Contd.. • Batch extraction versus interpretive: – An OLAP system should provide multidimensional data staging plus partial pre-calculation of aggregates in large multidimensional databases. • Multi- user support: – Since the OLAP system is shared, the OLAP software should provide many normal database operations including retrieval, update, concurrency control, integrity and security. July 2, 2019 70Compiled By: Kamal Acharya
  • 71. Contd.. • Storing OLAP results: – OLAP results data should be kept separate from source data. • Extraction of missing values: – The OLAP system should distinguish missing values form zero values. – A large data cube may have a large number of zeros as well as some missing values. – If a distinction is not made between zero values and missing values, the aggregates are likely to be computed incorrectly. July 2, 2019 71Compiled By: Kamal Acharya
  • 72. Contd.. • Treatment of missing values: – An OLAP system should ignore all missing values regardless of their source. – Correct aggregate values will be computed once the missing values are ignored. • Uniform reporting performance: – Increasing the number of dimensions or database size should not significantly degrade the reporting performance of the OLAP system. – This is good objective although it may be difficult to achieve in practice. July 2, 2019 72Compiled By: Kamal Acharya
  • 73. Contd.. • Generic dimensionality: – An OLAP system should treat each dimension as equivalent in both its structure and operational capabilities. Additional operational capabilities may be granted to be selected dimensions but such additional functions should be grantable to be any dimension • Unlimited dimensions and aggregation levels: – An OLAP system should allow unlimited dimensions and aggregations and aggregation levels. – but In practice, this is undesirable. July 2, 2019 73Compiled By: Kamal Acharya
  • 74. Example OLAPApplications • Understanding and improving sales: – OLAP can assist in finding the most popular products and the most popular channels for selling the products. • E,g., Findwhich itemsare frequentlysoldover the summerbut not overwinter? – OLAP can assist in finding most profitable customers. July 2, 2019 74Compiled By: Kamal Acharya
  • 75. Contd.. • Understanding and reducing costs of doing business: – Improving sales is one aspect of improving business, the other aspect is to analyze costs and to control them as much as possible without affecting sales. – OLAP can assist in analyzing the costs associated with sales. – In some cases, it may also be possible to identify expenditures that produce a high return on investment. July 2, 2019 75Compiled By: Kamal Acharya
  • 76. Contd.. • CreditCard Companies: • Given a new applicant, does (s)hea credit-worthy? • Need to check other similarapplicants (age, gender, income, etc…) and observehow theyperform,then do prediction for newapplicant July 2, 2019 76Compiled By: Kamal Acharya
  • 77. Multi-dimensional views and Data cubes • Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. • What is a data cube? – A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. July 2, 2019 77Compiled By: Kamal Acharya
  • 78. Contd.. • Dimensions: – dimensions are the perspectives or entities with respect to which an organization wants to keep records. – For example, AllElectronics may create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions time, item, branch, and location. – These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. July 2, 2019 78Compiled By: Kamal Acharya
  • 79. Contd.. • Facts: – Facts are numeric measures. • i.e., quantities by which we want to analyze relationships between dimensions. – Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted. July 2, 2019 79Compiled By: Kamal Acharya
  • 80. Contd.. • usually cubes are 3-D geometric structures, but in data warehousing the data cube is n-dimensional. • a simple 2-D data cube: a table or spreadsheet • E.g., July 2, 2019 80Compiled By: Kamal Acharya
  • 81. Contd.. • 3-D data cube: a set of similarly structured 2-D tables stacked on top of one another. • E.g., July 2, 2019 81Compiled By: Kamal Acharya
  • 82. Contd.. • The 3-D data in the table are represented as a series of 2-D tables called 3-D data cube, as in Figure below. • Fig: A 3-D data cube representation of the data in Table previous slide, according to time, item, and location. July 2, 2019 82Compiled By: Kamal Acharya
  • 83. Contd.. • 4-D cubes: a 4-D cube is a series of 3-D cubes, as shown in Figure below: • in this way, we may display any n-dimensional data as a series of (n-1)- Dimensional “cubes.” • Note: The data cube is a metaphor for multidimensional data storage. The actual physical storage of such data may differ from its logical representation. data cubes are n-dimensional and do not confine data to 3-D. July 2, 2019 83Compiled By: Kamal Acharya
  • 84. Data Cube implementation • Efficient data cube computation: – No Materialization – Full Materialization – Partial Materialization • Access methods: How OLAP data can be indexed(Bit map and join indices) • Query processing technique • OLAP server types – ROLAP – MOLAP – HOLAP July 2, 2019 84Compiled By: Kamal Acharya
  • 85. Computation of Data Cubes • Data warehouses contain huge volumes of data. • OLAP servers demand that decision support queries to answered in the order of seconds. • It is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods and query processing techniques.
  • 86. Efficient Computation of Data Cubes • Cuboids: – Data at different degree of summarization/ aggregations is often referred to as a cuboid. – Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. – The result would form a lattice of cuboids, each showing the data at a different level of summarization(or group-by/aggregation). – The lattice of cuboids is then referred to as a data cube. July 2, 2019 86Compiled By: Kamal Acharya
  • 87. Contd.. • Example: – Suppose that you would like to create a data cube for ALLElectronics sales that contains : city, item, year, as the dimensions for the data cube and sales_in_dollars as the measure. – The possible group-by’s are the following: • {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ( )}, • where ( ) means that the group-by is empty (i.e., the dimensions are not grouped). – These group-by’s form a lattice of cuboids for the data cube, as shown in Figure below. July 2, 2019 87Compiled By: Kamal Acharya
  • 88. Contd.. July 2, 2019 88Compiled By: Kamal Acharya
  • 89. Contd.. • You would like to be able to analyze the data, with queries such as following: • “Compute the sum of sales, grouping by city and item.” • “Compute the sum of sales, grouping by city.” • “Compute the sum of sales, grouping by item.” July 2, 2019 89Compiled By: Kamal Acharya
  • 90. Contd.. • Special types of Cuboids: – Base cuboid: The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any combination of the three dimensions. The base cuboid is the least generalized (most specific) of the cuboids. – Apex cuboid: The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains the total sum of all sales. The apex cuboid is the most generalized (least specific) of the cuboids, and is often denoted as all. July 2, 2019 90Compiled By: Kamal Acharya
  • 91. • OLAP may need to access different cuboids for different queries. • a good idea : • compute all or at least some of the cuboids in a data cube in advance. • Pre-computation leads to fast response time and • avoids some redundant computation. • But, required storage space may explode (due to pre-computation of all cuboid, large number of dimensions and large number of concept hierarchies of dimensins) • This problem is referred to as the curse of dimensionality Curse of Dimensionality
  • 92. There are three choices for data cube materialization(computation of cuboids) given a base cuboid: 1. No Materialization 2. Full Materialization 3. Partial Materialization Data cube Materialization
  • 93. •No Materialization Do not pre-compute any of the “non-base” cuboid. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow. Contd..
  • 94. •Full Materialization •Pre-compute all of the cuboids. •The resulting lattice of computed cuboids is referred to as the full cube. •This choice typically requires huge amounts of memory space in order to store all of the pre-computed cuboids. Contd..
  • 95. •Partial Materialization Selectively compute a proper subset of the whole set of possible cuboids. It represents an interesting trade-off between storage space and response time. The partial materialization of cuboids or sub-cubes should consider three factors:  Identify the subset of cuboids or sub-cubes to materialize  Exploit the materialized cuboids or sub-cubes during query processing  Efficiently update the materialized cuboid or sub-cubes during load and refresh. Contd..
  • 96. Indexing OLAP DATA • To facilitate efficient data accessing to further speed up query processing. • Two most commonly used methods • The Bitmap indexing method and • Join indexing method
  • 97. • Bitmap Indexing: • In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. • If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0. • Bitmap indexing reduces join, aggregation, and comparison operations to bit arithmetic. Contd..
  • 98. Figure below shows a base (data) table containing the dimensions item and city, and its mapping to bitmap index tables for each of the dimensions. Contd..
  • 99. • Join indexing: • The join indexing method gained popularity from its use in relational database query processing. • Join indexing registers the joinable rows of two relations from a relational database. • Hence, the join index records can identify joinable tuples without performing costly join operations. Contd..
  • 100. Example: join index relationship between the sales fact table and the location and item dimension tables is shown in figure below Contd.. Here, the “Main Street” value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the sales fact table.
  • 101. Contd.. The corresponding join index tables are shown in Figure below.
  • 102. •The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows: 1. Determine which operations should be performed on the available cuboids. 2. Determine to which materialized cuboid(s) the relevant operations should be applied. Efficient Processing of OLAP Queries
  • 103. Types of OLAP Servers • OLAP servers present business users with multidimensional data from data warehouses, without concerns regarding how or where the data are stored. • However, the physical architecture and implementation of OLAP servers must consider data storage issues. • Implementations of a warehouse server for OLAP processing include the following:  Relational OLAP (ROLAP)  Multidimensional OLAP (MOLAP)  Hybrid OLAP (HOLAP) July 2, 2019 103Compiled By: Kamal Acharya
  • 104. Contd.. • Relational OLAP (ROLAP) Server: – These are the intermediate servers that stand in between a relational back- end server and client front-end tools. – They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. – ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. – ROLAP technology tends to have greater scalability than MOLAP technology. July 2, 2019 104Compiled By: Kamal Acharya
  • 105. Contd.. • Multidimensional OLAP (MOLAP) Server: – These servers supports multidimensional views of data through array-based multidimensional storage engines. – They map multidimensional views directly to data cube array structures. – The advantages of using a data cube is that it allows fast indexing to pre- computed summarized data. – In multidimensional data stores, the storage utilization may be low if the data set is sparse. July 2, 2019 105Compiled By: Kamal Acharya
  • 106. Contd.. • Hybrid OLAP (HOLAP) Servers: • The hybrid OLAP approach combines ROLAP and MOLAP technology. • Benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. July 2, 2019 106Compiled By: Kamal Acharya
  • 107. Contd.. • MOLAP vs. ROLAP: July 2, 2019 107Compiled By: Kamal Acharya MOLAP ROLAP Information retrieval is fast. Information retrieval is comparatively slow. Uses sparse array to store data-sets. Uses relational table. MOLAP is best suited for inexperienced users, since it is very easy to use. ROLAP is best suited for experienced users. Maintains a separate database for data cubes. It may not require space other than available in the Data warehouse. DBMS facility is weak. DBMS facility is strong.
  • 108. Data Cube operations • A number of operations may be applied to data cubes for OLAP. • Data cube operations are also known as OLAP operations. The common ones are: – Slice – dice – Roll-up(Drill-up) – Drill-down(Roll-down) – Pivot(Rotate) July 2, 2019 108Compiled By: Kamal Acharya
  • 109. Contd.. • The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types. – The measure displayed is dollars sold (in thousands). – The data examined are for the cities Chicago, New York, Toronto, and Vancouver. July 2, 2019 109Compiled By: Kamal Acharya A data cube for AllElectronics sales to illustrate data cube operation:
  • 110. Contd.. • Slice: – Slice operation performs a selection on one dimension of the given cube, thus creates subset a cube. – Below example depicts how slice operation works- July 2, 2019 110Compiled By: Kamal Acharya
  • 111. Contd.. • Dice: – Dice operation performs a selection on two or more dimension from a given cube and creates a sub- cube. – Below example depicts how Dice operation works- July 2, 2019 111Compiled By: Kamal Acharya
  • 112. Contd..• Roll-up(Drill-up): – The roll-up operation performs aggregation on a data cube, either : • by climbing up a concept hierarchy for a dimension or • by dimension reduction. – Below example depicts how roll-up operation works- July 2, 2019 112Compiled By: Kamal Acharya
  • 113. Contd..• Drill-down(Roll-down): – Drill-down is the reverse operation of roll-up. It is performed by either of the following ways: • By stepping down a concept hierarchy for a dimension • By introducing a new dimension. – It allows users to navigate among different levels of data i.e. most summarized (up) to most details (down). – Below example depicts how Drill- down operation works July 2, 2019 113Compiled By: Kamal Acharya
  • 114. Contd.. • Pivot: – Pivot also known as rotation changes the dimensional rotation of the cube, i.e. rotates the axes to view the data from different perspectives. The below cubes shows 2D representation of Pivot July 2, 2019 114Compiled By: Kamal Acharya
  • 115. Guidelines for OLAP implementation • A number of Guidelines for successful implementation of OLAP are as follows: – Vision – Senior management support – Selecting an OLAP tool – Corporate strategy – Focus on the users – Joint management – Review and adapt July 2, 2019 115Compiled By: Kamal Acharya
  • 116. Contd.. • Vision: – The OLAP team must, in consultation with the users, develop a clear vision for the OLAP system. This vision including the business objectives should be clearly defined, understood, and shared by the stakeholders. • Senior management support: – The OLAP project should fully supported by the senior managers, since a data warehouse may have been developed already this should not be difficult. July 2, 2019 116Compiled By: Kamal Acharya
  • 117. Contd.. • Selecting an OLAP tool: – The OLAP team should familiarize themselves with the ROLP and MOLAP tools available in the market. Since tools are quite different, careful planning may be required in selecting a tool that is appropriate for the enterprise. In some situations, a combination of ROLAP and MOLAP may be most effective. • Corporate strategy: – The OLAP strategy should fit with the enterprise strategy and business objectives. A good fit will result in the OLAP tools being used more widely. July 2, 2019 117Compiled By: Kamal Acharya
  • 118. Contd.. • Focus on users: – The OLAP project should be focused on users. Users should, in consultation with the technical professionals, decide what tasks will be done first and what will be done later. Attempts should be made to provide each user with a tool suitable for that person’s skill level and information needs. A good GUI user interface should be provided to non-technical users. The project can only be successful whit the full support of the users. July 2, 2019 118Compiled By: Kamal Acharya
  • 119. Contd.. • Joint Management: – The OLAP project must be managed by both the IT and business professional. Many other people should be involved in supplying ideas. An appropriate committee structure may be necessary to channel these ideas • Review and adapt: – Organizations evolve and so must be OLAP system. Regular reviews of the project may be required to ensure that the project is meeting the current needs of the enterprise. July 2, 2019 119Compiled By: Kamal Acharya
  • 120. Data Mining vs. OLAP • OLAP - Online Analytical Processing – Provides you with a very good view of what is happening, but can not predict what will happen in the future or why it is happening. • Data Mining is a combination of discovering techniques + prediction techniques July 2, 2019 120Compiled By: Kamal Acharya
  • 121. Home Work • What are dimension, members, measure and fact table? • List the major difference between OLTP systems and OLAP systems. • What is OLAP and its purpose? List the characteristics of OLAP systems. • What is data cube and purpose of data cube? Use an example to illustrate the use of data cube. • What are ROLAP and MOLAP ?describe the two approaches and list their advantages and disadvantages. • Describe the operations(OLAP/ Cube operations) roll-up, drill-down, and slice and dice. • List the implementation guidelines for implementing OLAP. July 2, 2019 121Compiled By: Kamal Acharya
  • 122. Thank You ! Compiled By: Kamal Acharya 122July 2, 2019
  翻译: