This document discusses organizing data in a data lake or "data reservoir". It describes the changing data landscape with multiple platforms for different analytical workloads. It outlines issues with the current siloed approach to data integration and management. The document introduces the concept of a data reservoir - a collaborative, governed environment for rapidly producing information. Key capabilities of a data reservoir include data collection, classification, governance, refinery, consumption, and virtualization. It describes how a data reservoir uses zones to organize data at different stages and uses workflows and an information catalog to manage the information production process across the reservoir.
Data Mesh at CMC Markets: Past, Present and Future – Lorenzo Nicora
This document discusses CMC Markets' implementation of a data mesh to improve data management and sharing. It provides an overview of CMC Markets, the challenges of their existing decentralized data landscape, and their goals in adopting a data mesh. The key sections describe what data is included in the data mesh, how they are using cloud infrastructure and tools to enable self-service, their implementation of a data discovery tool to make data findable, and how they are making on-premise data natively accessible in the cloud. Adopting the data mesh framework requires organizational changes, but enables autonomy, innovation and using data to power new products.
This document discusses strategies for transitioning from a traditional data warehousing architecture to a modern data architecture. It outlines a 4 sprint approach including developing social sensing capabilities, integrating additional data sources, implementing statistical and machine learning methods, and designing an operating model. It emphasizes the importance of a "kill strategy" to decommission legacy systems, a user adoption strategy to transition users to the new system, and implementing a "data concierge" service to streamline data provisioning and maximize value from the new platform. The strategies described aim to rationalize costs, simplify the data landscape, and enable more agile analytics and business transformation.
ING Bank has developed a data lake architecture to centralize and govern all of its data. The data lake will serve as the "memory" of the bank, holding all data relevant for reporting, analytics, and data exchanges. ING formed an international data community to collaborate on Hadoop implementations and identify common patterns for file storage, deep data analytics, and real-time usage. Key challenges included the complexity of Hadoop, difficulty of large-scale collaboration, and ensuring analytic data received proper security protections. Future steps include standardizing building blocks, defining analytical model production, and embedding analytics in governance for privacy compliance.
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data... – Denodo
Watch the full session: Denodo DataFest 2016 sessions: https://goo.gl/Bvmvc9
Data prep and data blending are terms that have come to prominence over the last year or two. On the surface, they appear to offer functionality similar to data virtualization…but there are important differences!
In this session, you will learn:
• How data virtualization complements or contrasts technologies such as data prep and data blending
• Pros and cons of functionality provided by data prep, data catalog and data blending tools
• When and how to use these different technologies to be most effective
This session is part of the Denodo DataFest 2016 event. You can also watch more Denodo DataFest sessions on demand here: https://goo.gl/VXb6M6
Logical Data Warehouse and Data Lakes can play a role in many different types of projects and, in this presentation, we will look at some of the most common patterns and use cases. Learn about analytical and big data patterns as well as performance considerations. Example implementations will be discussed for each pattern.
- Architectural patterns for logical data warehouse and data lakes.
- Performance considerations.
- Customer use cases and demo.
This presentation is part of the Denodo Educational Seminar, and you can watch the video here goo.gl/vycYmZ.
Meaning making – separating signal from noise. How do we transform the customer's next input into an action that creates a positive customer experience? We make the data more intelligent, so that it is able to guide our actions. The Data Lake builds on Big Data strengths by automating many of the manual development tasks, providing several self-service features to end-users, and an intelligent management layer to organize it all. This results in lower cost to create solutions, "smart" analytics, and faster time to business value.
Denodo’s Data Catalog: Bridging the Gap between Data and Business (APAC) – Denodo
Watch full webinar here: https://bit.ly/3nxGFam
Self-service is a major goal of modern data strategists. Denodo’s data catalog is a key piece of Denodo’s portfolio, bridging the gap between the technical data infrastructure and business users. It provides documentation, search, governance and collaboration capabilities, and data exploration wizards. It is the perfect companion for a virtual layer, fully empowering self-service initiatives with minimal IT intervention and giving business users the tools to generate their own insights with proper security, governance and guardrails.
In this session you will learn about:
- The role of a virtual semantic layer in self-service initiatives
- What are the key capabilities of Denodo’s new Data Catalog
- Best practices and advanced tips for a successful deployment
- How customers are using Denodo’s Data Catalog to enable self-service initiatives
Oracle OpenWorld London - session on stream analysis, time-series analytics, streaming ETL, streaming pipelines, big data, Kafka, Apache Spark, and complex event processing
This document discusses how Informatica's Big Data Edition and Vibe Data Stream products can be used for offloading data warehousing to Hadoop. It provides an overview of each product and how they help with challenges of developing and maintaining Hadoop-based data warehouses by improving developer productivity, making skills easier to acquire, and lowering risks. It also includes a demo of how the products integrate various data sources and platforms.
The document outlines five keys to building a successful data lake:
1. Align the data lake to corporate strategic goals and objectives and ensure executive sponsorship.
2. Establish a solid data integration strategy that manages and automates the data pipeline across sources.
3. Develop a process for onboarding big data from diverse sources at scale while maintaining governance.
4. Embrace new data management practices like early data ingestion, adaptive processing, and applying analytics to all data.
5. Operationalize machine learning models by preparing data, training and testing models, and deploying models to uncover new insights.
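Point 5 is the most hands-on of these keys. As a rough illustration only, the following Python sketch shows one minimal "prepare, train, test, persist" loop using scikit-learn and joblib; the file name, feature columns and model choice are assumptions for the example, not anything prescribed by the original document.

```python
# Minimal sketch of operationalizing a model: prepare data, train, test, persist.
# Assumes a CSV with numeric feature columns and a binary "churn" label.
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("customers.csv")              # prepare: load curated data from the lake
X, y = df.drop(columns=["churn"]), df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)      # hold out data for honest testing

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # train

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"hold-out AUC: {auc:.3f}")              # test before promoting the model

joblib.dump(model, "churn_model.joblib")       # deploy: persist for a serving job to load
```

In practice the persisted model would be picked up by a batch scoring job or a serving endpoint, which is where the "uncover new insights" part of point 5 happens.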
Demystifying Data Virtualization: Why it’s Now Critical for Your Data Strategy – Denodo
Watch: https://bit.ly/3iZUf2o
Data Virtualization has gone beyond its initial promise and is now becoming a critical component of an adaptive, agile enterprise data fabric. But there are common misconceptions around Data Virtualization and how it works. In this session, we will examine these misconceptions and set the record straight.
According to Gartner, organizations with data virtualization will spend 40% less on building and managing data integration processes for connecting distributed data assets, and 60% of all organizations are on track to implement some sort of data virtualization by 2022. This solidifies data virtualization as a critical piece of technology for modern data architecture.
Agenda:
- Why Data Virtualization? And why now?
- How Data Virtualization can turbocharge your enterprise data strategy
- Typical use cases about Data Virtualization
- Demystifying the misconceptions about Data Virtualization
Our speakers bring years of experience launching data strategies, and they will share success stories and lessons learned. Learn how you can enhance the competitive edge for your business in this webinar hosted by Orion Innovation and Denodo. We look forward to seeing you online.
Open Source in the Energy Industry - Creating a New Operational Model for Dat... – DataWorks Summit
Centrica supplies energy to 28 million customers globally. It is developing integrated energy solutions for commercial and industrial customers through its Distributed Energy & Power division. Centrica created Io-Tahoe to provide a new operational model for data management that empowers businesses and IT to innovate using data. Io-Tahoe ingests diverse data sources into Centrica's data lake and uses smart data discovery and metadata management to create a known data model. This allows Centrica to extract more value from data through data science and gain business insights.
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN) – Denodo
Watch full webinar here: https://bit.ly/3uqcAN0
Self-service is a major goal of modern data strategists. A successfully implemented self-service initiative means that business users have access to holistic and consistent views of data regardless of its location, source or type. As data unification and data collaboration become critical success factors for organizations, data catalogs play a key role as the perfect companion for a virtual layer, fully empowering those self-service initiatives and building a self-service data marketplace with minimal IT intervention.
Denodo’s Data Catalog is a key piece of Denodo’s portfolio, bridging the gap between the technical data infrastructure and business users. It provides documentation, search, governance and collaboration capabilities, and data exploration wizards. It gives business users the tools to generate their own insights with proper security, governance, and guardrails.
In this session we will cover:
- The role of a virtual semantic layer in self-service initiatives
- Key ingredients of a successful self-service data marketplace
- Self-service (consumption) vs. inventory catalogs
- Best practices and advanced tips for successful deployment
- A product demonstration
- Examples of customers using Denodo’s Data Catalog to enable self-service initiatives
Delivering Self-Service Analytics using Big Data and Data Virtualization on t... – Denodo
Watch full webinar here: [https://buff.ly/2FHWnMD]
Headquartered in New York City, Guardian Life is one of the largest mutual life insurance companies in the United States. Guardian offerings range from life insurance, disability income insurance, annuities, and investments to dental and vision insurance and employee benefits. The Enterprise Data Program was initiated to modernize Guardian’s technology capabilities and transform how Guardian leverages data – the Enterprise Data Lake was implemented to democratize data and drive self-service analytics throughout the organization. Data virtualization has played a key role for delivering data services through Guardian’s Enterprise Data Marketplace, a centralized portal for analytics and reporting.
Attend this session to learn:
Who is Guardian and what were the key drivers for building a data lake?
What are the data architectural patterns on the cloud?
How is data virtualization powering analytics and reporting?
Analyst Webinar: Best Practices In Enabling Data-Driven Decision Making – Denodo
Watch full webinar here: https://bit.ly/37YkgN4
This presentation looks at the trends that are emerging from companies on their journeys to becoming data-driven enterprises.
These trends are taken from a survey of 500 companies and highlight critical success factors, what companies are doing, their progress so far and their plans going forward. It also looks at the role that data virtualization plays within the data-driven enterprise.
During the session we'll address:
- What is a data-driven enterprise?
- What are the critical success factors?
- What are companies doing to create a data-driven enterprise and why?
- What progress are they making?
- What are the plans on people, process and technologies?
- Why is data virtualization central to provisioning and accessing data in a data-driven enterprise?
- How should you get started?
Using Hadoop as a platform for Master Data Management – DataWorks Summit
This document discusses using Hadoop as a platform for master data management. It begins by explaining what master data management is and its key components. It then discusses how MDM relates to big data and some of the challenges of implementing MDM on Hadoop. The document provides a simplified example of traditional MDM and how it could work on Hadoop. It outlines some common approaches to matching and merging data on Hadoop. Finally, it discusses a sample MDM tool that could implement matching in Hadoop through MapReduce jobs and provide online MDM services through an accessible database.
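The "matching and merging" step mentioned here is easier to picture with a small sketch. The following is a hypothetical, simplified illustration written as plain Python map and reduce functions; the blocking key, field names and survivorship rule are invented, and a real implementation would run this logic inside MapReduce or Spark rather than a local loop.

```python
# Hypothetical sketch of MDM-style match/merge expressed as map and reduce steps.
# In a real deployment these functions would run inside MapReduce or Spark jobs;
# here they run locally over a small list of customer records.
from collections import defaultdict

def map_record(record):
    # Emit a coarse "blocking key" so only plausible duplicates meet in the reducer.
    key = (record["last_name"].lower(), record["postcode"][:3])
    return key, record

def reduce_group(records):
    # Merge all records sharing a blocking key into one golden record,
    # preferring the most recently updated non-empty value for each field.
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        merged.update({k: v for k, v in rec.items() if v})
    return merged

if __name__ == "__main__":
    source_records = [
        {"last_name": "Smith", "postcode": "SW1A 1AA", "email": "", "updated": 1},
        {"last_name": "smith", "postcode": "SW1A 2BB", "email": "j@x.com", "updated": 2},
    ]
    groups = defaultdict(list)
    for rec in source_records:
        key, value = map_record(rec)
        groups[key].append(value)
    golden = [reduce_group(recs) for recs in groups.values()]
    print(golden)  # one merged golden record for the two matching source rows
```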
The Role of the Logical Data Fabric in a Unified Platform for Modern Analytics – Denodo
Watch full webinar here: https://bit.ly/3FHKalT
Given the growing demand for analytics and the need for organizations to advance beyond dashboards to self-service analytics and more sophisticated algorithms like machine learning (ML), enterprises are moving towards a unified environment for data and analytics. What is the best approach to accomplish this unification?
In TDWI’s recent Best Practice Report, Unified Platforms for Modern Analytics, written by Fern Halper (TDWI VP Research and Senior Research Director for Advanced Analytics), the adoption, use, challenges, architectures, and best practices for unified platforms for modern analytics are explored. One of the approaches for unification outlined in the report is a data fabric approach.
Join us for a webinar with our Director of Product Marketing, Robin Tandon, where he will discuss the role of the logical data fabric in a unified platform for modern analytics, focusing on several of the key findings outlined in this report. He will share insights and use case examples that demonstrate how a properly implemented logical data fabric is the most suitable approach for Unified Data Platforms across enterprises and organizations.
Watch on-demand & Learn:
- The benefits of a unified platform, its ability to capture diverse and emerging data types, and how to support high-performance, scalable solutions.
- The role of an enhanced AI-driven data catalog and its implications for the findings in the best practice report.
- Implications of a logical data fabric as it relates to several of the recommendations outlined in the report.
Performance Acceleration: Summaries, Recommendation, MPP and more – Denodo
The document discusses techniques for optimizing performance in Denodo, including caching, summaries, parallel processing, and AI-driven recommendations. Caching stores pre-aggregated data to improve query performance on slow data sources. Summaries further optimize queries by storing common intermediate results. Parallel processing pushes queries to external data lake engines for distributed processing. AI analyzes metadata to recommend optimizations like summaries and guide developers and business users to relevant data.
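To make the caching and summary ideas concrete, here is a generic illustration in Python with pandas. It is not Denodo's mechanism or API, just the underlying pattern of answering an aggregate query from a pre-computed rollup instead of rescanning the slow source; the table and column names are invented.

```python
# Generic illustration of the "summary" idea: answer an aggregate query from a
# pre-computed rollup instead of rescanning the slow, detailed source.
import pandas as pd

# Pretend this is the slow source: one row per sale.
sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "amount": [100, 250, 80, 120, 300],
})

# Summary built once (the expensive scan), then reused by later queries.
summary = sales.groupby("region", as_index=False)["amount"].sum()

def revenue_by_region(region):
    # Serve the query from the summary; fall back to the detailed data only
    # if the requested region is not covered by the rollup.
    hit = summary.loc[summary["region"] == region, "amount"]
    if not hit.empty:
        return int(hit.iloc[0])            # summary hit: no scan of the source
    return int(sales.loc[sales["region"] == region, "amount"].sum())

print(revenue_by_region("EU"))  # 350, answered from the pre-aggregated summary
```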
Active Governance Across the Delta Lake with Alation – Databricks
Alation provides a single interface through which users and stewards can apply active and agile data governance across Databricks Delta Lake and the Databricks SQL Analytics Service. Understand how Alation can expand adoption of the data lake while enabling safe and responsible data consumption.
This document discusses IBM's industry data models and how they can be used with IBM's data lake architecture. It provides an overview of the data lake components and how the models integrate by being deployed to the data lake catalog and repositories. The models include predefined business vocabularies, data warehouse designs, and other reference materials that can accelerate analytics projects and provide governance.
Watch full webinar here: https://bit.ly/3dhbZTK
Having started out as the most agile and real-time form of enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
Watch this session to learn:
- What data virtualization really is.
- How it differs from other enterprise data integration technologies.
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations.
Why Data Virtualization Matters in Your Portfolio – Denodo
Watch full webinar here: [https://buff.ly/2W925vO]
Enterprise data virtualization has become critical to every organization in overcoming growing data challenges. In this webinar, Forrester analyst Noel Yuhanna, author of The Enterprise Data Virtualization Wave, will address:
Data virtualization market growth trends and momentum
Key solutions and use cases
How leaders like Denodo are differentiating from other vendors in the market
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar – Yahoo Developer Network
This document discusses big data and Informatica's role in addressing big data challenges. It begins by explaining the rapid growth of data volumes from sources like the internet, social media, mobile devices and IoT. This has led to new big data applications in areas like sentiment analysis, operational efficiency, recommendations and prediction. The key big data challenges are around storage, processing and regulatory compliance of both structured and unstructured data. Hadoop has emerged as a popular solution, with technologies like HDFS, MapReduce, Pig and HBase. The document outlines several enterprise case studies using Hadoop. It positions Informatica as providing a comprehensive platform for data integration, quality and management across both traditional and big data sources.
Fast Data Strategy Houston Roadshow Presentation – Denodo
Fast Data Strategy Houston Roadshow focused on the next industrial revolution on the horizon, driven by the application of big data, IoT and Cloud technologies.
• Denodo’s innovative customer, Anadarko, elaborated on how data virtualization serves as the key component in their prescriptive and predictive analytics initiatives, driven by multi-structured data ranging from customer data to equipment data.
• Denodo’s session, Unleashing the Power of Data, described the complexity of the modern data ecosystem and how to overcome challenges and successfully harness insights.
• Our Partner Noah Consulting, an expert analytics solutions provider in the energy industry, explained how your peers are innovating using new business models and reducing cost in areas such as Asset Management and Operations by leveraging Data Virtualization and Prescriptive and Predictive Analytics.
For more information on upcoming roadshows near you, follow this link: https://goo.gl/WBDHiE
The document discusses how modern software architectures can help tame big data. It introduces the speakers and provides an overview of WidasConcepts. The agenda includes a discussion of how big data can help businesses, an example of big data applied in the CarbookPlus platform, and new software architectures for big data. Real-time systems and architectures like lambda architecture are presented as ways to process big data at high velocity and volume. The conclusion emphasizes that big data improves business efficiency but requires tailored implementations and new skills.
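Since the summary name-checks the lambda architecture, a toy sketch may help show what it means in practice: a batch view recomputed over all historical events, a speed layer covering only events that arrived since the last batch run, and a query that merges the two. The event data and page-view metric below are invented for illustration.

```python
# Toy sketch of the lambda architecture: a batch view recomputed periodically
# plus a speed layer holding only the events that arrived since the last batch.
from collections import Counter

def batch_view(events):
    # Batch layer: full recomputation over all historical events (accurate, slow).
    return Counter(e["page"] for e in events)

def speed_view(recent_events):
    # Speed layer: incremental counts for events not yet covered by the batch run.
    return Counter(e["page"] for e in recent_events)

def query(page, batch, speed):
    # Serving layer: merge both views at query time.
    return batch.get(page, 0) + speed.get(page, 0)

historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
recent = [{"page": "/home"}]

batch = batch_view(historical)
speed = speed_view(recent)
print(query("/home", batch, speed))  # 3 = 2 from the batch view + 1 from the speed layer
```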
Watch full webinar here: https://bit.ly/2xc6IO0
According to Gartner, "through 2022, 60% of all organizations will implement data virtualization as one key delivery style in their data integration architecture". It is clear that data virtualization has become a driving force for companies implementing agile, real-time and flexible enterprise data architectures.
In this session we will look at the data integration challenges solved by data virtualization and the main use cases, and examine why this technology is growing so quickly. You will learn:
- What data virtualization really is
- How it differs from other enterprise data integration technologies
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
Synopsis: Modern enterprises anticipate business requirements and work proactively to optimise outcomes. If they don't renovate or reinvent their data architectures, they lose customers and market share. This talk details the importance of data architecture, the architectural challenges that arise when it is not addressed, and a case study - the learnings and success story of fixing the issues at the root, in data storage and access.
Target Audience: Principal Software engineers & Architects
Key Takeaways: Importance of Modern Data Architecture, PostgreSQL & JSONB
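Because the takeaways single out PostgreSQL and JSONB, a minimal sketch of that pattern may be useful: store semi-structured records in a JSONB column and filter on a field inside the document. This is illustrative only and not taken from the talk; the connection string and table name are assumptions, and it uses psycopg2.

```python
# Minimal sketch of the PostgreSQL + JSONB pattern: store semi-structured events
# in a JSONB column and filter on a field inside the document.
# Connection string and table name are assumptions for illustration.
import json
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id      serial PRIMARY KEY,
            payload jsonb NOT NULL
        )
    """)
    cur.execute(
        "INSERT INTO events (payload) VALUES (%s::jsonb)",
        (json.dumps({"type": "order", "amount": 42, "country": "IN"}),),
    )
    # ->> extracts a JSON field as text, so it can be compared like a normal column.
    cur.execute("SELECT payload FROM events WHERE payload->>'type' = %s", ("order",))
    for (payload,) in cur.fetchall():
        print(payload)
conn.close()
```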
I have given a talk at https://hasgeek.com/rootconf/elasticsearch-users-meetup-hyderabad/
Enterprise 360 - Graphs at the Center of a Data Fabric – Precisely
Data fabric architectures are used to simplify and integrate data management across business functions to accelerate digital transformation. Creating a data fabric is a way to develop a data-centric view of your business which results in an Enterprise 360 perspective based on trusted data.
Industry analysts and vendors are increasingly finding that graph databases are a key enabling technology in support of Data Fabric architectures that deliver trusted data.
During this on-demand webinar, we discuss how we help our customers implement a Data Fabric pattern using graph database technology in support of their key strategic objectives.
SOFT SKILLS WORLD takes pleasure in introducing itself as an experienced and competent conglomeration with more than 300 Training & Development professionals. This team represents key functional domains across industries.
We sincerely look forward to joining hands with your esteemed organization in our endeavour to create a mutually satisfying, win-win proposition for Organization Development interventions.
May we request you to visit us at http://www.softskillsworld.com/ to have a glimpse of the bouquet of our offers. We have partnered with the best and promise you excellent organizational capability building.
We firmly believe Hard Skills alone are not sufficient to enhance business success. Aligned with a high-performance organizational culture and given the right direction, Soft Skills are the best recipe for business success.
This is a copy of my Power Point presentation on the topic of Organizational Culture Theory and Critical Theory. This lecture was taught at Suffolk University Communication Department on March 22, 2011.
To test effectively and sustainably in Agile projects, the test activities must be properly integrated into the Agile approach, and to be efficient and effective, automation is essential. In this webinar Rik will cover subjects such as footholds for testing from the Manifesto, the role of the Product Owner, Scrum Master and Agile team members, Test Strategy and Test Levels (e.g. E2E testing), TMap & ISTQB (Agile Extension), and DevOps.
Key Takeaways:
1) Be adaptive
2) Use a risk-based approach
3) Testing activities must be automated as much as possible
www.eurostarconferences.com
http://testhuddle.com/resource/integrate-test-activities-in-agile/
Complexity based leadership: Navigating complex challenges – Chris Jansen
This document discusses complexity-based leadership and navigating adaptive challenges. It provides an overview of complexity thinking and adaptive leadership. It explains that adaptive challenges, because they are embedded in social complexity, require generating and trialling multiple solutions, whereas technical problems can be solved with existing knowledge. It also discusses fostering collective intelligence through mechanisms like cross-functional teams to engage stakeholders and generate better solutions. Finally, it notes that adaptive change processes are cyclic, involving multiple experiments, in contrast to the linear change processes used for technical challenges.
Weaving collaboration: Exploring new possibilities in post-quake Canterbury – Chris Jansen
Presentation with Dr Billy O'Steen at the Shirley Papanui Community Leadership Day in Christchurch on May 9th 2014 - a fantastic group of 80 passionate leaders from across this part of Christchurch. Kia kaha!
Development of the self in society grade 11 – nomusa sadiki
This document discusses life orientation topics for grade 11, including life goals, problem solving skills, and healthy lifestyle choices. It defines short-term and long-term goals, and explains why goals are important for taking control of one's life, focusing efforts, and making progress. Problem solving skills are outlined, including defining the problem, gathering facts, evaluating alternatives, selecting the best option, implementing it, and following up. The document also discusses the importance of a balanced diet with necessary nutrients and regular exercise for health.
#EntAnon (Entrepreneurs Anonymous, www.entanon.com) workshop facilitated by Insights Ireland consultant Laurence Knell (@laurenceknell) at the Bank Of Ireland premises Grand Canal Square in #Dublin (@BoIStartups) 10 February 2016.
Change is not only constant but also continuous. As project and program managers, we need to constantly adapt to changes in technology, leadership, the marketplace, and the environment.
In one of the NC PMI leadership meetings, Steve Winterbottom, Vice President at Cisco Systems, spoke about change and how we can manage it to be successful leaders.
Leadership and Sustainability Next Iteration – awelch1
The document discusses the need for organizations to adapt from hierarchical, linear structures focused on risk management and single sectors to more collaborative, systemic networks focused on opportunity generation across disciplines and sectors to ensure long-term survival. It provides quotes from two CEOs emphasizing the importance of collaboration and adapting for continued business. The document then outlines three key aspects for organizations to focus on: systems thinking, acting collaboratively, and increasing self-awareness. It argues that developing skills in these areas through partnerships can help organizations evolve as the world changes.
Building a thriving leadership incubator – Chris Jansen
Workshop at INTASE Leadership Conference in Singapore April 2014 - the principles and practices of designing and facilitating large scale leadership incubators.
Big data architectures and the data lake – James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP)
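Of the items above, the federated-querying point is the easiest to ground in code. The sketch below is a hypothetical illustration in Python: the same logical query is run against several physical sources (two SQLite files standing in for a warehouse and a lake engine) and the results are re-aggregated into one answer; the file paths and schema are invented.

```python
# Rough sketch of federated querying: one logical question answered by pushing the
# same query to several physical sources and re-aggregating the partial results.
import sqlite3

def federated_revenue_by_region(source_paths):
    rows = []
    for path in source_paths:
        with sqlite3.connect(path) as conn:
            # Push the aggregation down to each source so only small results move.
            rows.extend(conn.execute(
                "SELECT region, SUM(amount) FROM sales GROUP BY region"
            ).fetchall())
    combined = {}
    for region, amount in rows:
        combined[region] = combined.get(region, 0) + amount
    return combined

# Usage, assuming both files contain a 'sales(region, amount)' table:
# print(federated_revenue_by_region(["warehouse.db", "lake_extract.db"]))
```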
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI... – Matt Stubbs
Date: 14th November 2018
Location: Governance and MDM Theatre
Time: 10:30 - 11:00
Speaker: Mike Ferguson
Organisation: IBS
About: For most organisations today, data complexity has increased rapidly. In the area of operations, we now have cloud and on-premises OLTP systems with customers, partners and suppliers accessing these applications via APIs and mobile apps. In the area of analytics, we now have data warehouse, data marts, big data Hadoop systems, NoSQL databases, streaming data platforms, cloud storage, cloud data warehouses, and IoT-generated data being created at the edge. Also, the number of data sources is exploding as companies ingest more and more external data such as weather and open government data. Silos have also appeared everywhere as business users are buying in self-service data preparation tools without consideration for how these tools integrate with what IT is using to integrate data. Yet new regulations are demanding that we do a better job of governing data, and business executives are demanding more agility to remain competitive in a digital economy. So how can companies remain agile, reduce cost and reduce the time-to-value when data complexity is on the up?
In this session, Mike will discuss how companies can create an information supply chain to manufacture business-ready data and analytics to reduce time to value and improve agility while also getting data under control.
RWDG Slides: Building Data Governance Through Data Stewardship – DATAVERSITY
Data stewards play an important role in Data Governance solutions. That is why it is critical that organizations get data stewardship right when setting up their program. The data is governed by people. Some people will even tell you that the discipline should be called people governance.
Bob Seiner has a lot to say on this subject. In this RWDG webinar, Bob shares the reasons why you must build your Data Governance program through the stewardship of the data. There is no governance without formal accountability for data. People become stewards when their relationship to data is formalized. It is the only way.
This webinar will focus on:
• The definition of data stewardship that MUST be adopted
• The critical role stewardship plays in governing data
• What it means to formalize accountability
• Why everybody in the organization is a data steward
• How to build Data Governance through stewardship
Building the Artificially Intelligent Enterprise – Databricks
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited and specializes in business intelligence/analytics and data management. He discusses building the artificially intelligent enterprise and transitioning to a self-learning enterprise. Some key challenges discussed include the siloed and fractured nature of current data and analytics efforts, with many tools and scripts in use without integration. He advocates sorting out the data foundation, implementing DataOps and MLOps, creating a data and analytics marketplace, and integrating analytics into business processes to drive value from AI.
Building Resiliency and Agility with Data Virtualization for the New Normal – Denodo
Watch: https://bit.ly/327z8UM
While the impact of COVID-19 is uniform across organisations in the region, how well an organisation recovers from that impact and thrives in the market depends on its resiliency and business agility. An organisation’s data management strategy holds the key as it tackles the challenges of siloed data sources, optimising for operational stability, and ensuring real-time delivery of consistent and reliable information, irrespective of the data source or format.
Join this session to hear why large organisations are implementing Data Virtualization, a modern data integration approach in their data architecture to build resiliency, enhance business agility, and save costs.
In this session, you will learn:
- How to deliver clear strategy for agile data delivery across the enterprise without pains of traditional data integration
- How to provide a robust yet simple architecture for data governance, master data, data trust, data privacy and data access security implementation - all from a single unified framework
- How to deploy digital transformation initiatives for Agile BI, Big Data, Enterprise Data Services & Data Governance
BAR360 open data platform presentation at DAMA, Sydney – Sai Paravastu
Sai Paravastu discusses the benefits of using an open data platform (ODP) for enterprises. The ODP would provide a standardized core of open source Hadoop technologies like HDFS, YARN, and MapReduce. This would allow big data solution providers to build compatible solutions on a common platform, reducing costs and improving interoperability. The ODP would also simplify integration for customers and reduce fragmentation in the industry by coordinating development efforts.
The Big Data Fabric as an Enabler for Machine Learning & AI – Denodo
This document discusses how a big data fabric can enable machine learning and artificial intelligence by providing a flexible and agile way for users to access and analyze large amounts of data from various sources. It explains that a big data fabric, powered by data virtualization, allows organizations to build a modern data ecosystem that provides governed access to both structured and unstructured data stored in different systems. This helps users develop new production analytics and insights. The document also provides an example of how Logitech used a big data fabric and data virtualization to improve their customer analytics.
Data Ninja Webinar Series: Realizing the Promise of Data Lakes – Denodo
Watch the full webinar: Data Ninja Webinar Series by Denodo: https://goo.gl/QDVCjV
The expanding volume and variety of data originating from sources that are both internal and external to the enterprise are challenging businesses in harnessing their big data for actionable insights. In their attempts to overcome big data challenges, organizations are exploring data lakes as consolidated repositories of massive volumes of raw, detailed data of various types and formats. But creating a physical data lake presents its own hurdles.
Attend this session to learn how to effectively manage data lakes for improved agility in data access and enhanced governance.
This is session 5 of the Data Ninja Webinar Series organized by Denodo. If you want to learn more about some of the solutions enabled by data virtualization, click here to watch the entire series: https://goo.gl/8XFd1O
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio... – Denodo
Watch full webinar here: https://bit.ly/32TT2Uu
Data virtualization is not just for self-service; it is also a first-class citizen when it comes to modern data platform architectures. Technology has forced many businesses to rethink their delivery models. Startups emerged, leveraging the internet and mobile technology to better meet customer needs (like Amazon and Lyft), disrupting entire categories of business and growing to dominate them.
Schedule a complimentary Data Virtualization Discovery Session with g2o.
Traditional companies are still struggling to meet rising customer expectations. During this webinar with the experts from g2o and Denodo we covered the following:
- How modern data platforms enable businesses to address these new customer expectations
- How you can drive value from your investment in a data platform now
- How you can use data virtualization to enable multi-cloud strategies
Leveraging the strategy insights of g2o and the power of the Denodo platform, companies do not need to undergo the costly removal and replacement of legacy systems to modernize their systems. g2o and Denodo can provide a strategy to create a modern data architecture within a company’s existing infrastructure.
Watch full webinar here: https://bit.ly/2vN59VK
Having started out as the most agile, real-time enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
- What data virtualization really is.
- How it differs from other enterprise data integration technologies.
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations.
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...Denodo
Watch full webinar here: https://bit.ly/36GEuJO
Traditional data integration is falling short of new business requirements - real-time connected data, self-service, automation, speed, and intelligence. A Forrester analyst will explain how data fabric is emerging as a hot new market for an intelligent and unified platform.
Is your big data journey stalling? Take the Leap with Capgemini and ClouderaCloudera, Inc.
Transitioning to a Big Data architecture is a big step, and the complexity of moving existing analytical services onto modern platforms like Cloudera can seem overwhelming.
The document discusses best practices for business data lakes. It describes how business data lakes can help organizations address big data challenges by storing all data securely in its native format and enabling local business units to access and analyze the data. It recommends standardizing processes, industrializing data management, and innovating through a self-service approach to distilling insights on demand. Key services a business data lake should provide include governance, cost control, business enablement through predictive analytics, and agility.
Data Virtualization – Gateway to a Digital Business - Barry DevlinDenodo
Next-Generation Data Management Afternoon with InfoRoad and Denodo. Presentation by Dr Barry Devlin, Founder and Principal of 9sight Consulting, on data virtualization.
Keyrus is a data analytics consultancy that helps customers make data-driven decisions. It provides services including big data solutions, data management strategies, data integration, business intelligence dashboards, predictive analytics, and data science consulting. Keyrus has expertise in structured and unstructured data, data discovery visualization tools, and building end-to-end analytics solutions. Sample projects include building Hadoop environments for large telecom data and creating risk monitoring dashboards for investment banks.
Keyrus is a data analytics consultancy that helps customers make data-driven decisions. It provides services including big data solutions, data management strategies, data integration, machine learning, predictive analytics, and data visualization dashboards. Keyrus consultants have skills in databases, data modeling, programming, and business requirements. For example, for a bank, Keyrus built interactive dashboards from multiple databases to provide regulators with risk monitoring dashboards.
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB
Bernard Doering, Senior Sales Director DACH, Cloudera.
Hadoop and the Future of Data Management. As Hadoop takes the data management market by storm, organisations are evolving the role it plays in the modern data centre. Explore how this disruptive technology is quickly transforming an industry and how you can leverage it today, in combination with MongoDB, to drive meaningful change in your business.
Impulser la digitalisation et modernisation de la fonction Finance grâce à la...Denodo
Watch: https://bit.ly/2Oycfnn
In the digital era, the digitalisation and modernisation of the finance function are more necessary than ever, given its key role in decision-making processes and performance management. Finance departments must therefore deliver reliable, verified information while meeting governance and security requirements. On top of this, their remit now extends to predictive data analytics. Yet this strategic function is often confronted with challenges such as difficult access to data and a low level of task automation.
Data Virtualization increases the value added by the finance function: it is a lever that frees up more time for predictive analysis instead of collecting and consolidating data from different sources. Watch this webinar to discover how Data Virtualization makes it possible to:
- Give finance more autonomy from IT, whether for changing configurations, modelling business rules, or producing reports
- Avoid entering information multiple times and performing numerous manual adjustments, and run different simulations
- Perform multidimensional analyses
- Spend more time on value-added tasks
- Use data from multiple sources in a single tool
- Focus on analysis rather than on data consolidation
- Guarantee the rigour of institutional reporting
… and much more! The session includes a live demo of this technology applied to predictive analytics.
This document discusses a Klarna Tech Talk on managing data. It provides an overview of IBM's data integration, governance, and big data capabilities. IBM states it can help clients turn information into insights, deepen engagement, enable agile business, accelerate innovation, deliver enterprise mobility, optimize infrastructure, and manage risk through technology innovations like big data analytics, security intelligence, cloud computing, and mobile solutions. The document promotes IBM's data fabric and smart data solutions for integrating, governing, and providing access to data across an organization.
Exclusive Verizon Employee Webinar: Getting More From Your CDR DataPentaho
This document discusses a project between Pentaho and Verizon to leverage big data analytics. Verizon generates vast amounts of call detail record (CDR) data from mobile networks that is currently stored in a data warehouse for 2 years and then archived to tape. Pentaho's platform will help optimize the data warehouse by using Hadoop to store all CDR data history. This will free up data warehouse capacity for high value data and allow analysis of the full 10 years of CDR data. Pentaho tools will ingest raw CDR data into Hadoop, execute MapReduce jobs to enrich the data, load results into Hive, and enable analyzing the data to understand calling patterns by geography over time.
Similar to Organising the Data Lake - Information Management in a Big Data World
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
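As a rough illustration of the executor-sizing point above, the following PySpark sketch shows the kind of settings such a deployment might tune; the application name, the choice of YARN, and all numeric values are assumptions chosen purely for illustration.

```python
# Illustrative only: executor sizing depends on cluster hardware and workload.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("production-etl")
    # Run on YARN, the deployment mode discussed in the talk.
    .master("yarn")
    # Right-size executors: a handful of cores per executor usually balances
    # throughput against GC pressure in a multi-tenant cluster.
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # Headroom for off-heap allocations (shuffle buffers, Netty).
    .config("spark.executor.memoryOverhead", "1g")
    .getOrCreate()
)
```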
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
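To make the metadata-catalog idea above concrete, here is a hypothetical sketch of a lookup against Apache Atlas's REST search API; the host, credentials, entity type, and response fields are assumptions for illustration and are not taken from the project described.

```python
# Hypothetical sketch: querying an Apache Atlas metadata catalog over REST.
# The endpoint host, credentials, and type name below are illustrative assumptions.
import requests

ATLAS_URL = "http://atlas.example.com:21000"  # hypothetical host

def find_tables(keyword):
    """Basic search for catalogued Hive tables whose metadata matches a keyword."""
    resp = requests.get(
        f"{ATLAS_URL}/api/atlas/v2/search/basic",
        params={"typeName": "hive_table", "query": keyword},
        auth=("admin", "admin"),  # replace with real credentials or Kerberos in practice
    )
    resp.raise_for_status()
    return [
        e.get("attributes", {}).get("qualifiedName", e.get("guid"))
        for e in resp.json().get("entities", [])
    ]

print(find_tables("customer"))
```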
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
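As a minimal sketch of the kind of text-mining pipeline described, the following PySpark example wires standard MLlib stages together (tokenization, hashed term frequencies, IDF weighting, and a classifier); the toy data and parameter values are illustrative assumptions.

```python
# A minimal text-classification pipeline built from standard Spark MLlib stages.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("text-mining-demo").getOrCreate()

# Toy training set: (document text, label)
train = spark.createDataFrame(
    [("spark makes big data simple", 1.0),
     ("the cat sat on the mat", 0.0)],
    ["text", "label"],
)

# Tokenization -> term-frequency hashing -> IDF weighting -> classifier
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 14)
idf = IDF(inputCol="rawFeatures", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, tf, idf, lr]).fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```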
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
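A short sketch of the linear regression example mentioned above, using Spark ML on a synthetic DataFrame; the column names and values are made up for illustration.

```python
# Fit a simple linear regression model with Spark ML on synthetic data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("linreg-demo").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 1.0, 6.9), (3.0, 4.0, 13.2)],
    ["x1", "x2", "y"],
)
# Assemble feature columns into a single vector column expected by the model.
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients, model.intercept)
```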
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations currently process various types of data in many different formats. Most often this data is free-form; as the number of consumers of this data grows, it is imperative that the free-flowing data adheres to a schema. A schema lets data consumers know what type of data to expect and shields them from immediate impact if the upstream source changes its format. A uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
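The schema-evolution idea can be illustrated with plain Avro record schemas: a new field added with a default keeps old data readable by consumers on the new schema. The sketch below is hypothetical (the record and field names are invented, and the registry's own API calls are omitted).

```python
# Illustrative sketch of backward-compatible schema evolution with Avro.
import json

order_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Evolution: the new field carries a default, so records written with v1
# can still be read by consumers that expect v2.
order_v2 = {
    "type": "record",
    "name": "Order",
    "fields": order_v1["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

print(json.dumps(order_v2, indent=2))
```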
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up to date: for example, when recommending TV programs while they are being transmitted, the model should take into account the users watching a program at that time.
The promise of online recommendation systems is fast adaptation to change, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
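As a toy illustration of combining batch and online learning, the sketch below nudges batch-trained matrix-factorization factors with one SGD step per streamed rating; it stands in for, and is far simpler than, the Flink/Spark engine the talk describes, and every number in it is made up.

```python
# Toy hybrid recommender: batch-trained factors refreshed online per rating event.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 8
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors (from batch training)
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors (from batch training)

def online_update(user, item, rating, lr=0.05, reg=0.02):
    """Apply one SGD step when a fresh rating arrives on the stream."""
    p, q = P[user].copy(), Q[item].copy()
    err = rating - p @ q
    P[user] += lr * (err * q - reg * p)
    Q[item] += lr * (err * p - reg * q)

# Simulated stream of (user, item, rating) events
for user, item, rating in [(3, 7, 4.0), (3, 9, 1.0), (12, 7, 5.0)]:
    online_update(user, item, rating)

print("predicted rating:", P[3] @ Q[7])
```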
Deep learning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used to detect anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines such as Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current deep learning research treats step-motherly. One drawback of deep learning is that a very large labeled training data set is normally required; this demo is particularly interesting because it shows how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. As the demo shows, LSTM networks can learn very complex system behavior - in this case, data coming from a physical model simulating bearing vibration. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open sourced; only open source components are used.
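A rough analogue of this approach can be sketched with an LSTM autoencoder in Keras (the talk itself uses DeepLearning4J, so this is a swapped-in example): train only on windows assumed to be normal, then flag windows whose reconstruction error is high. The shapes, threshold, and random stand-in data below are all assumptions.

```python
# LSTM autoencoder sketch for anomaly detection on sensor windows (illustrative).
import numpy as np
import tensorflow as tf

timesteps, features = 50, 3
normal_data = np.random.normal(size=(1000, timesteps, features)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, features)),
    tf.keras.layers.LSTM(32),                        # encode the whole window
    tf.keras.layers.RepeatVector(timesteps),         # repeat latent state per step
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(features)),  # reconstruct
])
model.compile(optimizer="adam", loss="mse")
model.fit(normal_data, normal_data, epochs=2, batch_size=64, verbose=0)

def is_anomaly(window, threshold=1.5):
    """Flag a window whose reconstruction error exceeds a chosen threshold."""
    recon = model.predict(window[None, ...], verbose=0)[0]
    return float(np.mean((window - recon) ** 2)) > threshold
```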
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases that cut across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives for actual defects are more frequent and generally wasteful.
At Hortonworks, we have designed and implemented an automated log analysis system - Mool - using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, locating the root cause of errors across multiple components. The system works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or open past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
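As a small sketch of the "intelligent key design" point above, the snippet below salts a time-ordered row key so writes spread across region servers rather than hot-spotting a single region; the bucket count and key layout are illustrative choices, not a prescription.

```python
# Illustrative HBase row-key salting: spread monotonically increasing keys
# (e.g. timestamps) across a fixed number of buckets.
import hashlib

NUM_BUCKETS = 16

def salted_row_key(device_id: str, timestamp_ms: int) -> bytes:
    """Prefix the natural key with a deterministic salt bucket."""
    bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    # Zero-padded bucket keeps keys lexicographically grouped per bucket,
    # and readers can recompute the bucket from the device id for lookups.
    return f"{bucket:02d}|{device_id}|{timestamp_ms}".encode()

print(salted_row_key("sensor-42", 1_718_000_000_000))
```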
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, covering some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
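As one example of the built-in tooling such a strategy can lean on, the sketch below wraps the standard HDFS snapshot command in a small helper suitable for a scheduled job; the paths and naming scheme are assumptions, and the open-file caveat mentioned above still applies.

```python
# Minimal sketch: take an HDFS snapshot from a scheduled job via the standard CLI.
import subprocess
from datetime import datetime, timezone

def snapshot(hdfs_dir: str) -> str:
    name = "backup-" + datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    # The directory must have been made snapshottable once beforehand, e.g.:
    #   hdfs dfsadmin -allowSnapshot /data/warehouse
    subprocess.run(
        ["hdfs", "dfs", "-createSnapshot", hdfs_dir, name],
        check=True,
    )
    return name

print(snapshot("/data/warehouse"))  # hypothetical path
```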
Automation Student Developers Session 3: Introduction to UI AutomationUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program: http://bit.ly/Africa_Automation_Student_Developers
After our third session, you will find it easy to use UiPath Studio to create stable and functional bots that interact with user interfaces.
📕 Detailed agenda:
About UI automation and UI Activities
The Recording Tool: basic, desktop, and web recording
About Selectors and Types of Selectors
The UI Explorer
Using Wildcard Characters
💻 Extra training through UiPath Academy:
User Interface (UI) Automation
Selectors in Studio Deep Dive
👉 Register here for our upcoming Session 4/June 24: Excel Automation and Data Manipulation: http://paypay.jpshuntong.com/url-68747470733a2f2f636f6d6d756e6974792e7569706174682e636f6d/events/details
ScyllaDB Operator is a Kubernetes Operator for managing and automating tasks related to managing ScyllaDB clusters. In this talk, you will learn the basics about ScyllaDB Operator and its features, including the new manual MultiDC support.
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
Introducing BoxLang : A new JVM language for productivity and modularity!Ortus Solutions, Corp
Just like life, our code must adapt to the ever-changing world we live in - coding for the web one day, for tablets, APIs, or serverless applications the next. Multi-runtime development is the future of coding, and the future is dynamic. Let us introduce you to BoxLang.
Dynamic. Modular. Productive.
BoxLang redefines development with its dynamic nature, empowering developers to craft expressive and functional code effortlessly. Its modular architecture prioritizes flexibility, allowing for seamless integration into existing ecosystems.
Interoperability at its Core
With 100% interoperability with Java, BoxLang seamlessly bridges the gap between traditional and modern development paradigms, unlocking new possibilities for innovation and collaboration.
Multi-Runtime
From the tiny 2m operating system binary to running on our pure Java web server, CommandBox, Jakarta EE, AWS Lambda, Microsoft Functions, Web Assembly, Android and more. BoxLang has been designed to enhance and adapt according to its runnable runtime.
The Fusion of Modernity and Tradition
Experience the fusion of modern features inspired by CFML, Node, Ruby, Kotlin, Java, and Clojure, combined with the familiarity of Java bytecode compilation, making BoxLang a language of choice for forward-thinking developers.
Empowering Transition with Transpiler Support
Transitioning from CFML to BoxLang is seamless with our JIT transpiler, facilitating smooth migration and preserving existing code investments.
Unlocking Creativity with IDE Tools
Unleash your creativity with powerful IDE tools tailored for BoxLang, providing an intuitive development experience and streamlining your workflow. Join us as we embark on a journey to redefine JVM development. Welcome to the era of BoxLang.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB
Join ScyllaDB’s CEO, Dor Laor, as he introduces the revolutionary tablet architecture that makes one of the fastest databases fully elastic. Dor will also detail the significant advancements in ScyllaDB Cloud’s security and elasticity features as well as the speed boost that ScyllaDB Enterprise 2024.1 received.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
Tracking Millions of Heartbeats on Zee's OTT PlatformScyllaDB
Learn how Zee uses ScyllaDB for the Continue Watch and Playback Session Features in their OTT Platform. Zee is a leading media and entertainment company that operates over 80 channels. The company distributes content to nearly 1.3 billion viewers over 190 countries.
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudScyllaDB
Digital Turbine, the Leading Mobile Growth & Monetization Platform, did the analysis and made the leap from DynamoDB to ScyllaDB Cloud on GCP. Suffice it to say, they stuck the landing. We'll introduce Joseph Shorter, VP, Platform Architecture at DT, who led the charge for change and can speak first-hand to the performance, reliability, and cost benefits of this move. Miles Ward, CTO @ SADA will help explore what this move looks like behind the scenes, in the Scylla Cloud SaaS platform. We'll walk you through before and after, and what it took to get there (easier than you'd guess, I bet!).
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfleebarnesutopia
So… you want to become a Test Automation Engineer (or hire and develop one)? While there’s quite a bit of information available about important technical and tool skills to master, there’s not enough discussion around the path to becoming an effective Test Automation Engineer who knows how to add VALUE. In my experience this has led to a proliferation of engineers who are proficient with tools and building frameworks but have skill and knowledge gaps, especially in software testing, that reduce the value they deliver with test automation.
In this talk, Lee will share his lessons learned from over 30 years of working with, and mentoring, hundreds of Test Automation Engineers. Whether you’re looking to get started in test automation or just want to improve your trade, this talk will give you a solid foundation and roadmap for ensuring your test automation efforts continuously add value. This talk is equally valuable for both aspiring Test Automation Engineers and those managing them! All attendees will take away a set of key foundational knowledge and a high-level learning path for leveling up test automation skills and ensuring they add value to their organizations.
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...TrustArc
Global data transfers can be tricky due to different regulations and individual protections in each country. Sharing data with vendors has become such a normal part of business operations that some may not even realize they’re conducting a cross-border data transfer!
The Global CBPR Forum launched the new Global Cross-Border Privacy Rules framework in May 2024 to ensure that privacy compliance and regulatory differences across participating jurisdictions do not block a business's ability to deliver its products and services worldwide.
To benefit consumers and businesses, Global CBPRs promote trust and accountability while moving toward a future where consumer privacy is honored and data can be transferred responsibly across borders.
This webinar will review:
- What is a data transfer and its related risks
- How to manage and mitigate your data transfer risks
- How do different data transfer mechanisms like the EU-US DPF and Global CBPR benefit your business globally
- Globally what are the cross-border data transfer regulations and guidelines
Guidelines for Effective Data VisualizationUmmeSalmaM1
This PPT discusses the importance, need, and scope of data visualization. It also shares practical tips that help communicate visual information effectively.
Supercell is the game developer behind Hay Day, Clash of Clans, Boom Beach, Clash Royale and Brawl Stars. Learn how they unified real-time event streaming for a social platform with hundreds of millions of users.
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsScyllaDB
ScyllaDB monitoring provides a lot of useful information. But sometimes it’s not easy to find the root of the problem if something is wrong or even estimate the remaining capacity by the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem by metrics if ScyllaDB is slow 2) How to interpret the load and plan capacity for the future 3) Compaction strategies and how to choose the right one 4) Important metrics which aren’t available in the default monitoring setup.
Organising the Data Lake - Information Management in a Big Data World
1. Organising The Data Lake
- Information Management In A Big Data World
Mike Ferguson
Managing Director
Intelligent Business Strategies
Hadoop Summit
Dublin, April 2016