Modern data management using Kappa and streaming architectures, including a discussion by eBay's Connie Yang about the Rheos platform and the use of Oracle GoldenGate, Kafka, Flink, etc.
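In a Kappa architecture the immutable log is the system of record, so "reprocessing" just means replaying the topic from the beginning rather than maintaining a separate batch layer. A minimal sketch of that replay using the kafka-python client; the topic name, broker address and event shape are hypothetical:

```python
# Kappa-style reprocessing: rebuild derived state by replaying the immutable
# event log from the start, instead of running a separate batch layer.
# Topic name, broker address, and event shape are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders-events",                     # hypothetical topic
    bootstrap_servers="localhost:9092",  # adjust for your cluster
    auto_offset_reset="earliest",        # start from the first retained event
    enable_auto_commit=False,            # a replay should not commit offsets
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,            # stop when the log is exhausted (demo only)
)

state = {}  # derived view, rebuilt entirely from the log
for msg in consumer:
    event = msg.value
    # fold each change event into the materialized view
    state[event["key"]] = event["value"]

print(f"rebuilt {len(state)} keys from the log")
```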
This document discusses Oracle Data Integration solutions for tapping into big data reservoirs. It begins with an overview of Oracle Data Integration and how it can improve agility, reduce risk and costs. It then discusses Oracle's approach to comprehensive data integration and governance capabilities including real-time data movement, data transformation, data federation, and more. The document also provides examples of how Oracle Data Integration has been used by customers for big data use cases involving petabytes of data.
Oracle Data Integration overview, vision and roadmap. Covers GoldenGate, Data Integrator (ODI), Data Quality (EDQ), Metadata Management (MM) and Big Data Preparation (BDP)
The document discusses Oracle's data integration products and big data solutions. It outlines five core capabilities of Oracle's data integration platform, including data availability, data movement, data transformation, data governance, and streaming data. It then describes eight core products that address real-time and streaming integration, ELT integration, data preparation, streaming analytics, dataflow ML, metadata management, data quality, and more. The document also outlines five cloud solutions for data integration including data migrations, data warehouse integration, development and test environments, high availability, and heterogeneous cloud. Finally, it discusses pragmatic big data solutions for data ingestion, transformations, governance, connectors, and streaming big data.
How Apache Spark and Apache Hadoop are being used to keep banking regulators ... (DataWorks Summit)
The global financial crisis showed that banks' traditional IT systems were ill-equipped to monitor and manage a risk landscape that was changing daily. The sheer amount of data that needed to be crunched meant that many banks were a day or more behind in calculating, understanding and reporting their risk positions. After the crisis, a regulatory review led to new legislation, BCBS 239: Principles for effective risk data aggregation and risk reporting, which requires banks to meet far more stringent timeliness requirements in their ability to aggregate and report on their quickly changing risk positions, or risk fines running to millions of dollars. To meet these new requirements, banks have been forced to rethink traditional IT architectures that cannot cope with the sheer volume of risk data, and are instead turning to Apache Hadoop and Apache Spark to build out the next generation of risk systems. In this talk you will discover how some of the leading banks in the world are leveraging Apache Hadoop and Apache Spark to meet the BCBS 239 regulation.
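The aggregation requirement at the heart of BCBS 239 (rolling up fast-changing positions across books and counterparties on a deadline) maps naturally onto Spark's distributed grouped aggregations. A minimal PySpark sketch, assuming a hypothetical positions dataset with book, counterparty and exposure columns:

```python
# Minimal BCBS 239-style risk aggregation sketch in PySpark.
# The schema (book, counterparty, exposure) and the Parquet path are
# hypothetical; real risk models involve many more measures and dimensions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-aggregation").getOrCreate()

positions = spark.read.parquet("/data/risk/positions/")  # hypothetical path

# Aggregate exposure by book and counterparty in parallel across the cluster,
# the step that overwhelmed single-node, end-of-day batch systems.
exposures = (
    positions
    .groupBy("book", "counterparty")
    .agg(
        F.sum("exposure").alias("net_exposure"),
        F.count("*").alias("position_count"),
    )
    .orderBy(F.desc("net_exposure"))
)

exposures.show(20)
```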
Speaker
Kunal Taneja
Expand a Data Warehouse with Hadoop and Big Data (jdijcks)
After investing years in the data warehouse, are you now supposed to start over? Nope. This session discusses how to leverage Hadoop and big data technologies to augment the data warehouse with new data, new capabilities and new business models.
Oracle PL/SQL 12c and 18c New Features + RADstack + Community Sites (Steven Feuerstein)
Slides presented at moug.org's August 2018 conference. Covers the RADstack (REST - APEX - Database) + our community sites (AskTOM, LiveSQL and Dev Gym) + a whole bunch of cool new PL/SQL features. Search LiveSQL.oracle.com for scripts to match up with the features presented.
Break Free From Oracle with Attunity and Microsoft (Attunity)
- Attunity provides products that enable migration from Oracle databases to SQL Server with zero downtime. Their strategic partnership with Microsoft has lasted over 20 years.
- The document highlights benefits of migrating to SQL Server such as everything being built-in, including business intelligence, data warehousing, mobile BI and self-service BI. It claims SQL Server has industry-leading total cost of ownership.
- Attunity Replicate allows transferring, transforming, filtering and replicating data between heterogeneous sources and targets such as databases, data warehouses, Hadoop and cloud. It has a simple click-to-load interface and supports many source and target systems.
IBM's zAnalytics strategy provides a complete picture of analytics on the mainframe using DB2, the DB2 Analytics Accelerator, and Watson Machine Learning for System z. The presentation discusses updates to DB2 for z/OS including agile partition technology, in-memory processing, and RESTful APIs. It also reviews how the DB2 Analytics Accelerator can integrate with Machine Learning for z/OS to enable scoring of machine learning models directly on the mainframe for both small and large datasets.
The document discusses rolling out a Hadoop-based data lake for self-service analytics within a corporate environment. It describes the background and motivation for implementing the data lake. Key challenges addressed include security, governance, and change management. Lessons learned include the importance of guidelines, reusable components, integration testing, and understanding users' diverse needs.
Oracle Big Data Appliance and Big Data SQL for advanced analytics (jdijcks)
Overview presentation showing Oracle Big Data Appliance and Oracle Big Data SQL in combination, and why this really matters. Big Data SQL brings you the unique ability to analyze data across the entire spectrum of systems: NoSQL, Hadoop and Oracle Database.
Hortonworks Oracle Big Data Integration (Hortonworks)
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
The document discusses cloud platform architectures, security, and logging. It covers:
1) Different views of cloud platform architectures including logical/technology agnostic views and native AWS/Azure views.
2) Cloud platform security including security frameworks, AWS VPC security architectures, IAM access management, and securing AWS applications and data storage.
3) Platform logging and monitoring for centralized troubleshooting, security, auditing and monitoring.
This document discusses data mesh, a distributed data management approach for microservices. It outlines the challenges of implementing microservice architecture including data decoupling, sharing data across domains, and data consistency. It then introduces data mesh as a solution, describing how to build the necessary infrastructure using technologies like Kubernetes and YAML to quickly deploy data pipelines and provision data across services and applications in a distributed manner. The document provides examples of how data mesh can be used to improve legacy system integration, batch processing efficiency, multi-source data aggregation, and cross-cloud/environment integration.
Consolidate your data marts for fast, flexible analytics 5.24.18 (Cloudera, Inc.)
In this webinar, Cloudera and AtScale will showcase:
How a company can modernize their analytic architecture to deliver flexibility and agility to more end-users.
How using AtScale’s Universal Semantic layer can end the data chaos and allow business users to use the data in the modern platform.
The performance of AtScale and Cloudera’s analytic database, highlighted with newly completed TPC-DS standard benchmarking.
Best practices for migrating from legacy appliances.
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (Informatica)
This presentation is geared toward enterprise architects and senior IT leaders looking to drive more value from their data by learning about cloud data lake management.
As businesses focus on leveraging big data to drive digital transformation, technology leaders are struggling to keep pace with the high volume of data coming in at high speed and rapidly evolving technologies. What's needed is an approach that helps you turn petabytes into profit.
Cloud data lakes and cloud data warehouses have emerged as a popular architectural pattern to support next-generation analytics. Informatica's comprehensive AI-driven cloud data lake management solution natively ingests, streams, integrates, cleanses, governs, protects and processes big data workloads in multi-cloud environments.
Please leave any questions or comments below.
This document discusses strategies for successfully utilizing a data lake. It notes that creating a data lake is just the beginning and that challenges include data governance, metadata management, access, and effective use of the data. The document advocates for data democratization through discovery, accessibility, and usability. It also discusses best practices like self-service BI and automated workload migration from data warehouses to reduce costs and risks. The key is to address the "data lake dilemma" of these challenges to avoid a "data swamp" and slow adoption.
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic... (Igor De Souza)
With Industry 4.0, several technologies are used to analyze data in real time; maintaining, organizing and building all of this, on the other hand, is a complex and complicated job. Over the past 30 years we have seen several ideas for centralizing the database in a single place as the one true source of data implemented in companies, such as the Data Warehouse, NoSQL, the Data Lake, and the Lambda and Kappa architectures.
Meanwhile, software engineering has been applying ideas such as microservices to separate applications in order to simplify them and improve performance.
The idea is to apply the microservice patterns to the data and divide the model into several smaller ones. A good way to split it up is to model along DDD principles, and that is how I try to explain and define Data Mesh and Data Fabric.
Webinar future dataintegration-datamesh-and-goldengatekafka (Jeffrey T. Pollock)
The Future of Data Integration: Data Mesh, and a Special Deep Dive into Stream Processing with GoldenGate, Apache Kafka and Apache Spark. This video is a replay of a Live Webinar hosted on 03/19/2020.
Join us for a timely 45min webinar to see our take on the future of Data Integration. As the global industry shift towards the “Fourth Industrial Revolution” continues, outmoded styles of centralized batch processing and ETL tooling continue to be replaced by realtime, streaming, microservices and distributed data architecture patterns.
This webinar will start with a brief look at the macro-trends happening around distributed data management and how that affects Data Integration. Next, we’ll discuss the event-driven integrations provided by GoldenGate Big Data, and continue with a deep-dive into some essential patterns we see when replicating Database change events into Apache Kafka. In this deep-dive we will explain how to effectively deal with issues like Transaction Consistency, Table/Topic Mappings, managing the DB Change Stream, and various Deployment Topologies to consider. Finally, we’ll wrap up with a brief look into how Stream Processing will help to empower modern Data Integration by supplying realtime data transformations, time-series analytics, and embedded Machine Learning from within data pipelines.
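One of the patterns discussed, table/topic mapping with per-row ordering, can be sketched outside of GoldenGate itself. The toy producer below derives the Kafka topic from the source table and keys each change event by primary key, so all changes to a given row land on one partition and arrive in order. This illustrates the pattern only; it is not GoldenGate's actual Kafka handler, and the change-record structure is hypothetical (kafka-python client):

```python
# Pattern sketch: publish database change events to Kafka with a
# table-to-topic mapping and primary-key partitioning, so every change to a
# given row lands on one partition and per-row order is preserved.
# Not GoldenGate's actual handler; the record structure is hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for replication; favors consistency over latency
)

def publish_change(change: dict) -> None:
    """Map one change record onto a topic derived from its source table."""
    topic = f"cdc.{change['schema']}.{change['table']}"   # table/topic mapping
    key = str(change["primary_key"])                      # per-row ordering
    producer.send(topic, key=key, value=change)

publish_change({
    "schema": "sales", "table": "orders", "primary_key": 1042,
    "op": "UPDATE", "after": {"status": "SHIPPED"},
})
producer.flush()
```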
GoldenGate: https://www.oracle.com/middleware/tec...
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
How to Operationalise Real-Time Hadoop in the Cloud (Attunity)
Hadoop and the Cloud are two of the most disruptive technologies to have emerged from the last decade, but how can you adapt to the increasing rate of change whilst providing the enterprise with the right data, quickly?
Watch this webinar with Attunity, Cloudera and Microsoft and learn:
-How to ingest the most valuable enterprise data into Hadoop
-About real life use cases of Cloudera on Azure
-How to combine the power of Hadoop and the scalable flexibility of Azure
Enable your business with more data in less time. Visit www.attunity.com for more information.
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In... (DataWorks Summit)
Progressive Insurance is well known for its innovative use of data to better serve its customers, and the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the Data Lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads, how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or faster and easier) to populate the data lake – and keep it up to date – with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Deep-dive into Microservices Patterns with Replication and Stream Analytics
Target Audience: Microservices and Data Architects
This is an informational presentation about microservices event patterns, GoldenGate event replication, and event stream processing with Oracle Stream Analytics. This session will discuss some of the challenges of working with data in a microservices architecture (MA), and how the emerging concept of a “Data Mesh” can go hand-in-hand to improve microservices-based data management patterns. You may have already heard about common microservices patterns like CQRS, Saga, Event Sourcing and Transaction Outbox; we’ll share how GoldenGate can simplify these patterns while also bringing stronger data consistency to your microservice integrations. We will also discuss how complex event processing (CEP) and stream processing can be used with event-driven MA for operational and analytical use cases.
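Of the patterns named above, the transaction outbox is the simplest to see in miniature: the service commits its business row and the event it wants published in one local transaction, and a log-based replicator (GoldenGate, in this session's framing) ships the outbox rows downstream. A minimal sketch using Python's built-in sqlite3, with hypothetical table names and event shape:

```python
# Transaction outbox sketch: the business write and the event to publish are
# committed atomically in one local transaction; a log-based replicator
# (e.g. GoldenGate, as in the session) ships outbox rows downstream later.
# Table names and the event shape are hypothetical.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT);
""")

def place_order(order_id: int) -> None:
    # One transaction covers both writes: either the order and its event
    # both exist, or neither does -- no dual-write inconsistency.
    with db:
        db.execute("INSERT INTO orders (id, status) VALUES (?, ?)",
                   (order_id, "PLACED"))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", json.dumps({"order_id": order_id,
                                          "event": "OrderPlaced"})))

place_order(1)
print(db.execute("SELECT topic, payload FROM outbox").fetchall())
```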
Business pressures for modernization and digital transformation drive demand for rapid, flexible DevOps, which microservices address, but also for data-driven Analytics, Machine Learning and Data Lakes which is where data management tech really shines. Join us for this presentation where we take a deep look at the intersection of microservice design patterns and modern data integration tech.
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
Big data journey to the cloud rohit pujari 5.30.18 (Cloudera, Inc.)
We hope this session was valuable in teaching you more about Cloudera Enterprise on AWS, and how fast and easy it is to deploy a modern data management platform—in your cloud and on your terms.
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl... (SnapLogic)
In this webinar, learn how SnapLogic and Amazon Web Services helped Earth Networks create a responsive, self-service cloud for data integration, preparation and analytics.
We also discuss how Earth Networks gained faster data insights using SnapLogic’s Amazon Redshift data integration and other connectors to quickly integrate, transfer and analyze data from multiple applications.
To learn more, visit: www.snaplogic.com/redshift
Oracle GoldenGate Cloud Service Overview (Jinyu Wang)
This new PaaS solution in Oracle Public Cloud extends real-time data replication from on-premises systems to the cloud, leading the innovation of real-time data movement with powerful data streaming capabilities for enterprise solutions.
Security, ETL, BI & Analytics, and Software Integration (DataWorks Summit)
This document discusses how Liberty Mutual Insurance is using a Hadoop data lake to power analytics. It provides examples of how the data lake is used to integrate data from various sources and support various use cases. Specifically, it discusses how the data lake enables:
- Storage of structured and unstructured data from multiple sources in a centralized and secure location
- Analytics and machine learning by data scientists and analysts accessing the stored data
- Integrations with tools like Elasticsearch, Spark, and PowerBI for querying, analyzing and visualizing the data
- Archiving of log and sensor data from systems into hot, warm and cold storage tiers based on age and access frequency
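The last point, age-based tiering, comes down to a simple routing rule. A toy sketch of such a rule; the cutoffs are purely illustrative, not Liberty Mutual's actual policy:

```python
# Toy routing rule for age-based storage tiering (hot / warm / cold).
# Cutoff values are purely illustrative, not an actual retention policy.
from datetime import datetime, timedelta, timezone

HOT_CUTOFF = timedelta(days=30)    # illustrative
WARM_CUTOFF = timedelta(days=365)  # illustrative

def storage_tier(last_accessed: datetime) -> str:
    """Route data to a tier based on how recently it was accessed."""
    age = datetime.now(timezone.utc) - last_accessed
    if age <= HOT_CUTOFF:
        return "hot"   # fast, expensive storage for active data
    if age <= WARM_CUTOFF:
        return "warm"  # cheaper storage, still online
    return "cold"      # archival tier for rarely touched data

print(storage_tier(datetime.now(timezone.utc) - timedelta(days=90)))  # -> warm
```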
Presentation discussing a major shift in enterprise data management: the move away from the older hub-and-spoke data architecture and towards the newer, more modern Kappa data architecture.
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out:
• How to overcome common challenges faced by enterprises trying to access their SAP data
• How you can integrate SAP data in real time with change data capture (CDC) technology
• How organisations are using Attunity Replicate for SAP to stream SAP data into Kafka
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
Verizon Centralizes Data into a Data Lake in Real Time for Analytics (DataWorks Summit)
Verizon – Global Technology Services (GTS) was challenged by a multi-tier, labor-intensive process when trying to migrate data from disparate sources into a data lake to create financial reports and business insights. Join this session to learn more about how Verizon:
• Easily accessed data from multiple sources including SAP data
• Ingested data into major targets including Hadoop
• Achieved real-time insights from data leveraging change data capture (CDC) technology
• Reduced costs and labor
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT was managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle’s Integrated Cloud is one cloud for the entire business, meeting everyone’s needs. It’s about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and Real-time Streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
Insights into Real World Data Management Challenges (DataWorks Summit)
Data is your most valuable business asset, and it is also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data, to the bubbling open source frameworks, to limited industry skills and mounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
The document discusses Oracle's cloud-based data lake and analytics platform. It provides an overview of the key technologies and services available, including Spark, Kafka, Hive, object storage, notebooks and data visualization tools. It then outlines a scenario for setting up storage and big data services in Oracle Cloud to create a new data lake for batch, real-time and external data sources. The goal is to provide an agile and scalable environment for data scientists, developers and business users.
How Oracle has managed to separate its flagship database's SQL engine, which processes the queries, from the access drivers that read the data, both from files on the Hadoop Distributed File System and from the data warehousing tool Hive.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real... (Impetus Technologies)
In spite of investments in big data lakes, expensive proprietary products remain in wide use for data ingestion, integration, and transformation (ETL) when bringing data onto the lake and processing it there.
Enterprises have successfully tested Apache Spark for its versatility and strengths as a distributed computing framework that can completely handle all needs for data processing, analytics, and machine learning workloads.
Since the Hadoop distributions and the public cloud already include Apache Spark, there is nothing new to be procured. However, the skills required to put Spark to good use are typically unavailable today.
In this webinar, we will discuss how Apache Spark can be an inexpensive enterprise backbone for all types of data processing workloads. We will also demo how a visual framework on top of Apache Spark makes it much more viable.
The following scenarios will be covered:
On-Prem:
- Data quality and ETL with Apache Spark using pre-built operators
- Advanced monitoring of Spark pipelines
On Cloud:
- Visual, interactive development of Apache Spark Structured Streaming pipelines
- IoT use case with event time, late arrivals and watermarks (sketched below)
- Python-based predictive analytics running on Spark
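To make the event-time scenario above concrete: a minimal PySpark Structured Streaming sketch of windowed counts with a watermark, so late-arriving events are still folded in up to a bound and then dropped. The topic, field names and thresholds are hypothetical, and the Kafka source requires the spark-sql-kafka package on the classpath:

```python
# Minimal Structured Streaming sketch for the event-time scenario:
# windowed counts with a watermark so late events are accepted up to a bound.
# Topic, field names, and thresholds are hypothetical; the Kafka source needs
# the spark-sql-kafka-0-10 package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-watermark-demo").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "iot-readings")  # hypothetical topic
    .load()
)

# Pull device id and event-time out of the JSON message body.
events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.device").alias("device"),
    F.get_json_object(F.col("value").cast("string"), "$.ts")
     .cast("timestamp").alias("event_time"),
)

# Accept events up to 10 minutes late; older window state is finalized.
counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "device")
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```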
Oracle provides a complete cloud platform spanning database as a service, software as a service, platform as a service and infrastructure as a service. The platform allows for three stages of cloud migration: migrating applications, extending existing systems into the cloud, and fully transforming systems with cloud-based insights and engagement. Oracle also supports six pathways to the cloud, including optimizing existing on-premises systems with cloud services or migrating workloads to the public cloud over time.
Oracle Unified Information Architecture + Analytics by Example (Harald Erb)
The talk begins with an architectural overview of the UIA components and how they interact. A use case then shows how, in the "UIA Data Reservoir", current data can be kept cost-effectively "as is" in a Hadoop File System (HDFS) while refined data lives in an Oracle 12c Data Warehouse, with the two combined, evaluated via direct access in Oracle Business Intelligence, or explored for new correlations with Endeca Information Discovery.
Unlocking Big Data Silos in the Enterprise or the Cloud (Con7877) (Jeffrey T. Pollock)
The document discusses Oracle Data Integration solutions for unifying big data silos in enterprises and the cloud. The key points covered include:
- Oracle Data Integration provides data integration and governance capabilities for real-time data movement, transformation, federation, quality and verification, and metadata management.
- It supports a highly heterogeneous set of data sources, including various database platforms, big data technologies like Hadoop, cloud applications, and open standards.
- The solutions discussed help improve agility, reduce costs and risk, and provide comprehensive data integration and governance capabilities for enterprises.
Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform (EMC)
The document discusses Pivotal HD, a Hadoop distribution from Pivotal. It provides an overview of key features of Pivotal HD 2.0 including improved support for real-time analytics using Gemfire XD, enhanced machine learning and SQL capabilities, and integration with the Isilon storage platform. The presentation highlights how Pivotal HD can help customers build a "data lake" to store all of their data and gain insights to create new data-driven services and applications.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud (DataWorks Summit)
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
This document discusses IBM's Integrated Analytics System (IIAS), which is a next generation hybrid data warehouse appliance. Some key points:
- IIAS provides high performance analytics capabilities along with data warehousing and management functions.
- It utilizes a common SQL engine to allow workloads and skills to be portable across public/private clouds and on-premises.
- The system is designed for flexibility with the ability to independently scale compute and storage capacity. It also supports a variety of workloads including reporting, analytics, and operational analytics.
- IBM is positioning IIAS to address top customer requirements around broader workloads, higher concurrency, in-place expansion, and availability solutions.
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle) (Rittman Analytics)
Oracle Data Integration Platform is a cornerstone for big data solutions that provides five core capabilities: business continuity, data movement, data transformation, data governance, and streaming data handling. It includes eight core products that can operate in the cloud or on-premise, and is considered the most innovative in areas like real-time/streaming integration and extract-load-transform capabilities with big data technologies. The platform offers a comprehensive architecture covering key areas like data ingestion, preparation, streaming integration, parallel connectivity, and governance.
Analytics and Lakehouse Integration Options for Oracle Applications (Ray Février)
The document discusses various options for extracting data from Oracle Fusion and Oracle EPM Cloud applications for analytics purposes. It outlines using the Business Intelligence Cloud Connector (BICC) to extract data to object storage, which can then be loaded into Oracle Analytics Cloud (OAC) or Autonomous Data Warehouse (ADW) for analysis. For EPM Cloud, it notes using the EPM Automate REST API wrapper or Oracle Data Integrator Marketplace connector. The document provides an overview of tools like OAC, ADW, ODI, and OCI Data Integration that can help transform and model the data for analytics and machine learning.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
Similar to 2017 OpenWorld Keynote for Data Integration
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
The document discusses data mesh vs data fabric architectures. It defines data mesh as a decentralized data processing architecture with microservices and event-driven integration of enterprise data assets across multi-cloud environments. The key aspects of data mesh are that it is decentralized, processes data at the edge, uses immutable event logs and streams for integration, and can move all types of data reliably. The document then provides an overview of how data mesh architectures have evolved from hub-and-spoke models to more distributed designs using techniques like kappa architecture and describes some use cases for event streaming and complex event processing.
Oracle OpenWorld London - session for Stream Analysis, time series analytics, streaming ETL, streaming pipelines, big data, kafka, apache spark, complex event processing
Brief training targeted to middle school aged students who are participating in First Lego League robotics and planning to use a version control tool such as EV3Hub
This is a brief technology introduction to Oracle Stream Analytics, and how to use the platform to develop streaming data pipelines that support a wide variety of industry use cases
GoldenGate and Stream Processing with Special Guest Rakuten (Jeffrey T. Pollock)
Oracle OpenWorld roadmap presentation for GoldenGate, stream processing, analytics and big data use cases with special guest presenters from Rakuten Travel.
A modern approach to streaming data integration, event processing with a big data (kappa style) data architecture. Key patterns are discussed with pros/cons of newer approaches and open source technologies. Focus on Oracle and GoldenGate technology. OpenWorld 2018 presentation.
The document discusses the growing role of the Chief Data Officer (CDO) position. It notes that by 2017, half of banking/insurance firms and a third of Fortune 100 companies will have a CDO. CDOs face challenges around ensuring executive support, building data management frameworks, and monetizing data assets. The document outlines strategies CDOs can employ, such as accelerating analytics, adopting open source technologies, and governing data through metadata and quality processes. It positions Oracle as providing a complete data solution to help CDOs address these challenges.
Strata 2015 presentation from Oracle for Big Data - we are announcing several new big data products including GoldenGate for Big Data, Big Data Discovery, Oracle Big Data SQL and Oracle NoSQL
One Slide Overview: ORCL Big Data Integration and Governance (Jeffrey T. Pollock)
This document discusses Oracle's approach to big data integration and governance. It describes Oracle tools like GoldenGate for real-time data capture and movement, Data Integrator for data transformation both on and off the Hadoop cluster, and governance tools for data preparation, profiling, cleansing, and metadata management. It positions Oracle as a leader in big data integration through capabilities like non-invasive data capture, low-latency data movement, and pushdown processing techniques pioneered by Oracle to optimize distributed queries.
This document discusses Oracle's data integration and governance solutions for big data. It describes how Oracle uses data integration to load and transform data from various sources into a data reservoir. It also emphasizes the importance of data governance when managing big data and describes Oracle's metadata management, data profiling, and data cleansing tools to help govern data in the reservoir.
The document provides lessons from iconic product managers throughout history, including Thomas J. Watson Jr., Henry Ford, Steve Jobs, Bill Gates, Ferdinand Porsche, and others. It discusses their philosophies and contributions, such as Watson's belief that good design is good business, Ford's views on quality and market saturation, Jobs' focus on deciding what not to do, and Gates' creation of new markets. Contemporary visionaries like Elon Musk, Larry Ellison, Jeff Bezos, and Larry Page are also examined for their product leadership, vision, and business strategies. Lesser known figures like Marissa Mayer, Jack Dorsey, and Thomas Kurian are highlighted for enforcing vision, identifying opportunities, and using their own products
This document discusses Klarna Tech Talk on managing data. It provides an overview of IBM's data integration, governance, and big data capabilities. IBM states it can help clients turn information into insights, deepen engagement, enable agile business, accelerate innovation, deliver enterprise mobility, optimize infrastructure, and manage risk through technology innovations like big data analytics, security intelligence, cloud computing, and mobile solutions. The document promotes IBM's data fabric and smart data solutions for integrating, governing, and providing access to data across an organization.
The document discusses information management challenges in today's data-intensive world. It highlights how IBM offers a comprehensive vision and single platform to address issues like extreme data growth, complexity, and the need for real-time insights. IBM helps organizations optimize investments, improve customer satisfaction, increase coupon redemption rates, and reduce road congestion through analytics, governance, integration, and other solutions.
The document provides an introduction to the Semantic Web by defining it in multiple ways: a) as a family of Web standards to make data easier to use and reuse, b) as an upgrade to the current Web enabling more intelligent applications, and c) as a collection of metadata technologies to improve business software adaptability and responsiveness. It notes what the Semantic Web is not (e.g. not a better search engine or tagged HTML) and provides examples of how the Semantic Web could benefit individuals by making their lives simpler and businesses by empowering new capabilities and reducing IT costs through standardized metadata linking. Finally, it discusses some early examples and implementations as well as next steps for exploring and prototyping with Semantic
The ColdBox Debugger module is a lightweight performance monitor and profiling tool for ColdBox applications. It can generate a friendly debugging panel on every rendered page or a dedicated visualizer to make your ColdBox application development more excellent, funnier, and greater!
In recent years, technological advancements have reshaped human interactions and work environments. However, with rapid adoption comes new challenges and uncertainties. As we face economic challenges in 2023, business leaders seek solutions to address their pressing issues.
About 10 years after the original proposal, EventStorming is now a mature tool with a variety of formats and purposes.
While the question "can it work remotely?" is still in the air, the answer may not be that obvious.
This talk can be a mature entry point to EventStorming, in the post-pandemic years.
India's best AMC service management software. Grow using AMC management software that is easy and low-cost. Best pest control software and RO service software.
Strengthening Web Development with CommandBox 6: Seamless Transition and Scal... (Ortus Solutions, Corp)
Join us for a session exploring CommandBox 6’s smooth website transition and efficient deployment. CommandBox revolutionizes web development, simplifying tasks across Linux, Windows, and Mac platforms. Gain insights and practical tips to enhance your development workflow.
Come join us for an enlightening session where we delve into the smooth transition of current websites and the efficient deployment of new ones using CommandBox 6. CommandBox has revolutionized web development, consistently introducing user-friendly enhancements that catalyze progress in the field. During this presentation, we’ll explore CommandBox’s rich history and showcase its unmatched capabilities within the realm of ColdFusion, covering both major variations.
The journey of CommandBox has been one of continuous innovation, constantly pushing boundaries to simplify and optimize development processes. Regardless of whether you’re working on Linux, Windows, or Mac platforms, CommandBox empowers developers to streamline tasks with unparalleled ease.
In our session, we’ll illustrate the simple process of transitioning existing websites to CommandBox 6, highlighting its intuitive features and seamless integration. Moreover, we’ll unveil the potential for effortlessly deploying multiple websites, demonstrating CommandBox’s versatility and adaptability.
Join us on this journey through the evolution of web development, guided by the transformative power of CommandBox 6. Gain invaluable insights, practical tips, and firsthand experiences that will enhance your development workflow and embolden your projects.
Streamlining End-to-End Testing Automation with Azure DevOps Build & Release Pipelines
Automating end-to-end (e2e) test for Android and iOS native apps, and web apps, within Azure build and release pipelines, poses several challenges. This session dives into the key challenges and the repeatable solutions implemented across multiple teams at a leading Indian telecom disruptor, renowned for its affordable 4G/5G services, digital platforms, and broadband connectivity.
Challenge #1. Ensuring Test Environment Consistency: Establishing a standardized test execution environment across hundreds of Azure DevOps agents is crucial for achieving dependable testing results. This uniformity must seamlessly span from Build pipelines to various stages of the Release pipeline.
Challenge #2. Coordinated Test Execution Across Environments: Executing distinct subsets of tests using the same automation framework across diverse environments, such as the build pipeline and specific stages of the Release Pipeline, demands flexible and cohesive approaches.
Challenge #3. Testing on Linux-based Azure DevOps Agents: Conducting tests, particularly for web and native apps, on Azure DevOps Linux agents lacking browser or device connectivity presents specific challenges in attaining thorough testing coverage.
This session delves into how these challenges were addressed through:
1. Automate the setup of essential dependencies to ensure a consistent testing environment.
2. Create standardized templates for executing API tests, API workflow tests, and end-to-end tests in the Build pipeline, streamlining the testing process.
3. Implement task groups in Release pipeline stages to facilitate the execution of tests, ensuring consistency and efficiency across deployment phases.
4. Deploy browsers within Docker containers for web application testing, enhancing portability and scalability of testing environments.
5. Leverage diverse device farms dedicated to Android, iOS, and browser testing to cover a wide range of platforms and devices.
6. Integrate AI technology, such as Applitools Visual AI and Ultrafast Grid, to automate test execution and validation, improving accuracy and efficiency.
7. Utilize AI/ML-powered central test automation reporting server through platforms like reportportal.io, providing consolidated and real-time insights into test performance and issues.
These solutions not only facilitate comprehensive testing across platforms but also promote the principles of shift-left testing, enabling early feedback, implementing quality gates, and ensuring repeatability. By adopting these techniques, teams can effectively automate and execute tests, accelerating software delivery while upholding high-quality standards across Android, iOS, and web applications.
Tired of managing scheduled tasks in the CFML engine administrators? Why does everything have to be a URL? How can I test my tasks? How can I make them portable? How can I make them more human, for Pete’s sake? Now you can with Box Tasks!
Join me for an insightful journey into task scheduling within the ColdBox framework for ANY CFML application, not only ColdBox. In this session, we’ll dive into how you can effortlessly create and manage scheduled tasks directly in your code, bringing a new level of control and efficiency to your applications and modules. You’ll also get a first-hand look at a user-friendly dashboard that makes managing and monitoring these tasks a breeze. Whether you’re a ColdBox veteran or just starting, this session will offer practical knowledge and tips to enhance your development workflow. Let’s explore how task scheduling in ColdBox can simplify your development process and elevate your applications.