Engineering Machine Learning Data Pipelines
Streaming Data Changes
Paige Roberts
Integrate Product Marketing Manager
Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
Data Engineer to the Rescue
Data Scientist
• Expert in statistical analysis, machine learning techniques, finding answers to business questions buried in datasets.
• Does NOT want to spend 50-90% of their time tinkering with data, getting it into good shape to train models, but frequently does, especially if there's no data engineer on their team.
• When a machine learning model is trained, tested, and proven to accomplish the goal, turns it over to a data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially not at large scale.
Data Engineer
• Expert in data structures, data manipulation, and constructing production data pipelines.
• WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that will speed up their work is helpful.
• In most successful companies, is involved from the beginning. First gathers, cleans and standardizes data, helps the data scientist with feature engineering, provides top-notch data, ready to train models.
• After the model is tested, builds robust, high-scale data pipelines to feed the models the data they need in the correct format in production to provide ongoing business value.
Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from POS, web clicks, etc. all in
incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools
are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific entity (person, company,
product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power.
Essentially everything has to be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, in order for models
to accurately make predictions on new data, and for required audit trails. Capture of complete lineage,
from source to end point is needed.
5. Need for Ongoing Real-Time Changed Data Capture and Streaming Data Capture
Tracking and detection needs to happen very rapidly. Current transactions need to be constantly added
to combined datasets, prepared and presented to models as close to real-time as possible.
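To make challenge 3 concrete, here is a minimal sketch of multi-field matching in which every record is compared with every other record. The field names, weights, threshold, and sample records are all assumptions chosen for illustration; they are not taken from the slides or from any product.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field weights; a real matcher would tune these per dataset.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities (weights add up to 1)."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def find_matches(records: list, threshold: float = 0.85) -> list:
    """Compare every record with every other record: n*(n-1)/2 comparisons."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if record_score(a, b) >= threshold
    ]


# Made-up sample records: 1 and 2 describe the same person with small variations.
records = [
    {"id": 1, "name": "Jon Smith", "email": "jon.smith@example.com", "city": "Boston"},
    {"id": 2, "name": "John Smith", "email": "jon.smith@example.com", "city": "boston"},
    {"id": 3, "name": "Maria Garcia", "email": "mgarcia@example.org", "city": "Austin"},
]
matches = find_matches(records)
```

The number of comparisons grows quadratically with the number of records, which is why the slide stresses compute power at scale.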
DMX Change Data Capture
Keep data in sync in real-time
• Without overloading networks.
• Without affecting source database
performance.
• Without coding or tuning.
Reliable transfer of data you can trust even if connectivity fails on either side.
• Auto restart.
• No data loss.
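The slides do not describe how auto restart and no data loss are achieved. One common general approach is to persist the last applied log position as a checkpoint and resume from it after a failure. The sketch below illustrates that idea only; the function names, the `fail_at` parameter, and the checkpoint file are hypothetical and do not describe DMX internals.

```python
import os
import tempfile


def read_checkpoint(path: str) -> int:
    """Return the next log position to apply, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0


def write_checkpoint(path: str, position: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(position))
    os.replace(tmp, path)


def replicate(log, apply, checkpoint_path, fail_at=None):
    """Apply log entries from the last checkpoint onward.

    fail_at simulates a connection drop just before that position is applied.
    """
    for pos in range(read_checkpoint(checkpoint_path), len(log)):
        if pos == fail_at:
            raise ConnectionError(f"link dropped at position {pos}")
        apply(log[pos])
        write_checkpoint(checkpoint_path, pos + 1)


log = ["c0", "c1", "c2", "c3"]  # stand-in for captured changes
target = []                      # stand-in for the target system
ckpt = os.path.join(tempfile.mkdtemp(), "position.ckpt")

try:
    replicate(log, target.append, ckpt, fail_at=2)  # first run fails midway
except ConnectionError:
    pass
replicate(log, target.append, ckpt)                 # restart resumes at position 2
```

Because the checkpoint is written after each apply, a crash between the two steps would replay one change on restart, so in practice the apply step should be idempotent.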
[Diagram: real-time replication with transformation, conflict resolution, collision monitoring, and tracking and auditing. Endpoints shown: files, RDBMS, streams, mainframe, data lake, cloud, OLAP.]
DMX Change Data Capture Sources and Targets
SOURCES
• IBM Db2/z
• IBM Db2/i
• IBM Db2/LUW
• VSAM
• Kafka
• Oracle
• Oracle RAC (Real Application Clusters)
• MS SQL Server
• IBM Informix
• Sybase
TARGETS
• Kafka
• Amazon Kinesis
• Teradata
• HDFS
• Hive (HDFS, ORC, Avro, Parquet)
• Impala (Parquet, Kudu)
• IBM Db2
• SQL Server
• MS Azure SQL
• PostgreSQL
• MySQL
• Oracle
• Oracle RAC
• Sybase
• And more …
[Diagram: real-time replication with transformation, conflict resolution, collision monitoring, and tracking and auditing. Endpoints shown: files, RDBMS, streams, mainframe, data hub, cloud, OLAP.]
Simple Customer Example Architecture
[Architecture diagram. Labels: database sources; mainframe sources (VSAM, Db2); capture agent; edge node; cluster data nodes; machine learning on Spark; long-term analyses on Hive; BI reporting on Azure SQL.]
Log-Based Database to Database
• Captures database changes as they happen
• Transforms and enhances data during replication
• Minimizes bandwidth usage with LAN/WAN
friendly replication
• Ensures data integrity with conflict resolution
and collision monitoring
• Enables tracking and auditing of transactions for
compliance
• Latency – sub-second
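As a rough illustration of the bullets above, the sketch below applies captured change events to a target in log order and flags a collision whenever the target row does not match the event's "before" image. The event layout, the field names, and the "source wins" resolution rule are assumptions made for this example, not a DMX format or policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """One captured change. The field names are hypothetical."""
    op: str                  # "insert", "update", or "delete"
    key: int
    before: Optional[dict]   # row image before the change (None for inserts)
    after: Optional[dict]    # row image after the change (None for deletes)


def apply_changes(target: dict, events: list) -> list:
    """Apply events in log order and return the events that collided.

    A collision means the target row differed from the event's 'before'
    image. In this sketch the source always wins, and the collision is
    returned so it can be audited.
    """
    collisions = []
    for ev in events:
        if target.get(ev.key) != ev.before:
            collisions.append(ev)
        if ev.op == "delete":
            target.pop(ev.key, None)
        else:
            target[ev.key] = dict(ev.after)
    return collisions


# Made-up example: the third event expects balance 55 but the target holds 50.
target = {1: {"balance": 100}}
events = [
    ChangeEvent("update", 1, {"balance": 100}, {"balance": 80}),
    ChangeEvent("insert", 2, None, {"balance": 50}),
    ChangeEvent("update", 2, {"balance": 55}, {"balance": 60}),
    ChangeEvent("delete", 1, {"balance": 80}, None),
]
collisions = apply_changes(target, events)
```

Other resolution rules (target wins, latest timestamp wins, manual review) fit the same structure by changing what happens after a collision is detected.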
[Diagram: real-time replication with transformation, conflict resolution, collision monitoring, and tracking and auditing. Endpoints shown: RDBMS, RDBMS, OLAP.]
Anything to Stream, Stream to Anything, Stream to Stream
• Real-time capture
• Minimizes bandwidth usage with LAN/WAN
friendly replication
• Parallel load on cluster
• Updates HDFS, Hive or Impala, backed by
HDFS, Parquet, ORC, or Kudu.
• Updates even versions of Hive that did not
support updating
• Latency – Real-time, actual SLA varies
depending on update speed of target,
stream settings, etc. Usually, seconds.
[Diagram: real-time replication with transformation, conflict resolution, collision monitoring, and tracking and auditing. Endpoints shown: files, RDBMS, streams, mainframe, data lake, cloud, OLAP.]
Case Study:
Global Hotel Data Kept Current On the Cloud
CHALLENGE
• More timely collection and reporting on room availability, event bookings, inventory, and other hotel data from 4,000+ properties globally
SOLUTION
• Near real-time reporting: DMX-h consumes property updates from Kafka every 10 seconds
• DMX-h processes data on HDP, loading to Teradata every 30 minutes
• Deployed on Google Cloud Platform
BENEFITS
• Time to Value: DMX-h ease of use drastically cut development time
• Agility: Global reports updated every 30 minutes, previously every 24 hours
• Productivity: Leveraging ETL team for Hadoop (Spark), visual understanding of data pipeline
• Insight: Up-to-date data = better business decisions = happier customers
Log-Based Change Capture to Hadoop
• Real-time capture
• Minimizes bandwidth usage with LAN/WAN
friendly replication
• Parallel load on cluster
• Updates HDFS, Hive or Impala, backed by HDFS,
Parquet, ORC, or Kudu.
• Updates even versions of Hive that did not
support updating
• Latency – Minutes (< 3)
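The slides do not say how updates reach Hive versions that lack update support. A common general pattern for append-only storage is to merge a base snapshot with captured changes and write out a new snapshot. The sketch below shows that pattern in plain Python as an assumption-laden illustration; the record layout (`seq`, `op`, `id`, `row`) is hypothetical and this is not a description of the DMX implementation.

```python
def merge_snapshot(base_rows: list, change_rows: list) -> list:
    """Return a new snapshot: latest version of each key, with deletes dropped.

    base_rows   -- current rows, each a dict with an "id" key
    change_rows -- captured changes, each with "seq", "op", "id", and "row"
    """
    latest = {row["id"]: row for row in base_rows}
    # Replay changes in sequence order so the last change per key wins.
    for change in sorted(change_rows, key=lambda c: c["seq"]):
        if change["op"] == "delete":
            latest.pop(change["id"], None)
        else:  # "insert" or "update"
            latest[change["id"]] = change["row"]
    return sorted(latest.values(), key=lambda r: r["id"])


# Made-up example data; changes arrive out of order on purpose.
base = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
changes = [
    {"seq": 3, "op": "delete", "id": 2, "row": None},
    {"seq": 1, "op": "update", "id": 1, "row": {"id": 1, "status": "closed"}},
    {"seq": 2, "op": "insert", "id": 3, "row": {"id": 3, "status": "open"}},
]
snapshot = merge_snapshot(base, changes)
```

The original files are never modified in place; readers switch to the new snapshot once it is complete, which suits storage formats that cannot update individual rows.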
[Diagram: real-time replication with transformation, conflict resolution, collision monitoring, and tracking and auditing. Endpoints shown: data lake, cloud, files, RDBMS, streams, mainframe.]
Guardian Life Insurance
"We found DMX-h to be very usable and easy to ramp up in terms of skills. Most of all, Syncsort has been a very good partner in terms of support and listening to our needs."
- Alex Rosenthal, Enterprise Data Office
CHALLENGE
Need to enable ML, visualization and BI on a broad range of datasets, and reduce time-to-market for analytics projects.
• Reduce data preparation and transformation times, which caused long delays before new analyses.
• Make data assets available to the whole enterprise, including mainframe data.
SOLUTION
• Hadoop, NoSQL data lake.
• DMX DataFunnel quickly ingested hundreds of database tables at the push of a button.
• DMX-h adds new transformed, standardized data with each new project.
• DMX Change Data Capture pushes changes from DB2 and other sources to the data lake in real time. Current data is up to the minute.
BENEFITS
• Data Marketplace: centralized, reusable, up-to-the-minute current, searchable, accessible, managed, trustworthy data for analytics.
• Fast Time-to-Market for new analytics and reporting.
Symphony Health Provides Healthcare Data Science with DMX-h
CHALLENGE
Data scientists need fresh data and constantly seek to do new analyses.
• Expensive Oracle solution took days to get data to data scientists. Required new schemas from DBA work queues for each new analysis.
• Hadoop helped, but an expensive ETL tool bottlenecked all data processing on an overloaded edge node. Poor performance was blamed on unoptimized workflows.
SOLUTION
• DMX-h
• Apache Spark on Cloudera CDH
• Amazon Redshift
Costs saved both on Hadoop storage and DMX-h data processing. And data scientists can define their own new schemas, with no waiting.
DMX-h also does low-latency push to Amazon Redshift for fast, advanced interactive queries, so Symphony Health can display results to clients in a web application.
BENEFITS
Data available for analysis in minutes, not days.
• No tuning required: "DMX-h is already optimized. We use its Intelligent Execution and it just performs."
• Average 3-5X processing speed increase: On one project, dropped processing times from 20 minutes to 20 seconds.
• No lock-in: If part of a workflow works better in something like PySpark, DMX-h makes it easy to plug in.
Data scientists can ask more questions now, find things out sooner.
"We get the same end result, faster, cheaper, and with a bigger pool of developers to draw from who can do the work. I'm a C# and Java developer who even knows some Scala, and I still like using DMX-h because I can get a lot more done in the same time."
"Before, part of the data wasn't available for a day, and other parts, not for a week. Now it's all available for analysis within minutes of the data arriving."
Robert Hathaway, Senior Manager Big Data
 
Predictive Powerhouse: Elevating AI Accuracy and Relevance with Third-Party Data
Predictive Powerhouse: Elevating AI Accuracy and Relevance with Third-Party DataPredictive Powerhouse: Elevating AI Accuracy and Relevance with Third-Party Data
Predictive Powerhouse: Elevating AI Accuracy and Relevance with Third-Party Data
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
信頼できるデータでESGイニシアチブを成功に導く方法.pdf How to drive success with ESG initiatives with...
信頼できるデータでESGイニシアチブを成功に導く方法.pdf How to drive success with ESG initiatives with...信頼できるデータでESGイニシアチブを成功に導く方法.pdf How to drive success with ESG initiatives with...
信頼できるデータでESGイニシアチブを成功に導く方法.pdf How to drive success with ESG initiatives with...
 
AI-Ready Data - The Key to Transforming Projects into Production.pptx
AI-Ready Data - The Key to Transforming Projects into Production.pptxAI-Ready Data - The Key to Transforming Projects into Production.pptx
AI-Ready Data - The Key to Transforming Projects into Production.pptx
 
Building a Multi-Layered Defense for Your IBM i Security
Building a Multi-Layered Defense for Your IBM i SecurityBuilding a Multi-Layered Defense for Your IBM i Security
Building a Multi-Layered Defense for Your IBM i Security
 
Optimierte Daten und Prozesse mit KI / ML + SAP Fiori.pdf
Optimierte Daten und Prozesse mit KI / ML + SAP Fiori.pdfOptimierte Daten und Prozesse mit KI / ML + SAP Fiori.pdf
Optimierte Daten und Prozesse mit KI / ML + SAP Fiori.pdf
 
Chaining, Looping, and Long Text for Script Development and Automation.pdf
Chaining, Looping, and Long Text for Script Development and Automation.pdfChaining, Looping, and Long Text for Script Development and Automation.pdf
Chaining, Looping, and Long Text for Script Development and Automation.pdf
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
 
Navigating the Cloud: Best Practices for Successful Migration
Navigating the Cloud: Best Practices for Successful MigrationNavigating the Cloud: Best Practices for Successful Migration
Navigating the Cloud: Best Practices for Successful Migration
 
Unlocking the Power of Your IBM i and Z Security Data with Google Chronicle
Unlocking the Power of Your IBM i and Z Security Data with Google ChronicleUnlocking the Power of Your IBM i and Z Security Data with Google Chronicle
Unlocking the Power of Your IBM i and Z Security Data with Google Chronicle
 
How to Build Data Governance Programs That Last - A Business-First Approach.pdf
How to Build Data Governance Programs That Last - A Business-First Approach.pdfHow to Build Data Governance Programs That Last - A Business-First Approach.pdf
How to Build Data Governance Programs That Last - A Business-First Approach.pdf
 
Zukuntssichere SAP Prozesse dank automatisierter Massendaten
Zukuntssichere SAP Prozesse dank automatisierter MassendatenZukuntssichere SAP Prozesse dank automatisierter Massendaten
Zukuntssichere SAP Prozesse dank automatisierter Massendaten
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Crucial Considerations for AI-ready Data.pdf
Crucial Considerations for AI-ready Data.pdfCrucial Considerations for AI-ready Data.pdf
Crucial Considerations for AI-ready Data.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Recently uploaded

Chapter 6 - Test Tools Considerations V4.0
Chapter 6 - Test Tools Considerations V4.0Chapter 6 - Test Tools Considerations V4.0
Chapter 6 - Test Tools Considerations V4.0
Neeraj Kumar Singh
 
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
SOFTTECHHUB
 
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
manji sharman06
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Cyber Recovery Wargame
Cyber Recovery WargameCyber Recovery Wargame
Cyber Recovery Wargame
Databarracks
 
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
Mydbops
 
Brightwell ILC Futures workshop David Sinclair presentation
Brightwell ILC Futures workshop David Sinclair presentationBrightwell ILC Futures workshop David Sinclair presentation
Brightwell ILC Futures workshop David Sinclair presentation
ILC- UK
 
Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0
Neeraj Kumar Singh
 
Building a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data PlatformBuilding a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data Platform
Enterprise Knowledge
 
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside LookScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
 
Dev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous DiscoveryDev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous Discovery
UiPathCommunity
 
Getting Started Using the National Research Platform
Getting Started Using the National Research PlatformGetting Started Using the National Research Platform
Getting Started Using the National Research Platform
Larry Smarr
 
How to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
How to Optimize Call Monitoring: Automate QA and Elevate Customer ExperienceHow to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
How to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
Aggregage
 
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time MLMongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
ScyllaDB
 
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
dipikamodels1
 
Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024
Prasta Maha
 
Guidelines for Effective Data Visualization
Guidelines for Effective Data VisualizationGuidelines for Effective Data Visualization
Guidelines for Effective Data Visualization
UmmeSalmaM1
 
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
ScyllaDB
 

Recently uploaded (20)

Chapter 6 - Test Tools Considerations V4.0
Chapter 6 - Test Tools Considerations V4.0Chapter 6 - Test Tools Considerations V4.0
Chapter 6 - Test Tools Considerations V4.0
 
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
EverHost AI Review: Empowering Websites with Limitless Possibilities through ...
 
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
Call Girls Chandigarh🔥7023059433🔥Agency Profile Escorts in Chandigarh Availab...
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
Cyber Recovery Wargame
Cyber Recovery WargameCyber Recovery Wargame
Cyber Recovery Wargame
 
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
 
Brightwell ILC Futures workshop David Sinclair presentation
Brightwell ILC Futures workshop David Sinclair presentationBrightwell ILC Futures workshop David Sinclair presentation
Brightwell ILC Futures workshop David Sinclair presentation
 
Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0
 
Building a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data PlatformBuilding a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data Platform
 
ScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside LookScyllaDB Topology on Raft: An Inside Look
ScyllaDB Topology on Raft: An Inside Look
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
 
Dev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous DiscoveryDev Dives: Mining your data with AI-powered Continuous Discovery
Dev Dives: Mining your data with AI-powered Continuous Discovery
 
Getting Started Using the National Research Platform
Getting Started Using the National Research PlatformGetting Started Using the National Research Platform
Getting Started Using the National Research Platform
 
How to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
How to Optimize Call Monitoring: Automate QA and Elevate Customer ExperienceHow to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
How to Optimize Call Monitoring: Automate QA and Elevate Customer Experience
 
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time MLMongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
MongoDB vs ScyllaDB: Tractian’s Experience with Real-Time ML
 
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
Call Girls Kochi 💯Call Us 🔝 7426014248 🔝 Independent Kochi Escorts Service Av...
 
Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024Kubernetes Cloud Native Indonesia Meetup - June 2024
Kubernetes Cloud Native Indonesia Meetup - June 2024
 
Guidelines for Effective Data Visualization
Guidelines for Effective Data VisualizationGuidelines for Effective Data Visualization
Guidelines for Effective Data Visualization
 
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
 

Engineering Machine Learning Data Pipelines Series: Streaming New Data as It Changes

  • 1. Engineering Machine Learning Data Pipelines: Streaming Data Changes. Paige Roberts, Integrate Product Marketing Manager
  • 2. Common Machine Learning Applications
      • Anti-money laundering
      • Fraud detection
      • Cybersecurity
      • Targeted marketing
      • Recommendation engine
      • Next best action
      • Customer churn prevention
      • Know your customer
  • 3. Data Engineer to the Rescue
      Data Scientist
      • Expert in statistical analysis, machine learning techniques, and finding answers to business questions buried in datasets.
      • Does NOT want to spend 50 to 90% of their time tinkering with data to get it into good shape for training models, but frequently does, especially if there is no data engineer on their team.
      • When a machine learning model is trained, tested, and proven to accomplish the goal, turns it over to a data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially not at large scale.
      Data Engineer
      • Expert in data structures, data manipulation, and constructing production data pipelines.
      • WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that will speed up their work is helpful.
      • In most successful companies, is involved from the beginning. First gathers, cleans, and standardizes data, helps the data scientist with feature engineering, and provides top-notch data, ready to train models.
      • After the model is tested, builds robust, high-scale data pipelines that feed the models the data they need, in the correct format, in production to provide ongoing business value.
  • 4. Five Big Challenges of Engineering ML Data Pipelines
      1. Scattered and Difficult to Access Datasets: Much of the necessary data is trapped in mainframes or streams in from POS, web clicks, etc., all in incompatible formats, making it difficult to gather and prepare the data for model training.
      2. Data Cleansing at Scale: Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools are not designed to work on that scale of data.
      3. Entity Resolution: Distinguishing matches across massive datasets that indicate a single specific entity (person, company, product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power. Essentially everything has to be compared to everything else.
      4. Tracking Lineage from the Source: Data changes made to help train models have to be exactly duplicated in production, so that models make accurate predictions on new data and so that required audit trails exist. Capture of complete lineage, from source to end point, is needed.
      5. Need for Ongoing Real-Time Changed Data Capture and Streaming Data Capture: Tracking and detection need to happen very rapidly. Current transactions need to be constantly added to combined datasets, prepared, and presented to models as close to real-time as possible.
  • 5. DMX Change Data Capture
      Keep data in sync in real-time
      • Without overloading networks.
      • Without affecting source database performance.
      • Without coding or tuning.
      Reliable transfer of data you can trust even if connectivity fails on either side.
      • Auto restart.
      • No data loss.
      [Diagram: Real-Time Replication with Transformation; Conflict Resolution, Collision Monitoring, Tracking and Auditing. Labels: Files, RDBMS, Streams, Mainframe, Data Lake, Cloud, OLAP]
  • 6. DMX Change Data Capture Sources and Targets
      Sources
      • IBM Db2/z
      • IBM Db2/i
      • IBM Db2/LUW
      • VSAM
      • Kafka
      • Oracle
      • Oracle RAC (Real Application Clusters)
      • MS SQL Server
      • IBM Informix
      • Sybase
      Targets
      • Kafka
      • Amazon Kinesis
      • Teradata
      • HDFS
      • Hive (HDFS, ORC, Avro, Parquet)
      • Impala (Parquet, Kudu)
      • IBM Db2
      • SQL Server
      • MS Azure SQL
      • PostgreSQL
      • MySQL
      • Oracle
      • Oracle RAC
      • Sybase
      • And more …
      [Diagram: Real-Time Replication with Transformation; Conflict Resolution, Collision Monitoring, Tracking and Auditing. Labels: Files, RDBMS, Streams, Mainframe, Data Hub, Cloud, OLAP]
  • 7. Simple Customer Example Architecture
      [Diagram labels: Edge Node, Cluster Data Nodes, Database Sources, Mainframe Sources, VSAM, Db2, Capture Agent, Machine Learning on Spark, Long-Term Analyses on Hive, BI Reporting on Azure SQL]
  • 8. Log-Based Database to Database
      • Captures database changes as they happen
      • Transforms and enhances data during replication
      • Minimizes bandwidth usage with LAN/WAN friendly replication
      • Ensures data integrity with conflict resolution and collision monitoring
      • Enables tracking and auditing of transactions for compliance
      • Latency: sub-second
      [Diagram: Real-Time Replication with Transformation; Conflict Resolution, Collision Monitoring, Tracking and Auditing. Labels: RDBMS, RDBMS, OLAP]
  • 9. Anything to Stream, Stream to Anything, Stream to Stream
      • Real-time capture
      • Minimizes bandwidth usage with LAN/WAN friendly replication
      • Parallel load on cluster
      • Updates HDFS, Hive or Impala, backed by HDFS, Parquet, ORC, or Kudu.
      • Updates even versions of Hive that did not support updating
      • Latency: real-time; actual SLA varies depending on update speed of target, stream settings, etc. Usually seconds.
      [Diagram: Real-Time Replication with Transformation; Conflict Resolution, Collision Monitoring, Tracking and Auditing. Labels: Files, RDBMS, Streams, Mainframe, Data Lake, Cloud, OLAP]
  • 10. Case Study: Global Hotel Data Kept Current on the Cloud
      Syncsort Data Integration and Data Quality for the Cloud
      Challenge
      • More timely collection and reporting on room availability, event bookings, inventory, and other hotel data from 4,000+ properties globally
      Solution
      • Near real-time reporting: DMX-h consumes property updates from Kafka every 10 seconds
      • DMX-h processes data on HDP, loading to Teradata every 30 minutes
      • Deployed on Google Cloud Platform
      Benefits
      • Time to Value: DMX-h ease of use drastically cut development time
      • Agility: Global reports updated every 30 minutes, compared with every 24 hours before
      • Productivity: Leveraging ETL team for Hadoop (Spark), visual understanding of data pipeline
      • Insight: Up-to-date data = better business decisions = happier customers
  • 11. Log-Based Change Capture to Hadoop
      • Real-time capture
      • Minimizes bandwidth usage with LAN/WAN friendly replication
      • Parallel load on cluster
      • Updates HDFS, Hive or Impala, backed by HDFS, Parquet, ORC, or Kudu.
      • Updates even versions of Hive that did not support updating
      • Latency: minutes (< 3)
      [Diagram: Real-Time Replication with Transformation; Conflict Resolution, Collision Monitoring, Tracking and Auditing. Labels: Files, RDBMS, Streams, Mainframe, Data Lake, Cloud]
  • 12. Guardian Life Insurance
      "We found DMX-h to be very usable and easy to ramp up in terms of skills. Most of all, Syncsort has been a very good partner in terms of support and listening to our needs." (Alex Rosenthal, Enterprise Data Office)
      Need to enable ML, visualization, and BI on a broad range of datasets, and reduce time-to-market for analytics projects.
      • Reduce data preparation and transformation times, which caused a long delay before new analyses.
      • Make data assets available to the whole enterprise, including mainframe data.
      Solution
      • Hadoop, NoSQL data lake.
      • DMX DataFunnel quickly ingested hundreds of database tables at the push of a button.
      • DMX-h adds new transformed, standardized data with each new project.
      • DMX Change Data Capture pushes changes from DB2 and other sources to the data lake in real-time. Data is current up to the minute.
      Data Marketplace: centralized, reusable, up-to-the-minute current, searchable, accessible, managed, trustworthy data for analytics.
      Fast time-to-market for new analytics and reporting.
  • 13. Symphony Health Provides Healthcare Data Science with DMX-h
      Data scientists need fresh data and constantly seek to do new analyses. Expensive Oracle solution took days to get data to data scientists. Required new schemas from DBA work queues for each new analysis. Hadoop helped, but expensive ETL tool bottlenecked all data processing on overloaded edge node. Blamed poor performance on unoptimized workflows.
      Solution:
      • DMX-h
      • Apache Spark on Cloudera CDH
      • Amazon Redshift
      Data available for analysis in minutes, not days.
      • No tuning required: "DMX-h is already optimized. We use its Intelligent Execution and it just performs."
      • Average 3-5X processing speed increase: On one project, processing times dropped from 20 minutes to 20 seconds.
      • No lock-in: If part of a workflow works better in something like PySpark, DMX-h makes it easy to plug in.
      "We get the same end result, faster, cheaper, and with a bigger pool of developers to draw from who can do the work. I'm a C# and Java developer who even knows some Scala, and I still like using DMX-h because I can get a lot more done in the same time."
      "Before, part of the data wasn't available for a day, and other parts, not for a week. Now it's all available for analysis within minutes of the data arriving." (Robert Hathaway, Senior Manager, Big Data)
      Costs saved both on Hadoop storage and DMX-h data processing. Data scientists can also define their own new schemas, with no waiting. DMX-h also does a low-latency push to Amazon Redshift for fast, advanced interactive queries, so Symphony Health can display results to clients in a web application. Data scientists can ask more questions now and find things out sooner.
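The log-based change capture described in the slides can be illustrated with a minimal Python sketch. This is not how DMX Change Data Capture is implemented; `ChangeEvent`, `Replica`, and the `seq` field are hypothetical names invented for the example. It assumes each change read from a source log carries a strictly increasing sequence number, which lets the target skip events it has already applied when a batch is resent after a restart.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change read from a source log (hypothetical shape)."""
    seq: int                      # position in the source log, strictly increasing
    op: str                       # "insert", "update", or "delete"
    key: str                      # primary key of the affected row
    row: Optional[dict] = None    # new row image; None for deletes


class Replica:
    """In-memory stand-in for a target table (an assumption for illustration)."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.last_seq = 0         # highest log position applied so far

    def apply(self, events: Iterable[ChangeEvent]) -> int:
        """Apply events in log order and return how many were applied.

        Events at or below last_seq are skipped, so replaying a batch
        after a restart does not change the target a second time.
        """
        applied = 0
        for ev in sorted(events, key=lambda e: e.seq):
            if ev.seq <= self.last_seq:
                continue          # already applied: duplicate delivery
            if ev.op in ("insert", "update"):
                self.rows[ev.key] = dict(ev.row or {})
            elif ev.op == "delete":
                self.rows.pop(ev.key, None)
            else:
                raise ValueError(f"unknown operation: {ev.op}")
            self.last_seq = ev.seq
            applied += 1
        return applied
```

Tracking the last applied log position is one simple way to get restart-safe replication without data loss or double application; a production system would also persist that position together with the target data.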