Cost Optimization for Hadoop/Spark
Workloads with Amazon EMR
Presented by:
June 2, 2020
Pritpal Sahota
Technical Account Manager
Provectus
Stepan Pushkarev
Chief Technical Officer
Provectus
Nirav Shah
Senior Solution Architect
Amazon Web Services
Perry Peterson
Business Development Manager
Amazon Web Services
1. Show how migrating to Amazon EMR can optimize cost
2. Share risk mitigation strategies and best practices for migrating
Hadoop/Spark workloads to Amazon EMR
Webinar Objectives
• Introduction
• Hadoop market and Cost optimizations using Amazon EMR
• Cost related and other challenges of on-prem Hadoop clusters
• Cost optimizations by using Amazon EMR and migration best
practices
• Amazon EMR migration acceleration workshop overview
Agenda
Stepan Pushkarev
Chief Technology Officer
Provectus
Pritpal Sahota
Technical Account Executive
Provectus
Nirav Shah
Senior Solutions Architect
Amazon Web Services
Perry Peterson
Business Development Manager – Analytics
Amazon Web Services
Presenters
AWS Partner Network (APN) Premier Consulting Partner
AI-first Consultancy & Solutions Provider
Clients ranging from fast-growing startups to large enterprises
450 employees and growing
Established in 2010
HQ in Palo Alto
Offices across the US, Canada, and Europe
Machine Learning
Employ analytical algorithms
to unveil hidden value from
raw data that helps solve
business challenges
DevOps/DevSecOps
Improve development and
delivery pipelines to bring
your product to the market
faster and resiliently
Next Gen Cloud
Modernize your application
and data landscape to allow
for more agility and better
service to your customers
Big Data
Gain data-driven insights
through the holistic data
analysis made available with
a big data platform
AWS Competencies in Machine Learning, Data & Analytics, and DevOps
Core Competencies
Innovative Tech Vendors
Seeking niche expertise to
differentiate and win the market
Enterprises
Seeking to accelerate innovation,
achieve operational excellence
Clientele
Hadoop Market and Cost Optimization
using Amazon EMR
Rapid growth of cloud adoption in big data space
7.5x faster than on-prem installs as per Forrester Research
Uncertainty with leading Hadoop commercial vendors
Leading commercial Hadoop vendors face uncertainty & headwinds. Customers are
exploring cloud to leverage cost benefits, flexibility, scalability, & performance per price
Large & growing Hadoop market
According to a market study report, over the next five years the Hadoop market
will register 33% annual revenue growth, with market size reaching $9.4B by 2024
Availability of Resources
Big data engineers prefer to work on cloud based big data solutions
Hadoop market
Amazon EMR is an enterprise-grade Spark/Hadoop managed service helping businesses, researchers, data analysts, and developers to process and
analyze vast amounts of data. EMR solves complex technical/business challenges: clickstream and log analysis along with real-time and predictive
analytics. In comparison to on-premises deployments, IDC confirms Amazon EMR provides year 1 savings of 57% and 342% ROI over 5 years.
What is EMR & where is it in the Analytics stack?
EMR powers most cloud Hadoop/Spark projects
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights
reserved.
processes 135B events/day and have cost savings of 60% (~$20M)
decreased costs by $600k in less than 5 months
saves 75% and is 60% more efficient
achieves costs savings of 55% when compared to on-demand
pricing and 40% savings when compared to Reserved Instances
High-impact results with Amazon EMR
near real-time analytics for 140M players
scales 3,000 transient clusters on a daily basis
powers the Predix solution processing 1M data executions/day
computes Zestimates on 100M+ homes in hours instead of 1 day
reduced cost of operation and improved Spark performance 3x
High-impact results with Amazon EMR
NinthDecimal is the omnichannel marketing platform
helping Fortune 500 brands identify new prospects and
customers, drive store visits, and increase sales using
AI- and data-driven consumer intelligence.
NinthDecimal is seeing a 3x speedup for Spark workloads
on Amazon EMR and a 3-5x cost reduction. This means
better SLAs for delivering insights to clients and
an improved bottom line for the business.
IMVU is the world’s largest avatar-based social network
serving 6M+ players and 40M+ virtual goods
IMVU has migrated 450+ Spark & Hive jobs and
re-architected its monolithic Hadoop environment into
transient Amazon EMR clusters orchestrated with
Airflow pipelines.
By moving to AWS and Amazon EMR, IMVU saved 30% of
costs and became 80% more efficient in data
engineering and analytics.
57%
reduction in cost of ownership
342%
five-year ROI
8 months
to breakeven
99%
reduction in unplanned downtime
33%
more efficient Big Data teams
46%
more efficient Big Data/Hadoop management staff
Referenced IDC White Paper: "The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR"
IDC study: Hadoop to Amazon EMR migration
Amazon EMR Migration patterns
and Best Practices Overview
Amazon EMR Migration Patterns
On-Premises -> Lift & Shift -> Instance Right-Sizing -> S3 vs. HDFS -> Transient clusters
● Lift & Shift
a. Low Risk & Lowest migration cost
b. Very high ongoing cost
c. Low business value addition
d. quickest time to market
● Re-Architect - Migrate to Amazon EMR with a new architecture with
complementary services to optimize the cost and to provide additional
functionality, scalability, flexibility etc.
a. Medium risk, Medium Migration cost
b. Medium ongoing cost
c. High business value addition
d. Medium time to market
● Next Gen Architecture - Migrate to Amazon EMR with a completely new
architecture which may include Streaming, Containers with added
functionality, scalability, flexibility etc.
a. High risk, Highest Migration Cost
b. Lowest ongoing cost
c. Highest business value addition
d. Longest time to market
An approach to best practice deployment
Go beyond a lift & shift to optimize for scale and cost.
On-Premises -> Lift & Shift -> Instance Right-Sizing -> Amazon S3 vs. HDFS -> Transient clusters -> Auto-scaling -> Spot Pricing -> Automated Orchestration -> Amazon EMR Optimized
True TCO comparison
Business factors:
Capex->Opex
On-prem license fees
Maintenance Overhead
Uncertainty in Hadoop
Vendors
Lowest pricing compared to
other premium Hadoop/Spark
vendors
Amazon EMR Value Add:
Decoupled Storage & Compute
Transient clusters
Spot pricing
Autoscaling
Optimised hardware
Amazon S3 lifecycle
Proprietary Spark Amazon
EMR engine
Next Gen Architecture Value Add:
Data Pipelines optimization
Streaming processing
Serverless ETL
Serverless ad-hoc queries
Serverless Data Catalog
Workloads decomposition
(Amazon EMR, Amazon Redshift,
Athena, SageMaker)
10-20% Cost Reduction + 10-40% Reduction + 20-90% Reduction
Overview of Cost Optimization Factors
Migration Risk Mitigation Strategies
On-Premises -> Lift & Shift -> Instance Right-Sizing -> S3 vs. HDFS -> Transient clusters -> Auto-scaling -> Spot Pricing -> Automated Orchestration -> EMR Optimized
True TCO comparison
● Analyze all application and workloads to ascertain
compute, memory, storage, run time of day/week/month
and any other infrastructure needs
● Develop a Business Value and Implementation
Complexity Model for all applications and workloads,
and plot them on a business value vs. complexity Prioritization Matrix
● Mirror data loads from the on-prem Hadoop cluster onto the Amazon EMR
cluster in an organized manner
● Start moving workloads onto Amazon EMR in an orderly
fashion.
● Identify excited innovators within each business unit to
promote and spread on-prem to Amazon EMR migration
● Work with experts like Provectus to lead this effort.
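The prioritization step above can be sketched in a few lines of Python. This is only an illustration: the 1-5 scoring scale, the example scores, and the ranking rule (complexity minus business value) are assumptions, not part of the webinar material.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    business_value: int  # hypothetical scale: 1 (low) to 5 (high)
    complexity: int      # hypothetical scale: 1 (simple) to 5 (complex)


def migration_order(workloads):
    """Rank workloads so that high-value, low-complexity ones come first."""
    return sorted(workloads, key=lambda w: (w.complexity - w.business_value, w.name))


# Illustrative scores only; real scores would come from the workload analysis.
workloads = [
    Workload("A", business_value=5, complexity=1),
    Workload("B", business_value=4, complexity=4),
    Workload("C", business_value=2, complexity=5),
    Workload("D", business_value=3, complexity=2),
]

order = [w.name for w in migration_order(workloads)]
print(order)
```

Workloads at the front of the list are candidates for the first migration wave; a real model would likely weigh additional factors such as data volume and downstream dependencies.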
Figure: Prioritization Matrix of workloads A-G, Business Value vs. Complexity, marking the initial workloads to migrate
1. Build a business case of Amazon EMR Migration including comparative cost
analysis
2. Develop a risk mitigation plan
3. Design Next-Gen Data Platform and its adoption roadmap
4. Execute the migration and re-architecture hands-on
How Provectus can help
Cost and other challenges of On-Prem
Hadoop/Spark Environments
Compute and storage growth
Tightly coupled
● Storage grows along with
compute
● Compute requirements vary
3x
● Data is replicated several times
● Typically only in one data center
Underutilized or scarce resources
Figure: resource utilization chart (x-axis 1-26, y-axis 0-120) with regions labeled Steady state, Weekly peaks, and Re-processing
Contention for the same resources
Compute bound
Memory bound
With a monolithic cluster, dependencies among downstream applications can make it
impossible to upgrade versions. By not upgrading, organizations could be limiting innovation.
● Large Scale Transformation: Map/Reduce, Hive, Pig, Spark
● Interactive Queries: Impala, Spark SQL, Presto
● Machine Learning: Spark ML, MxNet, Tensorflow
● Interactive Notebooks: Jupyter, Zeppelin
● NoSQL: HBase
Limited ability to fast-follow application versions
Cost Optimization using Amazon EMR
Amazon EMR Benefits
Amazon S3 is your persistent storage: 99.999999999% durability, low cost and
many varieties, lifecycle policies, versioning, distributed by default, and EMRFS
Decouple storage and compute
Turn off the cluster
Auto-scaling | Persistent & transient clusters
Logical separation of jobs/applications
Re-architect Monolithic to Purpose-built
clusters by:
• Creating Transient and/or Persistent clusters
• Separating clusters by Application
• Separating clusters by Application Version
• Isolating Department specific clusters
Design considerations include:
• How you submit jobs or build pipelines
• Persisting your data in Amazon S3
• Storing metadata off the cluster
• How long the job runs
• What applications are needed
Purpose-built Clusters
Traditional Monolithic Cluster
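The design considerations above (persist data in Amazon S3, submit jobs as pipeline steps, shut the cluster down when the job ends) can be sketched as a request for a transient cluster. The helper below only builds the keyword arguments that boto3's EMR `run_job_flow` call accepts; the bucket names, job script path, node counts, and instance type are hypothetical placeholders, and the IAM role names assume the EMR default roles exist in the account.

```python
def transient_cluster_request(name, release_label, log_uri, steps,
                              core_nodes=2, spot_task_nodes=4,
                              instance_type="m5.xlarge"):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    def group(role, market, count):
        return {
            "Name": f"{role.lower()}-{market.lower()}",
            "InstanceRole": role,    # MASTER, CORE, or TASK
            "Market": market,        # ON_DEMAND or SPOT
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                group("MASTER", "ON_DEMAND", 1),
                group("CORE", "ON_DEMAND", core_nodes),
                # Task nodes hold no HDFS data, so they are the usual Spot candidates.
                group("TASK", "SPOT", spot_task_nodes),
            ],
            # Transient cluster: terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


# Hypothetical Spark job whose script lives in S3.
spark_step = {
    "Name": "nightly-etl",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
    },
}

request = transient_cluster_request(
    name="nightly-etl-transient",
    release_label="emr-5.28.0",
    log_uri="s3://example-bucket/emr-logs/",
    steps=[spark_step],
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With boto3 installed and AWS credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`. Because input, output, and logs all live in Amazon S3, nothing is lost when the cluster terminates.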
Built-in disaster recovery
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Availability Zone
Parallelization on Spot can drastically reduce time-to-insight and cost.
Example 1: Baseline example of using RI
10 node cluster running for 14 hours
Cost = $1.0 * 10 nodes * 14 hours = $140
Example 2: Scale more nodes with Spot
Add 10 more nodes of Spot at 50% discount
20 node cluster running for 7 hours
Cost = $1.0 * 10 nodes * 7 hours = $70
     + $0.5 * 10 nodes * 7 hours = $35
Total = $105
Auto-scale nodes with Spot instances
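The arithmetic in the two examples above can be reproduced with a short Python sketch. The prices and node counts come from the slide; the function name is made up for illustration, and the halved runtime assumes the job parallelizes perfectly across the doubled node count.

```python
def cluster_cost(node_groups, hours):
    """Total cost of node groups, each (hourly_price, node_count), run for `hours`."""
    return sum(price * count * hours for price, count in node_groups)


# Example 1: 10 baseline nodes at $1.00 per node-hour for 14 hours.
baseline = cluster_cost([(1.0, 10)], hours=14)

# Example 2: add 10 Spot nodes at a 50% discount; the 20-node cluster
# is assumed to finish the same job in half the time (7 hours).
with_spot = cluster_cost([(1.0, 10), (0.5, 10)], hours=7)

savings = 1 - with_spot / baseline
print(f"baseline=${baseline:.0f}, with Spot=${with_spot:.0f}, savings={savings:.0%}")
```

Under these assumptions the job finishes in half the time and costs 25% less. Real savings depend on how well the workload scales and on the Spot price at the time.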
● The EMR Runtime for Apache Spark available in Amazon EMR
v5.28 realized Spark improvements of up to 32x against the
TPC-DS 3TB dataset in comparison to Amazon EMR v5.16
(reference)
● The Amazon EMR Runtime for Apache Spark maintains API
compatibility with OSS Spark
● More coming every release
Spark performance improvements
Analysts confirm lowest TCO
Feb. 2019, Forrester recognizes
Amazon EMR as the Cloud
Hadoop/Spark (HARK) Leader.
Nov. 2018, IDC report confirms:
“EMR provides 57% reduced costs
vs. on premise resulting in 342%
ROI over 5 years.”
Dec. 2018, Gartner suggests:
“AWS remains the largest
Hadoop provider in terms of
both revenue and user base.”
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and
Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester
Wave™ is a graphical representation of Forrester's call on a market and is
plotted using a detailed spreadsheet with exposed scores, weightings, and
comments. Forrester does not endorse any vendor, product, or service
depicted in the Forrester Wave™. Information is based on best available
resources. Opinions reflect judgment at the time and are subject to change.
Benefits Summary
1. Decoupled compute & storage
2. Built-in disaster recovery
3. Turn off your clusters after use
4. Agility of auto-scaling of the clusters
5. Leverage Spot pricing for unused Amazon EC2 capacity
6. Self-service with AWS Service Catalog
7. Spark performance improvements
8. Fully managed Amazon EMR Notebooks
9. Centralized assets and data pipeline orchestration
10. Lowest TCO in the Industry, analysts confirm
11. Amazon EMR is surrounded by the industry’s broadest
analytics ecosystem
The Next-Gen Ecosystem
that Supports You
Serverless analytics
Amazon S3
Data lake
AWS Glue
(ETL &
Data Catalog)
Athena
QuickSight
Serverless. Zero
infrastructure. Zero
administration
$
Never pay for
idle resources
Availability and
fault tolerance
built in
Automatically
scales resources
with usage
AWS IoT
AI/ML
Devices Web
Sensors
Social
AWS Glue
Data Catalog
ETL Job
authoring
Discover data and
extract schema
Auto-generates
customizable ETL code
in Python, Scala, and
Spark
Data Catalog
• Glue crawlers automatically discover data and
store schema
• Catalog makes data searchable, and available
for ETL and queries
• Computes statistics to make queries efficient
Serverless ETL & Data Catalog
ETL
• Generates customizable code for common file
type conversion and partitioning
• Schedules and runs your ETL jobs
• Serverless, flexible, and built on open standards
Amazon Athena
Zero setup cost; just point to
Amazon S3 and start querying
ANSI SQL interface,
JDBC/ODBC drivers,
multiple formats,
compression types, and
complex joins and data
types
Serverless: zero
infrastructure, zero
administration
Integrated with QuickSight
Pay only for queries run;
save 30–90% on per-query
costs through compression
Query Instantly | Open | Easy | Pay per query
Serverless Interactive Query engine
• Interactive query service to analyze data in Amazon S3 using standard SQL
• No infrastructure to set up or manage and no data to load
• Ability to run SQL queries on data archived in Amazon S3 Glacier
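The "save on per-query costs through compression" point follows from Athena billing by the amount of data a query scans. Here is a minimal sketch of that relationship; the $5 per TB price and the compression ratio are assumptions for illustration, so check current regional pricing before relying on the numbers.

```python
def athena_query_cost(scanned_tb, price_per_tb=5.0):
    """Cost of one query, billed on terabytes scanned (price is an assumption)."""
    return scanned_tb * price_per_tb


def compression_savings(raw_tb, compression_ratio, price_per_tb=5.0):
    """Fraction of per-query cost saved when data is compressed by `compression_ratio`."""
    raw_cost = athena_query_cost(raw_tb, price_per_tb)
    compressed_cost = athena_query_cost(raw_tb / compression_ratio, price_per_tb)
    return 1 - compressed_cost / raw_cost


# Hypothetical query scanning 2 TB of raw data that compresses 4:1.
raw_cost = athena_query_cost(2.0)
saved = compression_savings(raw_tb=2.0, compression_ratio=4.0)
print(f"raw cost=${raw_cost:.2f}, savings from compression={saved:.0%}")
```

Columnar formats and partitioning reduce scanned bytes further, because a query then reads only the columns and partitions it needs.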
SQL
90% of your
Hadoop Costs
Hadoop Common Pipeline Pattern 1
90% of your
Hadoop Costs
Hadoop Common Pipeline Pattern 2
2-3x of cost
reduction
From Big Data to Fast Data
125 University Avenue
Suite 290, Palo Alto
California, 94301
provectus.com
Questions, details?
We would be happy to answer!

More Related Content

What's hot

Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Amazon Web Services
 
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Amazon Web Services
 
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF LoftData Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
Amazon Web Services
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Alluxio, Inc.
 
Getting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDBGetting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDB
Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
Amazon Web Services
 
What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018
What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018
What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018
Amazon Web Services
 
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Amazon Web Services
 
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data WorkloadsWorkload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Vasu S
 
Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...
Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...
Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...
Amazon Web Services
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
Amazon Web Services
 
Builders Day' - Databases on AWS: The Right Tool for The Right Job
Builders Day' - Databases on AWS: The Right Tool for The Right JobBuilders Day' - Databases on AWS: The Right Tool for The Right Job
Builders Day' - Databases on AWS: The Right Tool for The Right Job
Amazon Web Services LATAM
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
Amazon Web Services
 
BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2
Paulraj Pappaiah
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
Amazon Web Services
 
New Database Migration Services & RDS Updates
New Database Migration Services & RDS UpdatesNew Database Migration Services & RDS Updates
New Database Migration Services & RDS Updates
Amazon Web Services
 
What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018
What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018
What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018
Amazon Web Services
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
Amazon Web Services
 

What's hot (20)

Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
Accelerate Database Development and Testing with Amazon Aurora (DAT313) - AWS...
 
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
 
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF LoftData Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
Data Warehousing with Amazon Redshift: Data Analytics Week at the SF Loft
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using AlluxioBursting on-premise analytic workloads to Amazon EMR using Alluxio
Bursting on-premise analytic workloads to Amazon EMR using Alluxio
 
Getting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDBGetting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDB
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018
What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018
What's New in Amazon Relational Database Service (DAT203) - AWS re:Invent 2018
 
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
Choosing the Right Database for the Job: Relational, Cache, or NoSQL?
 
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data WorkloadsWorkload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
Workload-Aware: Auto-Scaling A new paradigm for Big Data Workloads
 
Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...
Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...
Best Practices for Running Oracle Databases on Amazon RDS (DAT317) - AWS re:I...
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
 
Builders Day' - Databases on AWS: The Right Tool for The Right Job
Builders Day' - Databases on AWS: The Right Tool for The Right JobBuilders Day' - Databases on AWS: The Right Tool for The Right Job
Builders Day' - Databases on AWS: The Right Tool for The Right Job
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
New Database Migration Services & RDS Updates
New Database Migration Services & RDS UpdatesNew Database Migration Services & RDS Updates
New Database Migration Services & RDS Updates
 
What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018
What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018
What's New in Amazon Aurora (DAT204-R1) - AWS re:Invent 2018
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 

Similar to Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR

Effective Cost Management for Amazon EMR
Effective Cost Management for Amazon EMREffective Cost Management for Amazon EMR
Effective Cost Management for Amazon EMR
DevOps.com
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Amazon Web Services
 
AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...
AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...
AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...
AWS Germany
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Yahoo Developer Network
 
Deep dive session - sap and aws - extend and innovate
Deep dive session - sap and aws - extend and innovateDeep dive session - sap and aws - extend and innovate
Deep dive session - sap and aws - extend and innovate
Ritesh Toshniwal
 
Big dataandhp cforawsbrasilsummit
Big dataandhp cforawsbrasilsummitBig dataandhp cforawsbrasilsummit
Big dataandhp cforawsbrasilsummit
Amazon Web Services LATAM
 
Simplify Your Migration to AWS and Cut Costs by 30% with TSO Logic
 Simplify Your Migration to AWS and Cut Costs by 30% with TSO Logic Simplify Your Migration to AWS and Cut Costs by 30% with TSO Logic
Simplify Your Migration to AWS and Cut Costs by 30% with TSO Logic
Amazon Web Services
 
Amazon Web Services - The New Normal
Amazon Web Services - The New NormalAmazon Web Services - The New Normal
Amazon Web Services - The New Normal
Innovation Strategies
 
Sap on aws webinar on reducing tco 07092017
Sap on aws  webinar on reducing tco 07092017Sap on aws  webinar on reducing tco 07092017
Sap on aws webinar on reducing tco 07092017
Krishnan K ☁
 
Blogthetech why are companies investing billions in sap implementation
Blogthetech why are companies investing billions in sap implementationBlogthetech why are companies investing billions in sap implementation
Blogthetech why are companies investing billions in sap implementation
HarryJake1
 
Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS Cloud
Amazon Web Services
 
AWS webinar what is cloud computing 13 09 11
AWS webinar what is cloud computing 13 09 11AWS webinar what is cloud computing 13 09 11
AWS webinar what is cloud computing 13 09 11
Amazon Web Services
 
Cloud Economics: Transform Businesses at Lower Costs - AWS Summit Bahrain 2017
Cloud Economics: Transform Businesses at Lower Costs - AWS Summit Bahrain 2017Cloud Economics: Transform Businesses at Lower Costs - AWS Summit Bahrain 2017
Cloud Economics: Transform Businesses at Lower Costs - AWS Summit Bahrain 2017
Amazon Web Services
 
SAP on AWS | Scottsdale, AZ
SAP on AWS | Scottsdale, AZSAP on AWS | Scottsdale, AZ
SAP on AWS | Scottsdale, AZ
Amazon Web Services
 
Aws what is cloud computing deck 08 14 13
Aws what is cloud computing deck 08 14 13Aws what is cloud computing deck 08 14 13
Aws what is cloud computing deck 08 14 13
Amazon Web Services
 
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Web Services
 
Hong Kong AWS Summit 2017 - Keynote
Hong Kong AWS Summit 2017 - KeynoteHong Kong AWS Summit 2017 - Keynote
Hong Kong AWS Summit 2017 - Keynote
Amazon Web Services
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
Amazon Web Services
 
Auckland Summit Keynote
Auckland Summit KeynoteAuckland Summit Keynote
Auckland Summit Keynote
Amazon Web Services
 
AWS Summit 2013 | India - Running Enterprise Applications like SAP, Oracle an...
AWS Summit 2013 | India - Running Enterprise Applications like SAP, Oracle an...AWS Summit 2013 | India - Running Enterprise Applications like SAP, Oracle an...
AWS Summit 2013 | India - Running Enterprise Applications like SAP, Oracle an...
Amazon Web Services
 

Similar to Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR (20)

Effective Cost Management for Amazon EMR
Effective Cost Management for Amazon EMREffective Cost Management for Amazon EMR
Effective Cost Management for Amazon EMR
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...
AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...
AWS Summit Berlin 2013 - Realtech - How to Determine the Economic Value of SA...
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMR

  • 1. Cost Optimization for Hadoop/Spark Workloads with Amazon EMR Presented by: June 2, 2020 Pritpal Sahota Technical Account Manager Provectus Stepan Pushkarev Chief Technical Officer Provectus Nirav Shah Senior Solution Architect Amazon Web Services Perry Peterson Business Development Manager Amazon Web Services
  • 2. 1. Provide significant value on how to optimize cost by migrating to Amazon EMR 2. Risk mitigation and best practices for migrating Hadoop/Spark workloads to Amazon EMR Webinar Objectives
  • 3. • Introduction • Hadoop market and Cost optimizations using Amazon EMR • Cost related and other challenges of on-prem Hadoop clusters • Cost optimizations by using Amazon EMR and migration best practices • Amazon EMR migration acceleration workshop overview Agenda
  • 4. Stepan Pushkarev Chief Technology Officer Provectus Pritpal Sahota Technical Account Executive Provectus Presenters Nirav Shah Senior Solutions Architect Amazon Web Services Perry Peterson Business Development Manager – Analytics Amazon Web Services
  • 5. AWS Partner Network (APN) Premier Consulting Partner AI-first Consultancy & Solutions Provider Clients ranging from fast-growing startups through large enterprises 450 employees and growing Established in 2010 HQ in Palo Alto Offices across the US, Canada, and Europe
  • 6. Machine Learning Employ analytical algorithms to unveil hidden value from raw data that helps solve business challenges DevOps/DevSecOps Improve development and delivery pipelines to bring your product to the market faster and resiliently Next Gen Cloud Modernize your application and data landscape to allow for more agility and better service to your customers Big Data Gain data-driven insights through the holistic data analysis made available with a big data platform AWS Competencies in Machine Learning, Data & Analytics, and DevOps Core Competencies
  • 7. Innovative Tech Vendors Seeking niche expertise to differentiate and win the market Enterprises Seeking to accelerate innovation and achieve operational excellence Clientele
  • 8. Hadoop Market and Cost Optimization using Amazon EMR
  • 9. Rapid growth of cloud adoption in big data space 7.5x faster than on-prem installs as per Forrester Research Uncertainty with leading Hadoop commercial vendors Leading commercial Hadoop vendors face uncertainty & headwinds. Customers are exploring cloud to leverage cost benefits, flexibility, scalability, & performance per price Large & growing Hadoop market According to market study report, over the next five years the Hadoop market will register a 33% annual revenue growth with market size reaching $9.4B by 2024 Availability of Resources Big data engineers prefer to work on cloud based big data solutions Hadoop market
  • 10. Amazon EMR is an enterprise-grade Spark/Hadoop managed service helping businesses, researchers, data analysts, and developers to process and analyze vast amounts of data. EMR solves complex technical/business challenges: clickstream and log analysis along with real-time and predictive analytics. In comparison to on-premises deployments, IDC confirms Amazon EMR provides year 1 savings of 57% and 342% ROI over 5 years. What is EMR & where is it in the Analytics stack?
  • 11. EMR powers most cloud Hadoop/Spark projects © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 12. processes 135B events/day and have cost savings of 60% (~$20M) decreased costs by $600k in less than 5 months saves 75% and is 60% more efficient achieves costs savings of 55% when compared to on-demand pricing and 40% savings when compared to Reserved Instances High-impact results with Amazon EMR
  • 13. near real-time analytics for 140M players scales 3,000 transient clusters on a daily basis powers the Predix solution processing 1M data executions/day computes Zestimates on 100M +homes in hours instead of 1 day reduced cost of operation and improved Spark performance 3x High-impact results with Amazon EMR
  • 14. NinthDecimal is the omnichannel marketing platform helping Fortune 500 brands identify new prospects and customers, drive store visits, and increase sales using AI- and data-driven consumer intelligence. NinthDecimal is seeing a 3x speedup for Spark workloads on Amazon EMR and a 3-5x cost reduction. This means better SLAs for delivering insights to clients and an improved bottom line for the business.
  • 15. IMVU is the world’s largest avatar-based social network, serving 6M+ players and 40M+ virtual goods. IMVU has migrated 450+ Spark & Hive jobs and re-architected a monolithic Hadoop environment into transient Amazon EMR clusters orchestrated with Airflow pipelines. By moving to AWS and Amazon EMR, IMVU saved 30% of costs and became 80% more efficient in data engineering and analytics.
  • 16. 57% reduction in cost of ownership 342% five-year ROI 8 months to breakeven 99% reduction in unplanned downtime 33% more efficient Big Data teams 46% more efficient Big Data/Hadoop management staff Referenced IDC White Paper: "The Economic Benefits of Migrating Apache Spark and Hadoop to Amazon EMR" IDC study: Hadoop to Amazon EMR migration
  • 17. Amazon EMR Migration patterns and Best Practices Overview
  • 18. Amazon EMR Migration Patterns (On-Premises, Lift & Shift, Instance Right-Sizing, S3 vs. HDFS, Transient clusters) ● Lift & Shift a. Low risk and lowest migration cost b. Very high ongoing cost c. Low business value addition d. Quickest time to market ● Re-Architect - Migrate to Amazon EMR with a new architecture and complementary services to optimize cost and provide additional functionality, scalability, flexibility, etc. a. Medium risk, medium migration cost b. Medium ongoing cost c. High business value addition d. Medium time to market ● Next Gen Architecture - Migrate to Amazon EMR with a completely new architecture, which may include streaming and containers, with added functionality, scalability, flexibility, etc. a. High risk, highest migration cost b. Lowest ongoing cost c. Highest business value addition d. Longest time to market
  • 19. An approach to best practice deployment Go beyond a lift & shift to optimize for scale and cost. On-Premises Lift & Shift Instance Right-Sizing Amazon S3 vs. HDFS Transient clusters Auto-scaling Spot Pricing Automated Orchestration Amazon EMR Optimized True TCO comparison
  • 20. Business factors: Capex->Opex On-prem license fees Maintenance Overhead Uncertainty in Hadoop Vendors Lowest pricing compared to other premium Hadoop/Spark vendors Amazon EMR Value Add: Decoupled Storage & Compute Transient clusters Spot pricing Autoscaling Optimized hardware Amazon S3 lifecycle Proprietary Spark Amazon EMR engine Next Gen Architecture Value Add: Data Pipelines optimization Streaming processing Serverless ETL Serverless ad-hoc queries Serverless Data Catalog Workloads decomposition (Amazon EMR, Amazon Redshift, Athena, SageMaker) 10-20% Cost Reduction + 10-40% Reduction + 20-90% Reduction Overview of Cost Optimization Factors
  • 21. Migration Risk Mitigation Strategies ● Analyze all applications and workloads to ascertain compute, memory, storage, run time of day/week/month, and any other infrastructure needs ● Develop a Business Value and Implementation Complexity Model for all applications and workloads, and plot them on a business value vs. complexity prioritization matrix ● Mirror data loads from the on-prem Hadoop cluster onto the Amazon EMR cluster in an organized way ● Start moving workloads onto Amazon EMR in an orderly fashion ● Identify excited innovators within each business unit to promote and spread the on-prem to Amazon EMR migration ● Work with experts like Provectus to lead this effort. [Figure: prioritization matrix plotting workloads A-G by complexity and business value, highlighting the initial workloads to migrate]
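The prioritization matrix on this slide can be sketched in a few lines of Python. This is only an illustration: the 1-10 scores for workloads A-G, the `Workload` class, and the "value minus complexity" ranking rule are all assumptions made for the example, since the slide does not give the actual positions or a scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    business_value: int  # hypothetical scale: 1 (low) to 10 (high)
    complexity: int      # hypothetical scale: 1 (simple) to 10 (hard)


def migration_order(workloads):
    """Rank workloads so high-value, low-complexity ones come first.

    Assumed rule: sort by (business_value - complexity) descending,
    breaking ties in favor of the less complex workload.
    """
    return sorted(
        workloads,
        key=lambda w: (-(w.business_value - w.complexity), w.complexity),
    )


# Hypothetical scores; the slide only shows letters on a chart.
workloads = [
    Workload("A", 9, 2),
    Workload("B", 7, 3),
    Workload("C", 8, 8),
    Workload("D", 4, 2),
    Workload("E", 5, 7),
    Workload("F", 3, 9),
    Workload("G", 6, 5),
]

ordered = migration_order(workloads)
first_wave = [w.name for w in ordered[:3]]
print("Suggested migration order:", [w.name for w in ordered])
print("Initial workloads to migrate:", first_wave)
```

In practice the scores would come from the workload analysis described in the first bullet, and the cut-off for the first wave is a judgment call rather than a fixed number.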
  • 22. 1. Build a business case for Amazon EMR migration, including comparative cost analysis 2. Develop a risk mitigation plan 3. Design a Next-Gen Data Platform and its adoption roadmap 4. Execute the migration and re-architecture hands-on How Provectus can help
  • 23. Cost and other challenges of On-Prem Hadoop/Spark Environments
  • 24. Compute and storage growth Tightly coupled ● Storage grows along with compute ● Compute requirements vary 3x ● Data is replicated several times ● Typically in only one data center
  • 25. Underutilized or scarce resources [Chart: resource usage over time, with regions labeled Steady state, Weekly peaks, and Re-processing]
  • 26. Contention for the same resources Compute bound Memory bound
  • 27. With a monolithic cluster, dependencies of downstream applications may limit the ability to upgrade versions. By not upgrading, organizations could be limiting innovation. ● Large Scale Transformation: Map/Reduce, Hive, Pig, Spark ● Interactive Queries: Impala, Spark SQL, Presto ● Machine Learning: Spark ML, MXNet, TensorFlow ● Interactive Notebooks: Jupyter, Zeppelin ● NoSQL: HBase Limited on fast following app versions
  • 29. Amazon EMR Benefits Amazon S3 is your persistent storage: 99.999999999% (11 nines) durability, low cost and many varieties, lifecycle policies, versioning, distributed by default, and EMRFS Decouple storage and compute Turn off the cluster Auto-scaling | Persistent & transient clusters
  • 30. Logical separation of jobs/applications Re-architect from Monolithic to Purpose-built clusters by: • Creating Transient and/or Persistent clusters • Separating clusters by Application • Separating clusters by Application Version • Isolating Department-specific clusters Design considerations are given to: • How you submit jobs or build pipelines • Persisting your data in Amazon S3 • Storing metadata off the cluster • How long the job runs • What applications are needed Purpose-built Clusters Traditional Monolithic Cluster
  • 31. Built-in disaster recovery Cluster 1 Cluster 2 Cluster 3 Cluster 4 Availability Zone
  • 32. Parallelization on Spot can drastically reduce time-to-insight and cost. Example 1: Baseline example of using RI. 10 node cluster running for 14 hours. Cost = $1.0 * 10 nodes * 14 hours = $140. Example 2: Scale to more nodes with Spot. Add 10 more nodes of Spot at a 50% discount. 20 node cluster running for 7 hours. Cost = ($1.0 * 10 nodes * 7 hours = $70) + ($0.5 * 10 nodes * 7 hours = $35) = $105 total. Auto-scale nodes with Spot instances
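The arithmetic on this slide can be checked with a short sketch. The $1.00 hourly rate, the 50% Spot discount, the node counts, and the run times come from the slide; the function name is illustrative, and the example assumes (as the slide does) that doubling the nodes halves the run time.

```python
def cluster_cost(nodes, hours, hourly_rate):
    """Cost of running `nodes` instances for `hours` at a flat hourly rate."""
    return nodes * hours * hourly_rate


BASE_RATE = 1.0      # $ per node-hour, from the slide
SPOT_DISCOUNT = 0.5  # 50% Spot discount, from the slide

# Example 1: 10 nodes at the base rate for 14 hours.
baseline = cluster_cost(10, 14, BASE_RATE)

# Example 2: 10 base-rate nodes plus 10 Spot nodes, all running for 7 hours.
spot_rate = BASE_RATE * (1 - SPOT_DISCOUNT)
scaled = cluster_cost(10, 7, BASE_RATE) + cluster_cost(10, 7, spot_rate)

savings = 1 - scaled / baseline
print(f"baseline=${baseline:.0f}, with Spot=${scaled:.0f}, savings={savings:.0%}")
```

Under these numbers the job finishes in half the time and costs 25% less, which is the point the slide makes about parallelizing on Spot.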
  • 33. ● The EMR Runtime for Apache Spark, available in Amazon EMR v5.28, realized Spark improvements of up to 32x on the TPC-DS 3 TB dataset in comparison to Amazon EMR v5.16 (reference) ● The Amazon EMR Runtime for Apache Spark maintains API compatibility with OSS Spark ● More coming every release Spark performance improvements
  • 34. Analysts confirm lowest TCO Feb. 2019, Forrester recognizes Amazon EMR as the Cloud Hadoop/Spark (HARK) Leader. Nov. 2018, IDC report confirms: “EMR provides 57% reduced costs vs. on premise resulting in 342% ROI over 5 years.” Dec. 2018, Gartner suggests: “AWS remains the largest Hadoop provider in terms of both revenue and user base.” The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave™. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
  • 35. Benefits Summary 1. Decoupled compute & storage 2. Built-in disaster recovery 3. Turn off your clusters after use 4. Agility of auto-scaling of the clusters 5. Leverage Spot pricing for unused Amazon EC2 capacity 6. Self-service with AWS Service Catalog 7. Spark performance improvements 8. Fully managed Amazon EMR Notebooks 9. Centralized assets and data pipeline orchestration 10. Lowest TCO in the Industry, analysts confirm 11. Amazon EMR is surrounded by the industry’s broadest analytics ecosystem
  • 37. Serverless analytics Amazon S3 Data lake AWS Glue (ETL & Data Catalog) Athena QuickSight Serverless. Zero infrastructure. Zero administration $ Never pay for idle resources Availability and fault tolerance built in Automatically scales resources with usage AWS IoT AI/ML Devices Web Sensors Social
  • 38. AWS Glue Data Catalog ETL Job authoring Discover data and extract schema Auto-generates customizable ETL code in Python, Scala, and Spark Data Catalog • Glue crawlers automatically discover data and store schema • Catalog makes data searchable and available for ETL and queries • Computes statistics to make queries efficient Serverless ETL & Data Catalog ETL • Generates customizable code for common file type conversion and partitioning • Schedules and runs your ETL jobs • Serverless, flexible, and built on open standards
  • 39. Amazon Athena Zero setup cost; just point to Amazon S3 and start querying ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Serverless: zero infrastructure, zero administration Integrated with QuickSight Pay only for queries run; save 30–90% on per-query costs through compression Query Instantly, Open, Easy, Pay per query Serverless Interactive Query engine • Interactive query service to analyze data in Amazon S3 using standard SQL • No infrastructure to set up or manage and no data to load • Ability to run SQL queries on data archived in Amazon S3 Glacier SQL
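The per-query savings from compression can be illustrated with a rough sketch. Athena bills by the amount of data a query scans, so shrinking the data shrinks the bill. The price of $5.00 per TB scanned, the 4:1 compression ratio, and both function names are assumptions for illustration only; check current Athena pricing and your own compression ratios before relying on the numbers.

```python
def query_cost(scanned_tb, price_per_tb=5.0):
    """Cost of one query; price_per_tb is an assumed, illustrative rate."""
    return scanned_tb * price_per_tb


def compression_savings(raw_tb, compression_ratio, price_per_tb=5.0):
    """Fraction of per-query cost saved when the scanned data is compressed.

    compression_ratio is raw size divided by compressed size (hypothetical input).
    """
    before = query_cost(raw_tb, price_per_tb)
    after = query_cost(raw_tb / compression_ratio, price_per_tb)
    return 1 - after / before


# A query scanning 1 TB of raw data, assuming 4:1 compression.
saved = compression_savings(raw_tb=1.0, compression_ratio=4.0)
print(f"per-query savings: {saved:.0%}")
```

A 4:1 ratio lands inside the 30–90% range quoted on the slide; the savings fraction depends only on the compression ratio, not on the assumed price.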
  • 40. 90% of your Hadoop Costs Hadoop Common Pipeline Pattern 1
  • 41. 90% of your Hadoop Costs Hadoop Common Pipeline Pattern 2
  • 42. 2-3x of cost reduction From Big Data to Fast Data
  • 44. 125 University Avenue Suite 290, Palo Alto California, 94301 provectus.com Questions, details? We would be happy to answer!