About HiFX
Established in 2001, HiFX is an Amazon Web Services Consulting Partner. We have been designing and migrating workloads in the AWS cloud since 2010 and helping organizations become truly data driven by building big data solutions since 2015.
Case Study with Malayala Manorama
Malayala Manorama is one of the largest media conglomerates in India. They run manoramaonline.com, the largest news portal for Malayalees around the world, and several other digital media properties.
In 2016, Manorama embarked on a project to develop an in-house analytics pipeline that could unify enormous amounts of raw data from multiple web domains and convert it into meaningful insights. The company currently has 10 domains, such as its matrimonial and real estate sites, with plans to further expand its digital footprint.
HiFX has been Malayala Manorama's technology partner for more than 18 years and was approached to design this new data analytics pipeline.
Manorama's digital properties include:
Manorama Online
Manorama News
The Week
Vanitha
Watchtime India
E-paper / E-magazine
Chuttuvattom
OnManorama
M4Marry
HelloAddress
Quickerala
Qkdoc
Entedeal
Manorama Horizon
Android
iOS
Manorama MAX
The Challenges
01 Lack of agility and accessibility for data analysis that would help the product team make smart business decisions and improve strategies.
02 Increasing volume and velocity of data. With new digital properties being added, the collection and storage layers needed to be designed to scale well.
03 Dozens of independently managed collections of data, leading to data silos. Having no single source of truth made it difficult to identify what type of data was available, to get access to it and to integrate it.
04 Poorly recorded data. Often, the meaning and granularity of the data was lost in processing.
About Lens
Vision: "Lens is a unified data platform with a consolidated solution stack to generate meaningful real-time insights and drive revenue."
• Better product decisions based on behavioral insights
• Add value to our businesses
• Increase CLV
• Deeply understand every user's journey
• Immediate actions, smart targeting and marketing automation
• Positively impact KPIs
Components
01 UNIFIED DATA PIPELINE: Connecting dozens of data streams and repositories to a unified data pipeline, enabling near real-time access to any data source.
02 WELL GOVERNED DATA LAKE: A well governed data lake architected to store raw and enriched data, thereby eliminating storage silos.
03 DATA PROCESSING FRAMEWORK: A data processing framework supporting both streaming and batch workloads to aid analytics and machine learning, along with smart workflow management.
04 BIG DATA STORES FOR OLAP: Well designed big data stores for reporting and exploratory analysis.
05 RECOMMENDATIONS ENGINE: A recommendations and personalization engine powered by machine learning.
06 SMART DASHBOARDS: Dynamic dashboards and smart visualizations that make data tell stories and drive insights.
Solution Stack
01 STREAMING ANALYTICS: Watch attention shift in real time. Updates every few seconds to quickly capitalize on attention to every post, campaign and section.
02 BATCH ANALYTICS: Historical view of unique attention metrics to understand what happened in the past and use it to plan for the future.
03 FB IA AND GOOGLE AMP INTEGRATIONS: Integrations with Google Accelerated Mobile Pages (AMP) and Facebook Instant Articles.
04 VIDEO ANALYTICS: Track key metrics: visits, plays, dropouts and minutes watched.
05 CONTENT PERSONALIZATION: Recommendations and personalization engine powered by machine learning.
06 ADVANCED REPORTING: Dynamic dashboards and smart visualizations that make data tell stories and drive insights.
07 RAW DATA ACCESS: Clean, structured data that teams can analyze directly.
Key Infrastructure Components
CloudFront
AWS ALB
ECS
Kinesis Streams
S3
EMR (Spark)
SageMaker
Aurora
Redshift
Elasticsearch Service
DynamoDB
Databricks
Apache Airflow
Architecture
Trackers
Data / Event Trackers: Android SDK, iOS SDK, JS SDK, PHP SDK, Java SDK
Trackers allow us to collect data from any type of digital application, service or device. All trackers adhere to the LENS Tracker Protocol.
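As an illustration only, the Python sketch below shows what a tracker call to the collector might look like; the event fields and the collector endpoint URL are assumptions, since the LENS Tracker Protocol itself is not spelled out here.

# Hypothetical illustration only: the payload fields and endpoint below are assumed,
# not taken from the LENS Tracker Protocol specification.
import json
import time
import uuid

import requests  # pip install requests

COLLECTOR_URL = "https://collector.example.com/event"  # hypothetical Scribe event-tracker endpoint

def track_event(anonymous_id: str, event_type: str, properties: dict) -> None:
    """Send a single analytics event to the collector over HTTPS."""
    payload = {
        "event_id": str(uuid.uuid4()),       # de-duplication key
        "anonymous_id": anonymous_id,        # device/browser identity
        "event_type": event_type,            # e.g. "page_view", "video_play"
        "timestamp_ms": int(time.time() * 1000),
        "properties": properties,
    }
    resp = requests.post(COLLECTOR_URL, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"}, timeout=2)
    resp.raise_for_status()

if __name__ == "__main__":
    track_event("anon-1234", "page_view",
                {"url": "https://www.manoramaonline.com/", "referrer": "google"})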
Collectors - Scribe
Data Collectors
• Horizontally scalable
• Engineered for high concurrency
• Designed for low latency
• Written in Go/Java
Scribe collects data from the trackers and writes it to Kinesis Data Firehose. This allows near real-time processing of the data as well as storage in the data lake for further batch analysis. ECS Fargate is used for containerization. (A minimal sketch of the Firehose write path follows the endpoint list below.)
Scribe API endpoints
• Event tracker
• Pixel tracker
• Click tracker
• AMP tracker
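A minimal sketch of the collector's write path, assuming a delivery stream named "lens-events"; the production Scribe service is written in Go/Java, so this Python/boto3 version is illustrative only.

# Minimal sketch, assuming the "lens-events" delivery stream already exists.
import json

import boto3

firehose = boto3.client("firehose", region_name="ap-south-1")  # region is an assumption

def forward_to_firehose(event: dict) -> None:
    """Write a single validated tracker event to Kinesis Data Firehose."""
    firehose.put_record(
        DeliveryStreamName="lens-events",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},  # newline-delimited JSON
    )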
Accumulo / Data Lake
ACCUMULO
The data consumer component, responsible for:
• Reading data from the event firehose (Kinesis Streams)
• Performing rudimentary data quality checks
• Converting data to Avro format with Snappy compression
• Loading it into the Data Lake
(A rough consumer sketch follows this section.)
DATA LAKE
The Data Lake supports the following capabilities:
• Capture and store raw data securely, at scale and at low cost
• Store many types of data in the same repository
• Define the structure of the data at the time it is used
It is designed to:
• Retain all data
• Support all data types
• Adapt easily to changes
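The consumer flow could be sketched roughly as follows; the stream name, bucket name and Avro schema here are assumptions, not the actual Accumulo implementation.

# Minimal sketch of the consumer flow (Kinesis -> Avro/Snappy -> S3 data lake).
# Requires: pip install boto3 fastavro python-snappy
import io
import json
import time

import boto3
from fastavro import parse_schema, writer

SCHEMA = parse_schema({
    "name": "Event", "type": "record",
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "anonymous_id", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
    ],
})

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

def drain_shard(shard_iterator: str) -> None:
    """Read a batch of events, convert to Avro (Snappy-compressed) and land them in the data lake."""
    batch = kinesis.get_records(ShardIterator=shard_iterator, Limit=500)
    records = [json.loads(r["Data"]) for r in batch["Records"]]
    records = [r for r in records if "event_type" in r]  # rudimentary data quality check
    if not records:
        return
    buf = io.BytesIO()
    writer(buf, SCHEMA, records, codec="snappy")
    key = f"raw/events/{int(time.time())}.avro"
    s3.put_object(Bucket="lens-data-lake", Key=key, Body=buf.getvalue())  # hypothetical bucket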
Prism - Processing Engine
Prism, the unified processing and analytics engine, uses Apache Spark and is written in Scala. It can run on EMR 5.27 or as a Databricks job on AWS Spot/On-Demand instances.
Prism components
Data Cleanser: performs data cleansing, including:
• Normalization
• De-duplication
• Bot exclusion
• Fixes for client clock issues
Data Enricher: performs enrichment activities, including:
• User agent parsing to understand OS/platform
• Referrer parsing to understand channels
• IP-to-location transformation
• Lat/long-to-location transformation
• Widening event data with user profile information
Data Quality Checks: performs the data quality checks needed to detect, report and omit instrumentation errors.
Data Reconciler: reconciles sacrosanct data, such as transactions, against the feeds generated by the master DB.
Sessionization / User Merging: sessionizes events and merges users based on domain/anonymous ID.
Data Refresher: loads the data into the respective tables in the data warehouse and other reporting data stores.
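A rough PySpark sketch of a few of these cleansing and enrichment steps is shown below; Prism itself is written in Scala, and the column names and rules here are illustrative assumptions only.

# Illustrative PySpark sketch of cleansing/enrichment; paths, columns and rules are assumed.
# Reading Avro requires the spark-avro package on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("prism-cleanse-sketch").getOrCreate()

events = spark.read.format("avro").load("s3://lens-data-lake/raw/events/")  # hypothetical path

cleansed = (
    events
    .dropDuplicates(["event_id"])                                      # de-duplication
    .filter(~F.lower(F.col("user_agent")).rlike("bot|spider|crawl"))   # naive bot exclusion
    .withColumn(                                                       # clamp client clock drift
        "timestamp_ms",
        F.when(F.col("timestamp_ms") > F.col("server_timestamp_ms"),
               F.col("server_timestamp_ms")).otherwise(F.col("timestamp_ms")))
)

enriched = cleansed.withColumn(                                        # crude channel from referrer
    "channel",
    F.when(F.col("referrer").contains("google"), "search")
     .when(F.col("referrer").contains("facebook"), "social")
     .otherwise("direct"))

enriched.write.mode("overwrite").parquet("s3://lens-data-lake/enriched/events/")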
Prism - Real-time Analytics (Spark Structured Streaming)
• Uses Structured Streaming to stream live events into Elasticsearch.
• The stack can run on both EMR and Databricks.
• Runs on 50 r4.xlarge instances, scaled to 100 instances during election time.
• Configuration:
spark.executor.cores=4
spark.executor.memory=25g
spark.executor.instances=50
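A hedged Structured Streaming sketch of the Kinesis-to-Elasticsearch path: the "kinesis" source requires a Kinesis connector (for example the Databricks source or spark-sql-kinesis), the "es" sink requires the elasticsearch-hadoop package, and the stream, host and index names are assumptions.

# Sketch only; not the production Prism streaming job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prism-realtime-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kinesis")                       # connector-provided source
    .option("streamName", "lens-events")
    .option("region", "ap-south-1")
    .option("initialPosition", "latest")
    .load()
)

query = (
    events.writeStream
    .format("es")                            # elasticsearch-hadoop structured streaming sink
    .option("es.nodes", "search-lens.example.com")
    .option("checkpointLocation", "s3://lens-checkpoints/realtime/")
    .start("content-analytics/events")       # target index
)
query.awaitTermination()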
Prism - Batch Analytics (Spark on EMR/Databricks)
• A scheduled job kicks off every day to process all events for the day and writes the cleansed raw/aggregated data to Redshift (the primary data store).
• It also writes the data in Parquet format so that Presto or Databricks Delta Lake can run on top of it if needed.
• Runs on 20 r4.2xlarge instances.
• Configuration:
spark.executor.cores=3
spark.executor.memory=20g
spark.executor.instances=39
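A hedged sketch of the daily batch write path: the JDBC URL, table names and S3 paths are placeholders, and the production job may use a dedicated Spark-Redshift connector rather than plain JDBC.

# Sketch of the daily batch write to Redshift and Parquet; names/paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prism-batch-sketch").getOrCreate()

daily = spark.read.parquet("s3://lens-data-lake/enriched/events/dt=2019-10-01/")  # hypothetical partition

# Aggregate page views per domain/section for the day.
agg = daily.groupBy("domain", "section").count().withColumnRenamed("count", "page_views")

# Write aggregates to Redshift (primary data store) over JDBC.
(agg.write.format("jdbc")
    .option("url", "jdbc:redshift://lens-cluster.example.com:5439/analytics")
    .option("dbtable", "daily_page_views")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())

# Also keep the day's cleansed data as Parquet for Presto / Delta Lake.
daily.write.mode("overwrite").parquet("s3://lens-warehouse/parquet/events/dt=2019-10-01/")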
Data Stores
01 DATA WAREHOUSE - AMAZON REDSHIFT (primary data store)
• Supports batch workloads
• Supports up to 50 concurrent queries
• pgpool deployed as a cache layer
• WLM and concurrency scaling enabled
• Elastic Resize
• Redshift Spectrum to query archived data in S3
02 REAL-TIME REPORTING STORE - ELASTICSEARCH
Backs the real-time content analytics dashboard.
• Fluid dashboard with granular filters
• Data visualization using Kibana
03 RECOMMENDATION RESULTS - DYNAMODB
Horizontal scalability, low operational overhead and predictable performance make DynamoDB a good choice for storing recommendation results.
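A minimal read/write sketch for the recommendation results store, assuming a table named "recommendations" keyed by user_id; the real table layout is not specified in the deck.

# Minimal DynamoDB sketch; table and attribute names are assumptions.
import boto3

table = boto3.resource("dynamodb", region_name="ap-south-1").Table("recommendations")

def save_recommendations(user_id: str, article_ids: list) -> None:
    """Store the latest recommended article ids for a user."""
    table.put_item(Item={"user_id": user_id, "article_ids": article_ids})

def get_recommendations(user_id: str) -> list:
    """Fetch recommendations for a user; empty list if none are stored yet."""
    resp = table.get_item(Key={"user_id": user_id})
    return resp.get("Item", {}).get("article_ids", [])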
Orchestration - Apache Airflow
Workflow Management: used to programmatically author, schedule and monitor workflows.
Rich UI: makes it easy to visualize pipelines running in production, monitor progress and troubleshoot issues when needed.
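A minimal Airflow DAG sketch for scheduling the daily Prism batch job; the schedule, spark-submit command and job location are assumptions, not the production pipeline definition.

# Minimal Airflow DAG sketch (Airflow 2.x import paths).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="prism_daily_batch",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 1 * * *",   # run once a day at 01:00
    catchup=False,
) as dag:
    run_batch = BashOperator(
        task_id="run_prism_batch",
        bash_command=(
            "spark-submit --conf spark.executor.cores=3 "
            "--conf spark.executor.memory=20g "
            "--conf spark.executor.instances=39 "
            "s3://lens-jobs/prism_batch.py"  # hypothetical job location
        ),
    )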
Data Retention Strategy
 Find a balance between what’s optimal for your clients’ business needs vs. operational cost effectiveness
 Ensure the data retention policies align with the regulatory restrictions(GDPR)
 Define proper life cycle policies at different stages
 S3-IA/Glacier lifecycle policy defined for the data at rest in Data lake and a scheduled purging policy defined
for the primary data store(redshift)
 We keep a quarter worth of data in the primary data store(redshift) and older data is archived to S3.
 Redshift Spectrum is used for detailed analysis of older data.
 For YOY, QOQ comparison we pre-calculate it as part of the quarterly process and store the aggregated results
into the data store.
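An illustrative S3 lifecycle policy along the lines described above; the bucket, prefix and transition windows are assumptions.

# Sketch only: bucket, prefix and day thresholds are made up for illustration.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="lens-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-events-tiering",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},   # after roughly one quarter
                    {"Days": 365, "StorageClass": "GLACIER"},      # long-term archive
                ],
            }
        ]
    },
)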
Dashboard - KPIs / Different Angles
Domain-specific KPIs (key metrics in the content dashboard):
• Page views
• New and returning visitors
• Engaged time
• Social shares and referrals
• Bounce rate
• Video play rate
Different angles (explore the content data from these angles):
• Titles
• Authors
• Sections
• Tags
• Referrers
• Campaigns
• Google AMP / Facebook IA
Scalability / Performance
01 The collection, storage and processing layers are designed to autoscale.
02 Batch analytics takes an average of 30-40 minutes to process and refresh data for the entire day across all reporting dashboards.
03 Turnaround latency at the data collector: 27 ms at the 75th percentile and 156 ms at the 95th percentile.
04 Currently handles about 150 GB of data per day, with an average of 300 million events processed per day.
05 Horizontally scalable data collectors, data consumers, data processors and data reporting stores.
06 The real-time streaming stack currently processes 500K events in less than 10 seconds.
Best Practices in Spark
• Use Datasets, DataFrames and Spark SQL instead of RDDs to get the benefits of the Catalyst optimizer.
• Choose the best data format and compression:
  • Apache Parquet gives the fastest read performance in Spark thanks to its vectorized Parquet reader; run Presto or Databricks Delta Lake on top if needed.
  • Avro offers rich schema support and more efficient writes than Parquet.
  • Choose Snappy or LZO compression, as they balance splittability and block compression.
• Use the Spark Web UI to explore your jobs, tasks, storage and SQL query plans to optimize Spark execution:
  • Look at the Spark event timeline to see how much time each stage/task takes.
  • Check the shuffles between stages and the amount of data shuffled (use the spark.sql.shuffle.partitions option if needed).
• Check the join algorithms being used (see the sketch after this list):
  • A broadcast join should be used when one table is small.
  • A sort-merge join should be used for large tables; bucketing can be used to pre-sort and group tables, which avoids shuffling in the sort-merge join.
• Enable Dynamic Partition Pruning, flattenScalarSubqueriesWithAggregates, Bloom Filter Join and Optimized Join Reorder.
• Use the s3 protocol instead of s3a/s3n to refer to data so that reads go through the optimized path.
• Use EMRFS consistent view only if it is required.
• Find an optimal configuration for the number of executors, the memory per executor and the number of cores for the Spark job.
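An illustrative PySpark snippet for two of the practices above, shuffle-partition tuning and an explicit broadcast join; the table names, paths and sizes are made up.

# Sketch of shuffle tuning + broadcast join; data locations are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()

# Right-size shuffle partitions for the data volume instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "400")

events = spark.read.parquet("s3://lens-warehouse/parquet/events/")      # large fact table
sections = spark.read.parquet("s3://lens-warehouse/parquet/sections/")  # small dimension table

# Broadcast the small dimension so the join avoids shuffling the large side.
joined = events.join(broadcast(sections), on="section_id", how="left")

joined.groupBy("section_name").count().show()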
Outcomes
01 Ability to run targeted mobile push and email campaigns.
02 Consistent KPI measurement: the client has a consistent framework across properties to measure KPIs.
03 A single source of truth: the dozens of independently managed data collections that previously created silos are now unified, making it far easier to identify what data is available, access it and integrate it.
04 Better user experience: recommendations running off the data in the Data Lake add value to the digital properties we manage.
05 Better business agility and product decisions based on behavioural insights: the journey from data to decisions is made swifter.