Real Time Analytics
at Uber: Bring SQL
into Everything
Zhenxiao Luo
NYC
Uber’s mission is to
ignite opportunity by
setting the world in
motion.
15M
Trips/Day
600+
Cities
75M
Monthly Riders
Data informs every decision at the company
Overview of Uber’s Data Platform
DATA SOURCES
RAW DATA
MODELED TABLES
MINING BUSINESS INSIGHTS
CONSUMING BUSINESS INSIGHTS
EXPERIMENTATION
DATA SCIENCE
MACHINE LEARNING
CUSTOM DATA SETS
Dashboarding
Alerting
Monitoring
Data Exploration
Knowledge Bases
Storage
Infrastructure
ETL Frameworks
Data Integrity
Query Engines
Uber Data Infrastructure
[Figure: architecture diagram. Labels: Kafka, Schemaless, MySQL, Postgres, Vertica, Streamio, Raw Data, Raw Tables, Sqoop, Reports, Hadoop, Hive, Presto, Spark, Notebook, Ad Hoc Queries, Real Time Applications, Machine Learning Jobs, Business Intelligence Jobs, Cluster Management, All-Active, Observability, Security, Samza, Pinot, Flink, AresDB, Modeled Tables, Streaming, Warehouse, Real-time]
Presto @ Uber-scale
5K Weekly Active Users
160K Queries/day
3 Data Centers
2K Nodes
700M HDFS files read/day
10PB HDFS files processed/day
Presto use cases at Uber
Growth Marketing
Data Science
Marketplace
Pricing
Community
Operations
Data Quality
Ad-hoc Querying
The people who rely on us
● Inventor Ivan
○ Roles: Data Scientists, Software Engineers, ML/AI Researchers
○ Technical skills: Advanced SQL, Advanced Statistics, Scala/Spark, Python/R, Data Modeling
● Reliant Rebecca
○ Roles: Marketing Managers, Entry-level Analysts, General Managers, Product Managers
○ Technical skills: Limited SQL, Spreadsheets
● Monitoring Matt
○ Roles: City Operations, Regional Managers
○ Technical skills: Intermediate SQL, Spreadsheets, Dashboarding
● Analyst Anna
○ Roles: Operations Managers, Data Analysts, Product Analysts
○ Technical skills: Advanced SQL, Spreadsheets, Limited Statistics, Limited Python/R
● Exploratory ML & model training
○ Data Scientists, ML Researchers, Engineers
○ Using ML to ensure data security and compliance
● Advanced data science & complex analytics
○ Data Scientists, Ops Analysts, Support Agents
○ Surfacing hidden insights to empower restaurants
● Business process automation
○ S&P Analysts, Ops Managers, Contractors
○ Using technology to make transportation safer
What is Presto: Interactive SQL
Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested by Facebook, Uber, LinkedIn, Twitter, Netflix, Airbnb, etc.
Completely open source
Access to petabytes of data in Hadoop, Elasticsearch, Pinot, etc.
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
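The pipelining and columnar points above can be illustrated with a toy sketch. This is not Presto code: `Page`, `scan`, `filter_eq`, and `limit` are hypothetical names chosen for illustration. Pages of column arrays flow through chained generators, so a downstream operator can stop the scan early instead of waiting for the full input.

```python
from typing import Dict, Iterator, List

# A "page" is a small batch of rows stored column by column (hypothetical layout).
Page = Dict[str, List]


def scan(rows: List[dict], page_size: int, log: List[int]) -> Iterator[Page]:
    """Yield columnar pages lazily; `log` records which pages were produced."""
    for start in range(0, len(rows), page_size):
        chunk = rows[start:start + page_size]
        log.append(start // page_size)
        yield {name: [r[name] for r in chunk] for name in chunk[0]}


def filter_eq(pages: Iterator[Page], column: str, value) -> Iterator[Page]:
    """Keep only positions where `column == value`, one page at a time."""
    for page in pages:
        keep = [i for i, v in enumerate(page[column]) if v == value]
        if keep:
            yield {name: [col[i] for i in keep] for name, col in page.items()}


def limit(pages: Iterator[Page], n: int) -> Iterator[Page]:
    """Stop pulling from upstream as soon as `n` rows have been emitted."""
    remaining = n
    if remaining <= 0:
        return
    for page in pages:
        size = len(next(iter(page.values())))
        take = min(size, remaining)
        yield {name: col[:take] for name, col in page.items()}
        remaining -= take
        if remaining == 0:
            return


rows = [{"city_id": i % 3, "status": "completed"} for i in range(100)]
pages_scanned: List[int] = []
pipeline = limit(filter_eq(scan(rows, 10, pages_scanned), "city_id", 1), 5)
result = [v for page in pipeline for v in page["city_id"]]
```

Only the first two of the ten pages are scanned here, because the limit operator stops requesting input once it has five rows.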
Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
○ Short running queries higher priority
● Memory Management
○ Max memory per query per node
○ If query exceeds max memory limit, query fails
○ No OutOfMemory in Presto process
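The memory-management bullets can be sketched as a per-node tracker that fails the offending query instead of letting the process run out of memory. This is a minimal illustration under stated assumptions, not Presto's implementation: `NodeMemoryTracker` and `ExceededMemoryLimit` are hypothetical names.

```python
from typing import Dict


class ExceededMemoryLimit(Exception):
    """Raised for one query; the worker process itself keeps running."""


class NodeMemoryTracker:
    """Tracks reserved bytes per query on a single (hypothetical) worker node."""

    def __init__(self, max_bytes_per_query: int) -> None:
        self.max_bytes_per_query = max_bytes_per_query
        self.reserved: Dict[str, int] = {}

    def reserve(self, query_id: str, num_bytes: int) -> None:
        current = self.reserved.get(query_id, 0)
        if current + num_bytes > self.max_bytes_per_query:
            # Fail only the offending query and release what it held.
            self.reserved.pop(query_id, None)
            raise ExceededMemoryLimit(
                f"{query_id} exceeded {self.max_bytes_per_query} bytes on this node"
            )
        self.reserved[query_id] = current + num_bytes

    def release(self, query_id: str) -> None:
        """Free everything a finished query held."""
        self.reserved.pop(query_id, None)


tracker = NodeMemoryTracker(max_bytes_per_query=100)
tracker.reserve("q1", 60)
tracker.reserve("q2", 80)
try:
    tracker.reserve("q1", 50)  # 60 + 50 > 100, so q1 fails
    q1_failed = False
except ExceededMemoryLimit:
    q1_failed = True
```

After the failure, `q2` keeps its reservation and continues, which is the behavior the last bullet describes.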
Presto Connectors:
No Need to Copy Data
Uber Contributions
Contributions
New Features
● Geospatial indexing and operations - 10x or more speedup
● Pinot connector enhancements (in-house)
Optimizations
● Elasticsearch connector
● New Parquet reader - 4x speedup
● Nested column pushdowns (project, predicate) - 10x speedup
Security
● Metastore authentication support for Kerberos deployments
● Dispatch Proxy using HTTP redirect for multi-cluster operation
Presto Connector Interface
● ConnectorMetadata
○ Schema, Table, Column
● ConnectorSplitManager
○ Divide data into splits
● ConnectorSplit
○ Split data range
○ Predicate/JsonFunction/Limit pushdown
● ConnectorRecordCursor
○ Transform underlying storage data into Presto internal
page/block
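The four roles listed above can be mimicked in a few lines. The real connector interface is a Java SPI. The Python classes below (`ToyMetadata`, `ToySplitManager`, `Split`, `ToyRecordCursor`) are hypothetical analogues whose method names do not mirror the Java signatures. They only show how metadata, split planning, and row-to-column conversion fit together.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Split:
    """A contiguous row range of one table (the 'split data range')."""
    table: str
    start: int
    end: int


class ToyMetadata:
    """Metadata role: exposes tables and their columns."""

    def __init__(self, tables: Dict[str, List[str]]) -> None:
        self._tables = tables

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def columns(self, table: str) -> List[str]:
        return self._tables[table]


class ToySplitManager:
    """Split-manager role: divides a table into splits of at most `split_size` rows."""

    def __init__(self, row_counts: Dict[str, int], split_size: int) -> None:
        self._row_counts = row_counts
        self._split_size = split_size

    def splits(self, table: str) -> List[Split]:
        total = self._row_counts[table]
        return [
            Split(table, start, min(start + self._split_size, total))
            for start in range(0, total, self._split_size)
        ]


class ToyRecordCursor:
    """Cursor role: turns stored rows for one split into column-oriented blocks."""

    def __init__(self, storage: Dict[str, List[Tuple]], columns: List[str]) -> None:
        self._storage = storage
        self._columns = columns

    def read(self, split: Split) -> Dict[str, List]:
        rows = self._storage[split.table][split.start:split.end]
        return {name: [row[i] for row in rows] for i, name in enumerate(self._columns)}


storage = {"trips": [(1, "completed"), (2, "canceled"), (1, "completed")]}
metadata = ToyMetadata({"trips": ["city_id", "status"]})
split_manager = ToySplitManager({"trips": 3}, split_size=2)
cursor = ToyRecordCursor(storage, metadata.columns("trips"))
blocks = [cursor.read(split) for split in split_manager.splits("trips")]
```

Each split can be read independently, which is what lets an engine hand different splits to different workers.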
Presto Elasticsearch Connector
Data Model
● each Elasticsearch index is a table partition
● each field of an index is a column
● all Elasticsearch indexes sharing the same prefix
together form a logical table
○ es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
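The prefix rule can be sketched as a small lookup. The matching rule below is an assumption made for illustration (an index belongs to a table when its name equals the table name or extends it with a `-<suffix>`), and `indices_for_table` is a hypothetical helper rather than part of the connector.

```python
from typing import List


def indices_for_table(table: str, all_indices: List[str]) -> List[str]:
    """Return the indices that act as partitions of the logical table `table`.

    Assumption: an index belongs to the table when its name equals the table
    name or extends it with a '-<suffix>' such as a data-center tag.
    """
    prefix = table.lower()
    return sorted(
        name
        for name in all_indices
        if name.lower() == prefix or name.lower().startswith(prefix + "-")
    )


cluster_indices = [
    "es-vehicles-sjc1",
    "es-vehicles-dca1",
    "es-vehicles",
    "es-drivers-sjc1",  # different prefix, so a different logical table
]
partitions = indices_for_table("es-vehicles", cluster_indices)
```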
Describe Table
Query
Optimizations
● Parallel Reads
○ Get all indices and search nodes
○ For each search node, send request for one specific index
● Cap Max Hits
● Predicate Pushdown
● Json Function Pushdown
● Limit Pushdown
● Nested Fields
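A rough sketch of how several of these optimizations might combine. Equality predicates and the LIMIT are translated into an Elasticsearch search body (`term` filters, `size`, and `_source` are standard request fields), the hit count is capped, and one request per index is issued concurrently. `build_search_body`, `parallel_read`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP search call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def build_search_body(
    equals: Dict[str, object],
    columns: List[str],
    limit: Optional[int],
    max_hits: int,
) -> dict:
    """Translate pushed-down predicates and LIMIT into a search request body."""
    size = max_hits if limit is None else min(limit, max_hits)
    return {
        "_source": columns,  # fetch only the referenced fields
        "query": {
            "bool": {"filter": [{"term": {field: value}} for field, value in equals.items()]}
        },
        "size": size,  # limit pushdown, capped by max_hits
    }


def parallel_read(
    indices: List[str],
    body: dict,
    fetch: Callable[[str, dict], List[dict]],
) -> List[dict]:
    """Issue one request per index concurrently and concatenate the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        per_index = list(pool.map(lambda name: fetch(name, body), indices))
    return [hit for hits in per_index for hit in hits]


def fake_fetch(index: str, body: dict) -> List[dict]:
    # Stand-in for an HTTP search call; returns at most body["size"] fake hits.
    return [{"_index": index, "n": i} for i in range(min(2, body["size"]))]


body = build_search_body({"city_id": 12}, ["city_id", "status"], limit=1, max_hits=10000)
hits = parallel_read(["es-vehicles-sjc1", "es-vehicles-dca1"], body, fake_fetch)
```

Because the limit is pushed to each index separately, the engine would still need to apply the final LIMIT over the combined hits.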
How many Uber trip
requests did we serve
in Chicago yesterday?
Fetch daily trip count in seconds
SELECT T.base.city_id AS cid,
       COUNT(CASE WHEN T.base.status = 'completed' THEN 1 END) AS completed_trips,
       COUNT(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS rider_canceled_trips
FROM trips AS T
WHERE T.datestr = '2019-03-11'
GROUP BY 1
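The conditional-count idiom in this query works because a `CASE` without `ELSE` yields NULL and `COUNT` ignores NULLs. The check below runs the same shape of query on SQLite through Python's `sqlite3` module. The rows are made up, and the nested `base.*` fields are flattened into plain columns so the example runs locally. Both are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city_id INTEGER, status TEXT, datestr TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "completed", "2019-03-11"),
        (1, "completed", "2019-03-11"),
        (1, "canceled", "2019-03-11"),
        (2, "completed", "2019-03-11"),
        (2, "canceled", "2019-03-10"),  # filtered out by the date predicate
    ],
)
counts = conn.execute(
    """
    SELECT city_id AS cid,
           COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed_trips,
           COUNT(CASE WHEN status = 'canceled' THEN 1 END) AS rider_canceled_trips
    FROM trips
    WHERE datestr = '2019-03-11'
    GROUP BY city_id
    ORDER BY city_id
    """
).fetchall()
conn.close()
```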
Default Apache Parquet Reader
[Figure: a Parquet file made of row groups, each holding column chunks for base.client_uuid, base.driver_uuid, base.status, base.vehicle_id, and base.city_id, followed by the Parquet footer (file metadata, row group metadata). Rows containing every base.* field are passed to the Presto columnar engine.]
Step 1: Read all Parquet nested fields from disk
Step 2: Transform Parquet rows into Presto columnar blocks
Step 3: Evaluate predicates on columnar blocks
Apache Parquet Reader Optimization
[Figure: the same Parquet file layout (row groups of column chunks plus the footer). Only the base.driver_uuid and base.city_id column chunks are read into the Presto columnar engine.]
Step 1: Read ONLY required nested fields from disk
● Evaluate predicates on the fly
○ Skip reading row group; predicate: base.city_id = 12, dictionary: base.city_id: {3, 5, 9, 14, 21}
○ Build columnar blocks only for predicate matches
Step 2: Build columnar blocks on the fly
Step 3: Evaluate predicates on columnar blocks
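The optimized read path can be sketched as follows. This is a toy model under stated assumptions, not the actual reader: `RowGroup` and `read_matching` are hypothetical, and each column's dictionary is modeled as the set of its values. The sketch skips a row group when the predicate value is absent from the dictionary, touches only the required columns, and builds output blocks only for matching rows.

```python
from typing import Dict, List, Set


class RowGroup:
    """Toy stand-in for a Parquet row group: column chunks plus per-column dictionaries."""

    def __init__(self, columns: Dict[str, List]) -> None:
        self.columns = columns
        self.dictionaries: Dict[str, Set] = {name: set(vals) for name, vals in columns.items()}


def read_matching(
    row_groups: List[RowGroup],
    needed: List[str],
    pred_col: str,
    pred_val,
    stats: Dict[str, int],
) -> Dict[str, List]:
    """Return columnar blocks for `needed` columns where `pred_col == pred_val`."""
    out: Dict[str, List] = {name: [] for name in needed}
    for group in row_groups:
        # Dictionary check: if the value never occurs, skip the whole row group.
        if pred_val not in group.dictionaries[pred_col]:
            stats["row_groups_skipped"] += 1
            continue
        stats["row_groups_read"] += 1
        matches = [i for i, v in enumerate(group.columns[pred_col]) if v == pred_val]
        for name in needed:  # columns outside `needed` are never touched
            chunk = group.columns[name]
            out[name].extend(chunk[i] for i in matches)
    return out


groups = [
    RowGroup({
        "base.city_id": [3, 5, 9, 14, 21],
        "base.driver_uuid": ["a", "b", "c", "d", "e"],
        "base.status": ["completed"] * 5,
    }),
    RowGroup({
        "base.city_id": [12, 3, 12],
        "base.driver_uuid": ["f", "g", "h"],
        "base.status": ["completed", "canceled", "completed"],
    }),
]
stats = {"row_groups_skipped": 0, "row_groups_read": 0}
blocks = read_matching(groups, ["base.driver_uuid"], "base.city_id", 12, stats)
```

The first row group carries the dictionary {3, 5, 9, 14, 21} from the slide, so the predicate `base.city_id = 12` rules it out without reading any of its data.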
Results
Looking forward
Federated SQL Layer
Vision
HDFS
Vertica
Elasticsearch
Apache
Pinot
MySQL
Machines
Reports
Users
Presto RealTime Presto
Proxy layer
Management
Universal
Metadata
Service
Focus areas
Connectors
● Apache Hive, Apache Pinot, Elasticsearch, Apache Cassandra, Vertica, MySQL, etc
● Aggregation / Join pushdown
● Cross-connector optimizations (hybrid connectors)
Real-time
● Real-time mode with low latency pass through
● Query plan / result / data cache
● Time-series joins and stitching
Universal Metadata Service (UMS)
● Logical definitions / physical schemas
● Column stitching and joins
● Table and partition caching
Thank you
Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from
Uber. This document is intended only for the use of the individual or entity to
whom it is addressed. All recipients of this document are notified that the
information contained herein includes proprietary information of Uber, and
recipient may not make use of, disseminate, or in any way disclose this
document or any of the enclosed information to any person other than
employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

 
peru primero de la alianza con el pacifico
peru primero de la alianza con el pacificoperu primero de la alianza con el pacifico
peru primero de la alianza con el pacifico
 
HistorySrSec2024 daahi sadhin sgg-25.pdf
HistorySrSec2024 daahi sadhin sgg-25.pdfHistorySrSec2024 daahi sadhin sgg-25.pdf
HistorySrSec2024 daahi sadhin sgg-25.pdf
 
Introduction to Augmented Reality (AR) and Virtual Reality (.pptx
Introduction to Augmented Reality (AR) and Virtual Reality (.pptxIntroduction to Augmented Reality (AR) and Virtual Reality (.pptx
Introduction to Augmented Reality (AR) and Virtual Reality (.pptx
 
一比一原版(uofr学位证书)罗切斯特大学毕业证如何办理
一比一原版(uofr学位证书)罗切斯特大学毕业证如何办理一比一原版(uofr学位证书)罗切斯特大学毕业证如何办理
一比一原版(uofr学位证书)罗切斯特大学毕业证如何办理
 
Top UI/UX Design Trends for 2024: What Business Owners Need to Know
Top UI/UX Design Trends for 2024: What Business Owners Need to KnowTop UI/UX Design Trends for 2024: What Business Owners Need to Know
Top UI/UX Design Trends for 2024: What Business Owners Need to Know
 
Celebrity Navi Mumbai Call Girls 🥰 9967584737 🥰 Escorts Service Available Mumbai
Celebrity Navi Mumbai Call Girls 🥰 9967584737 🥰 Escorts Service Available MumbaiCelebrity Navi Mumbai Call Girls 🥰 9967584737 🥰 Escorts Service Available Mumbai
Celebrity Navi Mumbai Call Girls 🥰 9967584737 🥰 Escorts Service Available Mumbai
 
Pigasus 2.0: FPGA‐Accelerated Intrusion Detection/Prevention System
Pigasus 2.0: FPGA‐Accelerated Intrusion Detection/Prevention SystemPigasus 2.0: FPGA‐Accelerated Intrusion Detection/Prevention System
Pigasus 2.0: FPGA‐Accelerated Intrusion Detection/Prevention System
 
Full Night Fun With Call Girls Lucknow📞7737669865 At Very Cheap Rates Doorste...
Full Night Fun With Call Girls Lucknow📞7737669865 At Very Cheap Rates Doorste...Full Night Fun With Call Girls Lucknow📞7737669865 At Very Cheap Rates Doorste...
Full Night Fun With Call Girls Lucknow📞7737669865 At Very Cheap Rates Doorste...
 

Real Time Analytics at Uber @ Strata Data 2019

  • 1. Real Time Analytics at Uber: Bring SQL into Everything Zhenxiao Luo
  • 2. NYC. Uber’s mission is to ignite opportunity by setting the world in motion. 15M trips/day, 600+ cities, 75M monthly riders.
  • 3. Data informs every decision at the company
  • 4. Overview of Uber’s Data Platform DATA SOURCES RAW DATA MODELED TABLES MINING BUSINESS INSIGHTS CONSUMING BUSINESS INSIGHTS EXPERIMENTATION DATA SCIENCE MACHINE LEARNING CUSTOM DATA SETS Dashboarding Alerting Monitoring Data Exploration Knowledge Bases Storage Infrastructure ETL Frameworks Data Integrity Query Engines
  • 5. Kafka Uber Data Infrastructure Schemaless MySQL, Postgres Vertica Streamio Raw Data Raw Tables Sqoop Reports Hadoop Hive Presto Spark Notebook Ad Hoc Queries Real Time Applications Machine Learning Jobs Business Intelligence Jobs Cluster Management All-Active Observability Security Vertica Samza Pinot Flink AresDB Modeled Tables Streaming Warehouse Real-time
  • 6. Presto @ Uber-scale: 5K weekly active users, 160K queries/day, 3 data centers, 2K nodes, 700M HDFS files read/day, 10PB HDFS files processed/day
  • 7. Presto use cases at Uber Growth Marketing Data Science Marketplace Pricing Community Operations Data Quality Ad-hoc Querying
  • 8. The people who rely on us, by technical skills ● Inventor Ivan (Data Scientists, Software Engineers, ML/AI Researchers): Advanced SQL, Advanced Statistics, Scala/Spark, Python/R, Data Modeling ● Reliant Rebecca (Marketing Managers, Entry-level Analysts, General Managers, Product Managers): Limited SQL, Spreadsheets ● Monitoring Matt (City Operations, Regional Managers): Intermediate SQL, Spreadsheets, Dashboarding ● Analyst Anna (Operations Managers, Data Analysts, Product Analysts): Advanced SQL, Spreadsheets, Limited Statistics, Limited Python/R
  • 9. Exploratory ML & model-training (Data Scientists, ML Researchers, Engineers): Using ML to ensure data security and compliance
  • 10. Advanced data science & complex analytics (Data Scientists, Ops Analysts, Support Agents): Surfacing hidden insights to empower restaurants
  • 11. Business process automation (S&P Analysts, Ops Managers, Contractors): Using technology to make transportation safer
  • 12. What is Presto: Interactive SQL Engine for Big Data ● Interactive query speeds ● Horizontally scalable ● ANSI SQL ● Battle-tested by Facebook, Uber, LinkedIn, Twitter, Netflix, Airbnb, etc. ● Completely open source ● Access to petabytes of data in Hadoop, Elasticsearch, Pinot, etc.
  • 14. Why Presto is Fast ● Data in memory during execution ● Pipelining and streaming ● Columnar storage & execution ● Bytecode generation
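The "pipelining and streaming" and "columnar" bullets can be illustrated with a minimal Python sketch. This is not Presto code: the `Page` type, the operator names, and the page size are assumptions chosen for illustration. Each operator consumes small columnar pages as they arrive, so only one page and a running total are held in memory at a time.

```python
from typing import Dict, Iterator, List

Page = Dict[str, List]  # one columnar batch: column name -> values


def scan_pages(city_ids: List[int], page_size: int) -> Iterator[Page]:
    """Source operator: emits small columnar pages instead of one big table."""
    for i in range(0, len(city_ids), page_size):
        yield {"city_id": city_ids[i:i + page_size]}


def filter_pages(pages: Iterator[Page], city_id: int) -> Iterator[Page]:
    """Filter operator: processes each page as soon as it arrives."""
    for page in pages:
        kept = [v for v in page["city_id"] if v == city_id]
        if kept:
            yield {"city_id": kept}


def count_rows(pages: Iterator[Page]) -> int:
    """Final aggregation: keeps only a running total in memory."""
    return sum(len(page["city_id"]) for page in pages)


total = count_rows(
    filter_pages(scan_pages([12, 3, 12, 5, 12, 9], page_size=2), city_id=12)
)
```

Because the stages are generators, no stage waits for the previous one to finish the whole input before it starts working.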
  • 15. Resource Management ● Presto has its own resource manager ○ Not on YARN ○ Not on Mesos ● CPU Management ○ Priority queues ○ Short running queries higher priority ● Memory Management ○ Max memory per query per node ○ If query exceeds max memory limit, query fails ○ No OutOfMemory in Presto process
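The memory-management rule on this slide (a per-query, per-node cap; exceeding it fails the query rather than the process) could be sketched roughly as follows. The class and method names are hypothetical and do not correspond to Presto's internal API.

```python
from dataclasses import dataclass, field


class QueryMemoryLimitExceeded(Exception):
    """Raised when a query asks for more memory than its per-node cap."""


@dataclass
class NodeMemoryPool:
    # Hypothetical per-node cap, in bytes, applied to every query.
    max_bytes_per_query: int
    reserved: dict = field(default_factory=dict)  # query_id -> bytes held

    def reserve(self, query_id: str, num_bytes: int) -> None:
        current = self.reserved.get(query_id, 0)
        if current + num_bytes > self.max_bytes_per_query:
            # Fail only the offending query; other queries keep running.
            self.reserved.pop(query_id, None)
            raise QueryMemoryLimitExceeded(query_id)
        self.reserved[query_id] = current + num_bytes

    def release(self, query_id: str) -> None:
        self.reserved.pop(query_id, None)


pool = NodeMemoryPool(max_bytes_per_query=100)
pool.reserve("q1", 60)
pool.reserve("q2", 90)
try:
    pool.reserve("q1", 50)  # 60 + 50 exceeds the cap of 100
    q1_failed = False
except QueryMemoryLimitExceeded:
    q1_failed = True
```

The point of the design is isolation: a single oversized query is rejected explicitly, so the worker never reaches an out-of-memory state that would take down every query on the node.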
  • 18. Contributions New Features ● Geospatial indexing and operations - 10x or more speedup ● Pinot connector enhancements (in-house) Optimizations ● Elasticsearch connector ● New Parquet reader - 4x speedup ● Nested column pushdowns (project, predicate) - 10x speedup Security ● Metastore authentication support for Kerberos deployments ● Dispatch Proxy using HTTP redirect for multi-cluster operation
  • 19. Presto Connector Interface ● ConnectorMetadata ○ Schema, Table, Column ● ConnectorSplitManager ○ Divide data into splits ● ConnectorSplit ○ Split data range ○ Predicate/JsonFunction/Limit pushdown ● ConnectorRecordCursor ○ Transform underlying storage data into Presto internal page/block
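To make the division of responsibilities concrete, here is a toy in-memory connector in Python. The real Presto interfaces are Java; the class names, method names, and the `TABLES` data below are illustrative stand-ins, not the actual SPI signatures.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """A contiguous row range of one table (stand-in for ConnectorSplit)."""
    table: str
    start: int
    end: int  # exclusive


class InMemoryMetadata:
    """Stand-in for ConnectorMetadata: exposes tables and columns."""

    def __init__(self, tables: dict):
        self._tables = tables

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def columns(self, table: str) -> List[str]:
        return list(self._tables[table]["columns"])


class InMemorySplitManager:
    """Stand-in for ConnectorSplitManager: divides data into splits."""

    def __init__(self, tables: dict, rows_per_split: int):
        self._tables = tables
        self._rows_per_split = rows_per_split

    def get_splits(self, table: str) -> List[Split]:
        total = len(self._tables[table]["rows"])
        step = self._rows_per_split
        return [Split(table, i, min(i + step, total)) for i in range(0, total, step)]


def record_cursor(tables: dict, split: Split) -> Iterator[Tuple]:
    """Stand-in for ConnectorRecordCursor: yields the rows of one split."""
    yield from tables[split.table]["rows"][split.start:split.end]


TABLES = {
    "vehicles": {
        "columns": ["vehicle_id", "city_id"],
        "rows": [(1, 12), (2, 3), (3, 12), (4, 5), (5, 9)],
    }
}

metadata = InMemoryMetadata(TABLES)
splits = InMemorySplitManager(TABLES, rows_per_split=2).get_splits("vehicles")
rows = [row for s in splits for row in record_cursor(TABLES, s)]
```

Each split can be handed to a different worker, which is what lets the engine read a single table in parallel.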
  • 21. Data Model ● Each Elasticsearch index is a table partition ● Each field of an index is a column ● All Elasticsearch indexes sharing the same prefix constitute a logical table ○ es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
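The prefix rule can be sketched as a small lookup. The matching convention used here (the bare prefix itself, or the prefix followed by `-`) is an assumption; the slide does not say how the prefix is delimited. The index names are lowercased versions of the slide's example, plus one hypothetical unrelated index.

```python
from typing import List


def partitions_for_table(table_prefix: str, index_names: List[str]) -> List[str]:
    """Return the indexes that back the logical table named by the prefix."""
    return sorted(
        name for name in index_names
        if name == table_prefix or name.startswith(table_prefix + "-")
    )


# "es-trips-sjc1" is a made-up index that must not be picked up.
indexes = ["es-vehicles-sjc1", "es-vehicles-dca1", "es-vehicles", "es-trips-sjc1"]
vehicle_partitions = partitions_for_table("es-vehicles", indexes)
```

Under this reading, a query against the logical table fans out to every matching index, with each index acting as one partition.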
  • 23. Query
  • 24. Optimizations ● Parallel Reads ○ Get all indices and search nodes ○ For each search node, send request for one specific index ● Cap Max Hits ● Predicate Pushdown ● Json Function Pushdown ● Limit Pushdown ● Nested Fields
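A rough sketch of three of these optimizations (parallel reads, capped hits, limit pushdown), using a fake in-memory cluster instead of real Elasticsearch calls. The fan-out of one request per (search node, index) pair is one reading of the slide, and `FAKE_SHARDS`, `fetch`, and `parallel_read` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical cluster contents: (search node, index) -> documents stored there.
FAKE_SHARDS: Dict[Tuple[str, str], List[dict]] = {
    ("node-a", "es-vehicles-sjc1"): [{"id": 1}, {"id": 2}, {"id": 3}],
    ("node-b", "es-vehicles-sjc1"): [{"id": 4}],
    ("node-a", "es-vehicles-dca1"): [{"id": 5}, {"id": 6}],
    ("node-b", "es-vehicles-dca1"): [],
}


def fetch(node: str, index: str, max_hits: int) -> List[dict]:
    """Stand-in for one search request; returns at most max_hits documents."""
    return FAKE_SHARDS.get((node, index), [])[:max_hits]


def parallel_read(nodes, indices, max_hits, limit=None):
    # One request per (search node, index) pair, issued concurrently.
    tasks = list(product(nodes, indices))
    # Limit pushdown: no single request needs more rows than the query's LIMIT.
    per_request = max_hits if limit is None else min(max_hits, limit)
    with ThreadPoolExecutor(max_workers=len(tasks) or 1) as pool:
        results = pool.map(lambda t: fetch(t[0], t[1], per_request), tasks)
        rows = [doc for chunk in results for doc in chunk]
    return rows if limit is None else rows[:limit]


NODES = ["node-a", "node-b"]
INDICES = ["es-vehicles-sjc1", "es-vehicles-dca1"]
docs = parallel_read(NODES, INDICES, max_hits=2)
limited = parallel_read(NODES, INDICES, max_hits=2, limit=3)
```

Capping hits bounds how much any single request can return, and pushing the limit down avoids fetching rows the query would discard anyway.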
  • 25. How many Uber trip requests did we serve in Chicago yesterday?
  • 26. Fetch daily trip count in seconds: SELECT T.base.city_id AS cid, COUNT(CASE WHEN T.base.status = 'completed' THEN 1 END) AS completed_trips, COUNT(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS rider_canceled_trips FROM trips AS T WHERE T.datestr = '2019-03-11' GROUP BY 1
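For readers less familiar with the `COUNT(CASE WHEN ... THEN 1 END)` idiom, the same aggregation can be written out in plain Python. The sample rows are invented; only the column names and the date come from the query on the slide.

```python
from collections import defaultdict

# Hypothetical rows shaped like the nested `base` struct in the query.
trips = [
    {"datestr": "2019-03-11", "base": {"city_id": 12, "status": "completed"}},
    {"datestr": "2019-03-11", "base": {"city_id": 12, "status": "canceled"}},
    {"datestr": "2019-03-11", "base": {"city_id": 12, "status": "completed"}},
    {"datestr": "2019-03-11", "base": {"city_id": 3, "status": "completed"}},
    {"datestr": "2019-03-10", "base": {"city_id": 3, "status": "canceled"}},
]


def daily_trip_counts(rows, datestr):
    counts = defaultdict(lambda: {"completed_trips": 0, "rider_canceled_trips": 0})
    for row in rows:
        if row["datestr"] != datestr:            # WHERE T.datestr = ...
            continue
        bucket = counts[row["base"]["city_id"]]  # GROUP BY 1
        status = row["base"]["status"]
        if status == "completed":                # COUNT(CASE WHEN ... THEN 1 END)
            bucket["completed_trips"] += 1
        elif status == "canceled":
            bucket["rider_canceled_trips"] += 1
    return dict(counts)


result = daily_trip_counts(trips, "2019-03-11")
```

The query touches only `base.city_id`, `base.status`, and the `datestr` partition column, which is why reading just those nested fields pays off.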
  • 27. Default Apache Parquet Reader. Parquet file layout: row groups, each holding one column chunk per nested field (base.client_uuid, base.driver_uuid, base.status, base.vehicle_id, base.city_id), plus a footer with file metadata and row group metadata. Step 1: Read all Parquet nested fields from disk. Step 2: Transform Parquet rows into Presto columnar blocks. Step 3: Evaluate predicates on columnar blocks.
  • 28. Apache Parquet Reader Optimization. Same Parquet file layout (row groups of column chunks, footer with file metadata and row group metadata). Step 1: Read ONLY required nested fields from disk (shown: base.driver_uuid, base.city_id). Evaluate predicates on the fly: skip reading a row group when its dictionary rules out the predicate, e.g. predicate base.city_id = 12 against dictionary base.city_id: {3, 5, 9, 14, 21}. Step 2: Build columnar blocks on the fly, only for predicate matches. Step 3: Evaluate predicates on columnar blocks.
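The row-group skipping idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: `ROW_GROUPS` is an invented in-memory stand-in for a Parquet file, and the "dictionary" is computed from the column values here, whereas a real Parquet file stores it in dictionary pages that can be read without touching the data pages.

```python
from typing import Dict, List

# Hypothetical stand-in for a Parquet file: each row group stores one list
# per nested column.
ROW_GROUPS: List[Dict[str, list]] = [
    {
        "base.city_id": [3, 5, 9],
        "base.driver_uuid": ["d1", "d2", "d3"],
        "base.status": ["completed"] * 3,
    },
    {
        "base.city_id": [12, 14, 12],
        "base.driver_uuid": ["d4", "d5", "d6"],
        "base.status": ["canceled"] * 3,
    },
]


def scan(row_groups, predicate_column, predicate_value, projected_columns):
    out = {col: [] for col in projected_columns}
    groups_read = 0
    for group in row_groups:
        # Dictionary check: the set of distinct values in the predicate column.
        dictionary = set(group[predicate_column])
        if predicate_value not in dictionary:
            continue  # skip the whole row group without reading other columns
        groups_read += 1
        matches = [
            i for i, v in enumerate(group[predicate_column]) if v == predicate_value
        ]
        # Read only the projected columns, and only the matching positions.
        for col in projected_columns:
            out[col].extend(group[col][i] for i in matches)
    return out, groups_read


blocks, groups_read = scan(ROW_GROUPS, "base.city_id", 12, ["base.driver_uuid"])
```

In this toy data the first row group is skipped entirely because 12 is not among its city IDs, and `base.status` is never read because the query does not project it.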
  • 31. Federated SQL Layer Vision: HDFS, Vertica, Elasticsearch, Apache Pinot, MySQL; Machines, Reports, Users; Presto, RealTime Presto, Proxy layer, Management, Universal Metadata Service
  • 32. Focus areas Connectors ● Apache Hive, Apache Pinot, Elasticsearch, Apache Cassandra, Vertica, MySQL, etc ● Aggregation / Join pushdown ● Cross-connector optimizations (hybrid connectors) Real-time ● Real-time mode with low latency pass through ● Query plan / result / data cache ● Time-series joins and stitching Universal Metadata Service (UMS) ● Logical definitions / physical schemas ● Column stitching and joins ● Table and partition caching
  • 33. Thank you Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed. All recipients of this document are notified that the information contained herein includes proprietary information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.