Paul Dix
InfluxData – CTO & co-founder
paul@influxdata.com
@pauldix
InfluxDB IOx - a new columnar time series database (update)
API
• InfluxDB 2.x with Line Protocol
• HTTP Query with JSON, CSV, Print
• Arrow Flight
• Move over to gRPC for management (and CLI)
– Create Databases
– Start defining replication/sharding
– Readme for gRPCurl
• gRPC Health
CLI & Conļ¬g
ā€¢ Write Line Protocol from File
ā€¢ Create Database
ā€¢ Object Store parameters
Query
• Queries now work across the Mutable Buffer & Read Buffer
• DataFusion (features)
• Massive infusion of Postgres string functions (lpad, rpad, ascii, chr, ltrim, etc.)
• Support for EXTRACT (e.g. `EXTRACT(hour FROM date_col)`)
• DataFusion (performance)
• Optimized function implementations for scalar values and columns
• Improved join indices, support for more advanced statistics, expression rewriting
Path to OSS Builds
• Not until we think it's useful/interesting to test
• Dogfood our monitoring
1. In-memory 2.4M values/sec
2. Basic proxied/distributed query
3. Mutable Buffer to Read Buffer lifecycle (basic)
4. WAL Buffering/persistence
5. Subscriptions
6. Parquet Persistence
7. Recovery
• Single Server Steady State
• CLI for configuration
• Documentation
Introduction to DataFusion
An Embeddable Query Engine
Written in Rust
CC BY-SA
Today: IOx Team at InfluxData
Past life 1: Query Optimizer @ Vertica, also on Oracle DB server
Past life 2: Chief Architect + VP Engineering roles at some ML startups
Talk Outline
What is a Query Engine
Introduction to DataFusion / Apache Arrow
DataFusion Architectural Overview
Motivation
Data is stored somewhere
Users who want to access data without writing a program
Motivation
Users who want to access data without writing a program
UIs (visual and textual)
Data is stored somewhere
Motivation
Users who want to access data without writing a program
UIs (visual and textual)
Data is stored somewhere
Query Engine
SQL is the common interface
DataFusion Use Cases
1. Data engineering / ETL:
a. Construct fast and efficient data pipelines (~ Spark)
2. Data Science
a. Prepare data for ML / other tasks (~ Pandas)
3. Database Systems:
a. E.g. IOx, Ballista, Cloudfuse Buzz, various internal systems
Why DataFusion?
High Performance: Memory (no GC) and Performance, leveraging Rust/Arrow
Easy to Connect: Interoperability with other tools via Arrow, Parquet and Flight
Easy to Embed: Can extend data sources, functions, operators
First Class Rust: High quality Query / SQL Engine entirely in Rust
High Quality: Extensive tests and integration tests with Arrow ecosystems
My goal: DataFusion to be *the* choice for any SQL support in Rust
DBMS vs Query Engine
Database Management Systems (DBMS) are full featured systems:
● Storage system (stores actual data)
● Catalog (stores metadata about what is in the storage system)
● Query Engine (query, and retrieve requested data)
● Access Control and Authorization (users, groups, permissions)
● Resource Management (divide resources between uses)
● Administration utilities (monitor resource usage, set policies, etc)
● Clients for Network connectivity (e.g. implement JDBC, ODBC, etc)
● Multi-node coordination and management
DataFusion covers the Query Engine piece, not the rest of a DBMS.
What is DataFusion?
"DataFusion is an in-memory query engine that uses Apache Arrow as the memory model" - crates.io
● In the Apache Arrow GitHub repo
● Apache licensed
● Not part of the Arrow spec, uses Arrow
● Initially implemented and donated by Andy Grove; design based on How Query Engines Work
DataFusion + Arrow + Parquet
arrow
datafusion
parquet
arrow-flight
DataFusion Extensibility
● User Defined Functions
● User Defined Aggregates
● User Defined Optimizer passes
● User Defined LogicalPlan nodes
● User Defined ExecutionPlan nodes
● User Defined TableProvider for tables
* Built-in data persistence using Parquet and CSV files
What is a Query Engine?
1. Frontend
a. Query Language + Parser
2. Intermediate Query Representation
a. Expression / Type system
b. Query Plan w/ Relational Operators (Data Flow Graph)
c. Rewrites / Optimizations on that graph
3. Concrete Execution Operators
a. Allocate resources (CPU, Memory, etc)
b. Push bytes around, vectorized calculations, etc
DataFusion is a Query Engine!
1. Frontend: SQLStatement (a Rust struct)
2. Intermediate Query Representation: LogicalPlan / Expr
3. Concrete Execution Operators: ExecutionPlan, producing RecordBatches
DataFusion Input / Output Diagram
Input is either a SQL Query:
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
OR a DataFrame:
ctx.read_table("http")?
  .filter(...)?
  .aggregate(..)?;
plus Catalog information: tables, schemas, etc.
Output (either way): RecordBatches
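A minimal sketch of these two input paths in Rust, using the DataFusion API roughly as it looked at the time of this talk (ExecutionContext, synchronous sql(); a tokio runtime and a local Parquet file are assumed, and exact names/signatures vary by release):

use datafusion::logical_plan::count;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    // Catalog information: register a Parquet file as a table (the path is hypothetical).
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;

    // Input path 1: SQL query text in, RecordBatches out.
    let sql_results = ctx
        .sql("SELECT status, COUNT(1) FROM http_api_requests_total \
              WHERE path = '/api/v2/write' GROUP BY status")?
        .collect()
        .await?;

    // Input path 2: DataFrame (builder-style) API in, RecordBatches out.
    let df_results = ctx
        .table("http_api_requests_total")?
        .filter(col("path").eq(lit("/api/v2/write")))?
        .aggregate(vec![col("status")], vec![count(lit(1))])?
        .collect()
        .await?;

    println!("{} / {} batches", sql_results.len(), df_results.len());
    Ok(())
}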
DataFusion in Action
DataFusion CLI
> CREATE EXTERNAL TABLE
http_api_requests_total
STORED AS PARQUET
LOCATION
'http_api_requests_total.parquet';
> SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
+--------+-----------------+
| status | COUNT(UInt8(1)) |
+--------+-----------------+
| 4XX    | 73621           |
| 2XX    | 338304          |
+--------+-----------------+
EXPLAIN Plan
Gets a textual representation of a LogicalPlan
> explain SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+--------------+----------------------------------------------------------+
| plan_type    | plan                                                     |
+--------------+----------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] |
|              |   Selection: #path Eq Utf8("/api/v2/write")              |
|              |     TableScan: http_api_requests_total projection=None   |
+--------------+----------------------------------------------------------+
Plans as DataFlow graphs
Data flows up from the leaves to the root of the tree:
Step 1: Parquet file is read (TableScan: http_api_requests_total projection=None)
Step 2: Predicate is applied (Filter: #path Eq Utf8("/api/v2/write"))
Step 3: Data is aggregated (Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]])
More than initially meets the eye
Use EXPLAIN VERBOSE to see optimizations applied
> EXPLAIN VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total
WHERE path = '/api/v2/write' GROUP BY status;
+----------------------+----------------------------------------------------------------+
| plan_type            | plan                                                           |
+----------------------+----------------------------------------------------------------+
| logical_plan         | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      |   Selection: #path Eq Utf8("/api/v2/write")                    |
|                      |     TableScan: http_api_requests_total projection=None         |
| projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      |   Selection: #path Eq Utf8("/api/v2/write")                    |
|                      |     TableScan: http_api_requests_total projection=Some([6, 8]) |
| type_coercion        | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      |   Selection: #path Eq Utf8("/api/v2/write")                    |
|                      |     TableScan: http_api_requests_total projection=Some([6, 8]) |
| ...                  |                                                                |
+----------------------+----------------------------------------------------------------+
The optimizer "pushed" down the projection, so only the status and path columns were read from the Parquet file.
Data Representation
Array + Record Batches + Schema
+--------+--------+
| status | COUNT |
+--------+--------+
| 4XX | 73621 |
| 2XX | 338304 |
| 5XX | 42 |
| 1XX | 3 |
+--------+--------+
In the diagram, this table is split across two RecordBatches that share one Schema:
RecordBatch 1: cols: StringArray ["4XX", "2XX", "5XX"], UInt64Array [73621, 338304, 42]
RecordBatch 2: cols: StringArray ["1XX"], UInt64Array [3]
Schema:
fields[0]: "status", Utf8
fields[1]: "COUNT()", UInt64
Each RecordBatch holds its columns (cols) plus a reference to the shared schema.
* StringArray representation is somewhat misleading as it actually has a fixed-length portion and the character data in different locations
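For concreteness, a small sketch (assuming the arrow crate, roughly as of arrow 4.x) of building the first RecordBatch above from a Schema and two column arrays:

use std::sync::Arc;
use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn status_counts() -> arrow::error::Result<RecordBatch> {
    // Schema: fields[0] "status" Utf8, fields[1] "COUNT()" UInt64
    let schema = Arc::new(Schema::new(vec![
        Field::new("status", DataType::Utf8, false),
        Field::new("COUNT()", DataType::UInt64, false),
    ]));

    // Columns must match the schema in order, type, and row count.
    let status: ArrayRef = Arc::new(StringArray::from(vec!["4XX", "2XX", "5XX"]));
    let count: ArrayRef = Arc::new(UInt64Array::from(vec![73621_u64, 338304, 42]));

    RecordBatch::try_new(schema, vec![status, count])
}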
Query Planning
DataFusion Planning Flow
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
SQL Query → (Parsing/Planning) → LogicalPlan → (Optimization) → ExecutionPlan → (Execution) → RecordBatches
Other names for these plans: "Query Plan" (PG: "Query Tree"), "Access Plan", "Operator Tree" (PG: "Plan Tree")
DataFusion Logical Plan Creation
ā— Declarative: Describe WHAT you want; system figures out HOW
ā—‹ Input: ā€œSQLā€ text (postgres dialect)
ā— Procedural Describe HOW directly
ā—‹ Input is a program to build up the plan
ā—‹ Two options:
ā–  Use a LogicalPlanBuilder, Rust style builder
ā–  DataFrame - model popularized by Pandas and Spark
SQL → LogicalPlan
SQL Parser
SQL Query
SELECT status, COUNT(1)
FROM http_api_requests_total
WHERE path = '/api/v2/write'
GROUP BY status;
Planner
Query {
ctes: [],
body: Select(
Select {
distinct: false,
top: None,
projection: [
UnnamedExpr(
Identifier(
Ident {
value: "status",
quote_style: None,
},
),
),
...
Parsed Statement
LogicalPlan
"DataFrame" → Logical Plan
Rust Code
let df = ctx
.read_table("http_api_requests_total")?
.filter(col("path").eq(lit("/api/v2/write")))?
.aggregate(vec![col("status")], vec![count(lit(1))])?;
DataFrame
(Builder)
LogicalPlan
Supported Logical Plan operators (source link)
Projection
Filter
Aggregate
Sort
Join
Repartition
TableScan
EmptyRelation
Limit
CreateExternalTable
Explain
Extension
Query Optimization Overview
Compute the same (correct) result, only faster
The "Optimizer" is a sequence of passes:
LogicalPlan (input) → Optimizer Pass 1 → LogicalPlan (intermediate) → Optimizer Pass 2 → … → Other Passes → LogicalPlan (output)
Built in DataFusion Optimizer Passes (source link)
ProjectionPushDown: Minimize the number of columns passed from node to node, to minimize intermediate result size (number of columns)
FilterPushdown ("predicate pushdown"): Push filters as close to scans as possible, to minimize intermediate result size
HashBuildProbeOrder ("join reordering"): Order joins to minimize the intermediate result size and hash table sizes
ConstantFolding: Partially evaluates expressions at plan time, e.g. ColA && true → ColA
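One way to watch these passes work is to run EXPLAIN VERBOSE from code and print the plan rows. The sketch below uses a hypothetical CSV-backed table t with a boolean column col_a, and DataFusion/arrow APIs roughly as of this talk's vintage; with ConstantFolding, the optimized plan should show the predicate col_a AND true simplified to just col_a:

use arrow::util::pretty::print_batches;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    // Hypothetical table; any registered source with a boolean column would do.
    ctx.register_csv("t", "t.csv", CsvReadOptions::new())?;

    // EXPLAIN VERBOSE returns the plan after each optimizer pass as result rows.
    let batches = ctx
        .sql("EXPLAIN VERBOSE SELECT * FROM t WHERE col_a AND true")?
        .collect()
        .await?;
    // Note: arrow's pretty-print util may sit behind the `prettyprint` feature
    // depending on the arrow version.
    print_batches(&batches)?;
    Ok(())
}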
Expression Evaluation
Expression Evaluation
Arrow Compute Kernels typically operate on 1 or 2 arrays and/or scalars.
Partial list of included comparison kernels:
eq Perform left == right operation on two arrays.
eq_scalar Perform left == right operation on an array and a scalar value.
eq_utf8 Perform left == right operation on StringArray / LargeStringArray.
eq_utf8_scalar Perform left == right operation on StringArray / LargeStringArray and a scalar.
and Performs AND operation on two arrays. If either left or right value is null then the result is also null.
is_not_null Returns a non-null BooleanArray with whether each value of the array is not null.
or Performs OR operation on two arrays. If either left or right value is null then the result is also null.
...
Exprs for evaluating arbitrary expressions
path = '/api/v2/write' OR path IS NULL
is represented as an expression tree:
BinaryExpr op: Or
  left: BinaryExpr op: Eq
    left: Column path
    right: Literal ScalarValue::Utf8 '/api/v2/write'
  right: IsNull
    Column path
Expression Builder API:
col("path")
  .eq(lit("/api/v2/write"))
  .or(col("path").is_null())
Expr Vectorized Evaluation
The expression tree is evaluated bottom-up against a RecordBatch, producing one Arrow array (or scalar) per node:
1. Column path evaluates to the column's StringArray:
   /api/v2/write, /api/v1/write, /api/v2/read, /api/v2/write, …, /api/v2/write, /foo/bar
2. Literal '/api/v2/write' evaluates to ScalarValue::Utf8(Some("/api/v2/write"))
3. BinaryExpr op: Eq calls the eq_utf8_scalar kernel on the StringArray and the scalar, producing a BooleanArray:
   True, False, False, True, …, True, False
4. IsNull evaluates Column path again and produces a BooleanArray (all False for this data):
   False, False, False, False, …, False, False
5. BinaryExpr op: Or combines the two BooleanArrays into the final BooleanArray:
   True, False, False, True, …, True, False
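Written directly against the Arrow compute kernels, the evaluation above looks roughly like this (module paths and signatures are from around arrow 4.x and may differ in other releases):

use arrow::array::{BooleanArray, StringArray};
use arrow::compute::kernels::boolean::{is_null, or};
use arrow::compute::kernels::comparison::eq_utf8_scalar;

// Evaluate `path = '/api/v2/write' OR path IS NULL` over one column.
fn eval_predicate(path: &StringArray) -> arrow::error::Result<BooleanArray> {
    let eq = eq_utf8_scalar(path, "/api/v2/write")?; // path = '/api/v2/write'
    let null = is_null(path)?;                       // path IS NULL
    or(&eq, &null)                                   // element-wise OR
}

fn main() -> arrow::error::Result<()> {
    let path = StringArray::from(vec![
        "/api/v2/write",
        "/api/v1/write",
        "/api/v2/read",
        "/api/v2/write",
        "/foo/bar",
    ]);
    println!("{:?}", eval_predicate(&path)?);
    Ok(())
}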
Type Coercion
sqrt(col)
sqrt(col) → sqrt(CAST col as Float32)
col is Int8, but sqrt implemented for Float32 or Float64
ā‡’ Type Coercion: adds typecast cast so the implementation can be called
Note: Coercion is lossless; if col was Float64, would not coerce to Float32
Source Code: coercion.rs
Execution Plans
Plan Execution Overview
Typically called the ā€œexecution engineā€ in database systems
DataFusion features:
ā— Async: Mostly avoids blocking I/O
ā— Vectorized: Process RecordBatch at a time, configurable batch size
ā— Eager Pull: Data is produced using a pull model, natural backpressure
ā— Partitioned: each operator produces partitions, in parallel
ā— Multi-Core*
* Uses async tasks; still some unease about this / if we need another thread pool
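A rough end-to-end sketch of the flow described on the following slides, LogicalPlan to ExecutionPlan to RecordBatches, using the DataFusion API of roughly this era (several of these calls have since been renamed or made async), so treat it as illustrative rather than exact:

use datafusion::physical_plan::collect;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;

    // Frontend + planning: SQL text to an (unoptimized) LogicalPlan.
    let logical_plan = ctx
        .sql("SELECT status, COUNT(1) FROM http_api_requests_total \
              WHERE path = '/api/v2/write' GROUP BY status")?
        .to_logical_plan();

    // Optimizer passes, then lowering to a physical ExecutionPlan.
    let optimized = ctx.optimize(&logical_plan)?;
    let physical_plan = ctx.create_physical_plan(&optimized)?;

    // `collect` drives every partition of the ExecutionPlan and gathers
    // the resulting Arrow RecordBatches in memory.
    let results = collect(physical_plan).await?;
    println!("{} record batches", results.len());
    Ok(())
}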
Plan Execution
LogicalPlan → (create_physical_plan) → ExecutionPlan → (execute) → SendableRecordBatchStream (one per partition) → (collect) → RecordBatches
ExecutionPlan nodes allocate resources (buffers, hash tables, files, etc)
execute produces an iterator-style thing that produces Arrow RecordBatches for each partition
create_physical_plan
This LogicalPlan:
Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]
Filter: #path Eq Utf8("/api/v2/write")
TableScan: http_api_requests_total projection=None
becomes this ExecutionPlan (root at the top):
HashAggregateExec (1 partition), AggregateMode::Final, SUM(1), GROUP BY status
MergeExec (1 partition)
HashAggregateExec (2 partitions), AggregateMode::Partial, COUNT(1), GROUP BY status
FilterExec (2 partitions), path = "/api/v2/write"
ParquetExec (2 partitions), files = file1, file2
execute
Calling execute(partition) on each ExecutionPlan node yields one SendableRecordBatchStream per partition:
HashAggregateExec (Final) → GroupHash AggregateStream (execute(0))
MergeExec → MergeStream (execute(0))
HashAggregateExec (Partial) → GroupHash AggregateStream (execute(0), execute(1))
FilterExec → FilterExecStream (execute(0), execute(1))
ParquetExec → "ParquetStream"* for file1 and "ParquetStream"* for file2 (execute(0), execute(1))
* this is actually a channel getting results from a different thread, as the parquet reader is not yet async
A SendableRecordBatchStream is a Rust Stream: an async iterator that produces record batches. Ready to produce values!
Execution of the GroupHash AggregateStream starts eagerly, before next() is called on it:
Step 0: a new task is spawned and starts computing its input immediately
Step 1: data is read from parquet ("ParquetStream" for file1) and returned as a RecordBatch
Step 2: the data is filtered (FilterExecStream)
Step 3: the data is fed into a hash table (GroupHash AggregateStream)
Step 4: hash done, output produced
Step 5: output is requested (next().await)
Step 6: a RecordBatch is returned to the caller
The MergeStream and the final GroupHash AggregateStream work the same way, with back pressure via bounded channels:
Step 0: new tasks are spawned and start computing their input immediately (MergeStream eagerly starts on its own task)
Step 1: output is requested (next().await)
Step 2: eventually a RecordBatch is produced from downstream and returned
Step 3: MergeStream passes on the RecordBatch
Step 4: the data is fed into a hash table
Step 5: hash done, output produced
Step 6: a RecordBatch is returned to the caller
Get Involved
Check out the Apache Arrow project
Join the mailing list (links on project page)
Test out Arrow (crates.io) and DataFusion (crates.io) in your projects
Help out with the docs/code/tickets on GitHub
Thank You!!!!
More Related Content

What's hot

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
Ā 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Julian Hyde
Ā 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxData
Ā 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
Ā 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
Ā 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Databricks
Ā 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
Ā 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
Ā 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
Ā 
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
InfluxData
Ā 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
Ā 
Apache Spark Coreā€”Deep Diveā€”Proper Optimization
Apache Spark Coreā€”Deep Diveā€”Proper OptimizationApache Spark Coreā€”Deep Diveā€”Proper Optimization
Apache Spark Coreā€”Deep Diveā€”Proper Optimization
Databricks
Ā 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
Ā 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
Ā 
Catalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data SetCatalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data Set
InfluxData
Ā 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
Ā 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
Ā 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
Andrew Lamb
Ā 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
Ā 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
Ā 

What's hot (20)

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
Ā 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Ā 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
Ā 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Ā 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Ā 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Ā 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Ā 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Ā 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Ā 
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
Ā 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Ā 
Apache Spark Coreā€”Deep Diveā€”Proper Optimization
Apache Spark Coreā€”Deep Diveā€”Proper OptimizationApache Spark Coreā€”Deep Diveā€”Proper Optimization
Apache Spark Coreā€”Deep Diveā€”Proper Optimization
Ā 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Ā 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Ā 
Catalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data SetCatalogs - Turning a Set of Parquet Files into a Data Set
Catalogs - Turning a Set of Parquet Files into a Data Set
Ā 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Ā 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Ā 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
Ā 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Ā 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Ā 

Similar to InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in Apache Arrow

Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Andrew Lamb
Ā 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
Ā 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
Ā 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
PoguttuezhiniVP
Ā 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
Ā 
Data-and-Compute-Intensive processing Use Case: Lucene Domain Index
Data-and-Compute-Intensive processing Use Case: Lucene Domain IndexData-and-Compute-Intensive processing Use Case: Lucene Domain Index
Data-and-Compute-Intensive processing Use Case: Lucene Domain Index
Marcelo Ochoa
Ā 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
Microsoft Tech Community
Ā 
Oracle DBA Training in Hyderabad
Oracle DBA Training in HyderabadOracle DBA Training in Hyderabad
Oracle DBA Training in Hyderabad
Ugs8008
Ā 
Orcale dba training
Orcale dba trainingOrcale dba training
Orcale dba training
Ugs8008
Ā 
Oracle DBA Training in Hyderabad
Oracle DBA Training in HyderabadOracle DBA Training in Hyderabad
Oracle DBA Training in Hyderabad
united global soft
Ā 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Databricks
Ā 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
jaxLondonConference
Ā 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)
maclean liu
Ā 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
Ā 
Oracle DBA Online Trainingin India
Oracle DBA Online Trainingin IndiaOracle DBA Online Trainingin India
Oracle DBA Online Trainingin India
united global soft
Ā 
Oracle DBA Training in Hyderabad
Oracle DBA Training in HyderabadOracle DBA Training in Hyderabad
Oracle DBA Training in Hyderabad
united global soft
Ā 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
Ā 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
Prashant Gupta
Ā 
Orcale dba Online training India
Orcale dba Online training IndiaOrcale dba Online training India
Orcale dba Online training India
united global soft
Ā 
Orcale dba training
Orcale dba trainingOrcale dba training
Orcale dba training
united global soft
Ā 

Similar to InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in Apache Arrow (20)

Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Ā 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Ā 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Ā 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
Ā 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Ā 
Data-and-Compute-Intensive processing Use Case: Lucene Domain Index
Data-and-Compute-Intensive processing Use Case: Lucene Domain IndexData-and-Compute-Intensive processing Use Case: Lucene Domain Index
Data-and-Compute-Intensive processing Use Case: Lucene Domain Index
Ā 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
Ā 
Oracle DBA Training in Hyderabad
Oracle DBA Training in HyderabadOracle DBA Training in Hyderabad
Oracle DBA Training in Hyderabad
Ā 
Orcale dba training
Orcale dba trainingOrcale dba training
Orcale dba training
Ā 
Oracle DBA Training in Hyderabad
Oracle DBA Training in HyderabadOracle DBA Training in Hyderabad
Oracle DBA Training in Hyderabad
Ā 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Ā 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
Ā 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)
Ā 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Ā 
Oracle DBA Online Trainingin India
Oracle DBA Online Trainingin IndiaOracle DBA Online Trainingin India
Oracle DBA Online Trainingin India
Ā 
Oracle DBA Training in Hyderabad
Oracle DBA Training in HyderabadOracle DBA Training in Hyderabad
Oracle DBA Training in Hyderabad
Ā 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Ā 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
Ā 
Orcale dba Online training India
Orcale dba Online training IndiaOrcale dba Online training India
Orcale dba Online training India
Ā 
Orcale dba training
Orcale dba trainingOrcale dba training
Orcale dba training
Ā 

More from InfluxData

Announcing InfluxDB Clustered
Announcing InfluxDB ClusteredAnnouncing InfluxDB Clustered
Announcing InfluxDB Clustered
InfluxData
Ā 
Best Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow EcosystemBest Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow Ecosystem
InfluxData
Ā 
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
InfluxData
Ā 
Power Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDBPower Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDB
InfluxData
Ā 
How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
InfluxData
Ā 
Build an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING StackBuild an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING Stack
InfluxData
Ā 
Meet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using RustMeet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using Rust
InfluxData
Ā 
Introducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud DedicatedIntroducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud Dedicated
InfluxData
Ā 
Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB
InfluxData
Ā 
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
InfluxData
Ā 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
InfluxData
Ā 
Introducing InfluxDBā€™s New Time Series Database Storage Engine
Introducing InfluxDBā€™s New Time Series Database Storage EngineIntroducing InfluxDBā€™s New Time Series Database Storage Engine
Introducing InfluxDBā€™s New Time Series Database Storage Engine
InfluxData
Ā 
Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena
InfluxData
Ā 
Understanding InfluxDBā€™s New Storage Engine
Understanding InfluxDBā€™s New Storage EngineUnderstanding InfluxDBā€™s New Storage Engine
Understanding InfluxDBā€™s New Storage Engine
InfluxData
Ā 
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBStreamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
InfluxData
Ā 
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
InfluxData
Ā 
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
InfluxData
Ā 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
InfluxData
Ā 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
InfluxData
Ā 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
InfluxData
Ā 

More from InfluxData (20)

Announcing InfluxDB Clustered
Announcing InfluxDB ClusteredAnnouncing InfluxDB Clustered
Announcing InfluxDB Clustered
Ā 
Best Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow EcosystemBest Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow Ecosystem
Ā 
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
Ā 
Power Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDBPower Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDB
Ā 
How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How TerƩga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
Ā 
Build an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING StackBuild an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING Stack
Ā 
Meet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using RustMeet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using Rust
Ā 
Introducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud DedicatedIntroducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud Dedicated
Ā 
Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB
Ā 
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
Ā 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
Ā 
Introducing InfluxDBā€™s New Time Series Database Storage Engine
Introducing InfluxDBā€™s New Time Series Database Storage EngineIntroducing InfluxDBā€™s New Time Series Database Storage Engine
Introducing InfluxDBā€™s New Time Series Database Storage Engine
Ā 
Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena
Ā 
Understanding InfluxDBā€™s New Storage Engine
Understanding InfluxDBā€™s New Storage EngineUnderstanding InfluxDBā€™s New Storage Engine
Understanding InfluxDBā€™s New Storage Engine
Ā 
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBStreamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Ā 
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ā 
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Ā 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Ā 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Ā 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Ā 

Recently uploaded

Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
Ā 
Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...
anilsa9823
Ā 
New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024
ThousandEyes
Ā 
From NCSA to the National Research Platform
From NCSA to the National Research PlatformFrom NCSA to the National Research Platform
From NCSA to the National Research Platform
Larry Smarr
Ā 
Cyber Recovery Wargame
Cyber Recovery WargameCyber Recovery Wargame
Cyber Recovery Wargame
Databarracks
Ā 
Multivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back againMultivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back again
Kieran Kunhya
Ā 
Fuxnet [EN] .pdf
Fuxnet [EN]                                   .pdfFuxnet [EN]                                   .pdf
Fuxnet [EN] .pdf
Overkill Security
Ā 
CTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database MigrationCTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database Migration
ScyllaDB
Ā 
Guidelines for Effective Data Visualization
Guidelines for Effective Data VisualizationGuidelines for Effective Data Visualization
Guidelines for Effective Data Visualization
UmmeSalmaM1
Ā 
Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0
Neeraj Kumar Singh
Ā 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
ThousandEyes
Ā 
Getting Started Using the National Research Platform
Getting Started Using the National Research PlatformGetting Started Using the National Research Platform
Getting Started Using the National Research Platform
Larry Smarr
Ā 
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes
Ā 
Supplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdfSupplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdf
gaydlc2513
Ā 
Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...
Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...
Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...
manji sharman06
Ā 
Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2
DianaGray10
Ā 
Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0
Neeraj Kumar Singh
Ā 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
Ā 
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
ScyllaDB
Ā 
intra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_Enintra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_En
NTTDATA INTRAMART
Ā 

Recently uploaded (20)

Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
Ā 
Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ā˜Žļø +91-7426014248 šŸ˜ Chennai Call Girl Beauty Girls Chennai...
Ā 
New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024New ThousandEyes Product Features and Release Highlights: June 2024
New ThousandEyes Product Features and Release Highlights: June 2024
Ā 
From NCSA to the National Research Platform
From NCSA to the National Research PlatformFrom NCSA to the National Research Platform
From NCSA to the National Research Platform
Ā 
Cyber Recovery Wargame
Cyber Recovery WargameCyber Recovery Wargame
Cyber Recovery Wargame
Ā 
Multivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back againMultivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back again
Ā 
Fuxnet [EN] .pdf
Fuxnet [EN]                                   .pdfFuxnet [EN]                                   .pdf
Fuxnet [EN] .pdf
Ā 
CTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database MigrationCTO Insights: Steering a High-Stakes Database Migration
CTO Insights: Steering a High-Stakes Database Migration
Ā 
Guidelines for Effective Data Visualization
Guidelines for Effective Data VisualizationGuidelines for Effective Data Visualization
Guidelines for Effective Data Visualization
Ā 
Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0Chapter 1 - Fundamentals of Testing V4.0
Chapter 1 - Fundamentals of Testing V4.0
Ā 
Introduction to ThousandEyes AMER Webinar
Introduction  to ThousandEyes AMER WebinarIntroduction  to ThousandEyes AMER Webinar
Introduction to ThousandEyes AMER Webinar
Ā 
Getting Started Using the National Research Platform
Getting Started Using the National Research PlatformGetting Started Using the National Research Platform
Getting Started Using the National Research Platform
Ā 
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024
Ā 
Supplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdfSupplier Sourcing Presentation - Gay De La Cruz.pdf
Supplier Sourcing Presentation - Gay De La Cruz.pdf
Ā 
Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...
Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...
Call Girls ChandigarhšŸ”„7023059433šŸ”„Agency Profile Escorts in Chandigarh Availab...
Ā 
Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2
Ā 
Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0Chapter 5 - Managing Test Activities V4.0
Chapter 5 - Managing Test Activities V4.0
Ā 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ā 
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Ā 
intra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_Enintra-mart Accel series 2024 Spring updates_En
intra-mart Accel series 2024 Spring updates_En
Ā 

InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in Apache Arrow

  • 1. Paul Dix Inļ¬‚uxData ā€“ CTO & co-founder paul@inļ¬‚uxdata.com @pauldix Inļ¬‚uxDB IOx - a new columnar time series database (update)
  • 2. Ā© 2021 Inļ¬‚uxData. All rights reserved. 2 API ā€¢ Inļ¬‚uxDB 2.x with Line Protocol ā€¢ HTTP Query with JSON, CSV, Print ā€¢ Arrow Flight ā€¢ Move over to gRCP for management (and CLI) ā€“ Create Databases ā€“ Start to deļ¬ning replication/sharding ā€“ Readme for gRPCurl ā€¢ gRCP Health
  • 3. Ā© 2021 Inļ¬‚uxData. All rights reserved. 3 CLI & Conļ¬g ā€¢ Write Line Protocol from File ā€¢ Create Database ā€¢ Object Store parameters
  • 4. Ā© 2021 Inļ¬‚uxData. All rights reserved. 4 Query ā€¢ Queries now work across Mutable Buļ¬€er & Read Buļ¬€er ā€¢ Data Fusion (features) ā€¢ Massive infusion of postgres string functions (lpad, rpad, ascii, chr, ltrim, etc) ā€¢ Support for EXTRACT (e.g. `EXTRACT hour from date_col`) ā€¢ Data Fusion (performance) ā€¢ Optimized function implementation for scalar values and columns ā€¢ improved join indicies, support for more advanced statistics, expression rewriting
  • 5. Ā© 2021 Inļ¬‚uxData. All rights reserved. 5 Path to OSS Builds ā€¢ Not until we think itā€™s useful/interesting to test ā€¢ Dogfood our monitoring 1. In-memory 2.4M values/sec 2. Basic proxied/distributed query 3. Mutable Buļ¬€er to Read Buļ¬€er lifecycle (basic) 4. WAL Buļ¬€ering/persistence 5. Subscriptions 6. Parquet Persistence 7. Recovery ā€¢ Single Server Steady State ā€¢ CLI for conļ¬guration ā€¢ Documentation
  • 6. Introduction to DataFusion An Embeddable Query Engine Written in Rust CC BY-SA
  • 7. Today: IOx Team at InfluxData Past life 1: Query Optimizer @ Vertica, also on Oracle DB server Past life 2: Chief Architect + VP Engineering roles at some ML startups
  • 8. Talk Outline What is a Query Engine Introduction to DataFusion / Apache Arrow DataFusion Architectural Overview
  • 9. Motivation Data is stored somewhere Users who want to access data without writing a program
  • 10. Motivation Users who want to access data without writing a program UIs (visual and textual) Data is stored somewhere
  • 11. Motivation Users who want to access data without writing a program UIs (visual and textual) Data is stored somewhere Query Engine SQL is the common interface
  • 12. DataFusion Use Cases 1. Data engineering / ETL: a. Construct fast and efficient data pipelines (~ Spark) 2. Data Science a. Prepare data for ML / other tasks (~ Pandas) 3. Database Systems: a. E.g. IOx, Ballista, Cloudfuse Buzz, various internal systems
  • 13. Why DataFusion? High Performance: Memory (no GC) and Performance, leveraging Rust/Arrow Easy to Connect: Interoperability with other tools via Arrow, Parquet and Flight Easy to Embed: Can extend data sources, functions, operators First Class Rust: High quality Query / SQL Engine entirely in Rust High Quality: Extensive tests and integration tests with Arrow ecosystems My goal: DataFusion to be *the* choice for any SQL support in Rust
  • 14. DBMS vs Query Engine. Database Management Systems (DBMS) are full featured systems: ● Storage system (stores actual data) ● Catalog (store metadata about what is in the storage system) ● Query Engine (query, and retrieve requested data) ● Access Control and Authorization (users, groups, permissions) ● Resource Management (divide resources between uses) ● Administration utilities (monitor resource usage, set policies, etc) ● Clients for Network connectivity (e.g. implement JDBC, ODBC, etc) ● Multi-node coordination and management. DataFusion is just the Query Engine piece.
  • 15. What is DataFusion? "DataFusion is an in-memory query engine that uses Apache Arrow as the memory model" - crates.io ● In Apache Arrow github repo ● Apache licensed ● Not part of the Arrow spec, uses Arrow ● Initially implemented and donated by Andy Grove; design based on How Query Engines Work
  • 16. DataFusion + Arrow + Parquet (Rust crates: arrow, datafusion, parquet, arrow-flight)
  • 17. DataFusion Extensibility 🧰 ● User Defined Functions ● User Defined Aggregates ● User Defined Optimizer passes ● User Defined LogicalPlan nodes ● User Defined ExecutionPlan nodes ● User Defined TableProvider for tables * Built in data persistence using parquet and CSV files
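
To make the extensibility bullets concrete, below is a rough sketch of registering a scalar UDF, written against the 2021-era DataFusion API (ExecutionContext, make_scalar_function, create_udf, register_udf); the exact signatures have shifted across releases, and the function name pow2 is just an invented example:

use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array};
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::physical_plan::functions::make_scalar_function;
use datafusion::prelude::*;

fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();

    // The UDF body works on whole Arrow arrays, one batch at a time
    let pow2 = make_scalar_function(|args: &[ArrayRef]| {
        let input = args[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("pow2 expects a Float64 column");
        // Square every value, preserving nulls
        let squared: Float64Array = input.iter().map(|v| v.map(|x| x * x)).collect();
        Ok(Arc::new(squared) as ArrayRef)
    });

    // Name, argument types and return type, then register with the context
    let udf = create_udf("pow2", vec![DataType::Float64], Arc::new(DataType::Float64), pow2);
    ctx.register_udf(udf);

    // From here on, pow2 is callable from SQL and from the DataFrame API,
    // e.g. ctx.sql("SELECT pow2(speed) FROM my_table")
    Ok(())
}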
  • 18. What is a Query Engine? 1. Frontend a. Query Language + Parser 2. Intermediate Query Representation a. Expression / Type system b. Query Plan w/ Relational Operators (Data Flow Graph) c. Rewrites / Optimizations on that graph 3. Concrete Execution Operators a. Allocate resources (CPU, Memory, etc) b. Push bytes around, vectorized calculations, etc
  • 19. DataFusion is a Query Engine! Each layer maps onto a Rust struct: 1. Frontend → SQLStatement; 2. Intermediate Query Representation → LogicalPlan and Expr; 3. Concrete Execution Operators → ExecutionPlan, producing RecordBatches.
  • 20. DataFusion Input / Output Diagram. Input is either a SQL query (SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;) OR a DataFrame (ctx.read_table("http")? .filter(...)? .aggregate(..)?;), plus catalog information (tables, schemas, etc); the output in both cases is RecordBatches.
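
As a sketch of those two input paths, assuming the 2021-era DataFusion API (ExecutionContext, register_parquet, table, and count/col/lit re-exported from the prelude; later releases rename and async-ify several of these calls), with the slide's table and file names reused:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // Catalog information: register a Parquet file as a named table
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;

    // Input path 1: a SQL string
    let sql_results = ctx
        .sql("SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status")?
        .collect()
        .await?;

    // Input path 2: the DataFrame API, building the same logical plan programmatically
    let df_results = ctx
        .table("http_api_requests_total")?
        .filter(col("path").eq(lit("/api/v2/write")))?
        .aggregate(vec![col("status")], vec![count(lit(1))])?
        .collect()
        .await?;

    // Both paths produce Vec<RecordBatch>
    println!("{} and {} batches", sql_results.len(), df_results.len());
    Ok(())
}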
  • 22. DataFusion CLI
> CREATE EXTERNAL TABLE http_api_requests_total STORED AS PARQUET LOCATION 'http_api_requests_total.parquet';
> SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;
+--------+-----------------+
| status | COUNT(UInt8(1)) |
+--------+-----------------+
| 4XX    | 73621           |
| 2XX    | 338304          |
+--------+-----------------+
  • 23. EXPLAIN Plan: gets a textual representation of the LogicalPlan
> explain SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;
+--------------+----------------------------------------------------------+
| plan_type    | plan                                                     |
+--------------+----------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]] |
|              | Selection: #path Eq Utf8("/api/v2/write")                |
|              | TableScan: http_api_requests_total projection=None      |
+--------------+----------------------------------------------------------+
  • 24. Plans as DataFlow graphs. Data flows up from the leaves to the root of the tree:
Step 1: Parquet file is read – TableScan: http_api_requests_total projection=None
Step 2: Predicate is applied – Filter: #path Eq Utf8("/api/v2/write")
Step 3: Data is aggregated – Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]
  • 25. More than initially meets the eye: use EXPLAIN VERBOSE to see the optimizations applied.
> EXPLAIN VERBOSE SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;
+----------------------+----------------------------------------------------------------+
| plan_type            | plan                                                           |
+----------------------+----------------------------------------------------------------+
| logical_plan         | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      | Selection: #path Eq Utf8("/api/v2/write")                      |
|                      | TableScan: http_api_requests_total projection=None             |
| projection_push_down | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      | Selection: #path Eq Utf8("/api/v2/write")                      |
|                      | TableScan: http_api_requests_total projection=Some([6, 8])     |
| type_coercion        | Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]       |
|                      | Selection: #path Eq Utf8("/api/v2/write")                      |
|                      | TableScan: http_api_requests_total projection=Some([6, 8])     |
...
+----------------------+----------------------------------------------------------------+
The optimizer "pushed" down the projection so only the status and path columns were read from the parquet file.
  • 27. Arrays + RecordBatches + Schema
+--------+--------+
| status | COUNT  |
+--------+--------+
| 4XX    | 73621  |
| 2XX    | 338304 |
| 5XX    | 42     |
| 1XX    | 3      |
+--------+--------+
This result is held as RecordBatches. Each RecordBatch pairs a Schema (fields[0]: "status", Utf8; fields[1]: "COUNT()", UInt64) with equal-length columns: one batch holds a StringArray [4XX, 2XX, 5XX] alongside a UInt64Array [73621, 338304, 42], and a second batch holds a StringArray [1XX] alongside a UInt64Array [3]. * StringArray representation is somewhat misleading as it actually has a fixed length portion and the character data in different locations.
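
To make the Arrow layout concrete, here is a minimal sketch that builds the same data with the arrow crate (the field name count and putting all four rows into a single RecordBatch are simplifications of the slide's two-batch picture):

use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // Schema: field names and types, shared by every batch of this table
    let schema = Arc::new(Schema::new(vec![
        Field::new("status", DataType::Utf8, false),
        Field::new("count", DataType::UInt64, false),
    ]));

    // Columns are typed Arrow arrays of equal length
    let status: ArrayRef = Arc::new(StringArray::from(vec!["4XX", "2XX", "5XX", "1XX"]));
    let count: ArrayRef = Arc::new(UInt64Array::from(vec![73621_u64, 338304, 42, 3]));

    // A RecordBatch ties the schema to one chunk of columns
    let batch = RecordBatch::try_new(schema, vec![status, count])?;
    assert_eq!(batch.num_rows(), 4);
    assert_eq!(batch.num_columns(), 2);
    Ok(())
}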
  • 29. DataFusion Planning Flow. A SQL query (e.g. SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;) goes through Parsing/Planning into a LogicalPlan (a.k.a. "Query Plan"; PG: "Query Tree"), then Optimization, then becomes an ExecutionPlan (a.k.a. "Access Plan" or "Operator Tree"; PG: "Plan Tree"), and Execution produces RecordBatches.
  • 30. DataFusion Logical Plan Creation ● Declarative: describe WHAT you want; the system figures out HOW ○ Input: "SQL" text (postgres dialect) ● Procedural: describe HOW directly ○ Input is a program to build up the plan ○ Two options: ■ Use a LogicalPlanBuilder, Rust style builder ■ DataFrame - model popularized by Pandas and Spark
  • 31. SQL → LogicalPlan. The SQL Parser turns the SQL query (SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status;) into a Parsed Statement, and the Planner turns that into a LogicalPlan. Parsed Statement excerpt: Query { ctes: [], body: Select( Select { distinct: false, top: None, projection: [ UnnamedExpr( Identifier( Ident { value: "status", quote_style: None, }, ), ), ...
  • 32. "DataFrame" → Logical Plan. Rust Code (the DataFrame is a builder for a LogicalPlan): let df = ctx .read_table("http_api_requests_total")? .filter(col("path").eq(lit("/api/v2/write")))? .aggregate(vec![col("status")], vec![count(lit(1))])?;
  • 33. Supported Logical Plan operators (source link) Projection Filter Aggregate Sort Join Repartition TableScan EmptyRelation Limit CreateExternalTable Explain Extension
  • 34. Query Optimization Overview: compute the same (correct) result, only faster. The "Optimizer" takes a LogicalPlan (input) through Optimizer Pass 1 → LogicalPlan (intermediate) → Optimizer Pass 2 → … Other Passes ... → LogicalPlan (output).
  • 35. Built in DataFusion Optimizer Passes (source link). ProjectionPushDown: minimize the number of columns passed from node to node to minimize intermediate result size (number of columns). FilterPushdown ("predicate pushdown"): push filters as close to scans as possible to minimize intermediate result size. HashBuildProbeOrder ("join reordering"): order joins to minimize the intermediate result size and hash table sizes. ConstantFolding: partially evaluates expressions at plan time, e.g. ColA && true → ColA
  • 37. Expression Evaluation. Arrow Compute Kernels typically operate on 1 or 2 arrays and/or scalars. Partial list of included comparison kernels:
eq: Perform left == right operation on two arrays.
eq_scalar: Perform left == right operation on an array and a scalar value.
eq_utf8: Perform left == right operation on StringArray / LargeStringArray.
eq_utf8_scalar: Perform left == right operation on StringArray / LargeStringArray and a scalar.
and: Performs AND operation on two arrays. If either left or right value is null then the result is also null.
is_not_null: Returns a non-null BooleanArray with whether each value of the array is not null.
or: Performs OR operation on two arrays. If either left or right value is null then the result is also null.
...
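
For illustration, a small sketch calling one of those kernels directly from the arrow crate (eq_utf8_scalar is the 2021-era name; newer arrow releases replace these typed kernels with generic comparison functions):

use arrow::array::{BooleanArray, StringArray};
use arrow::compute::kernels::comparison::eq_utf8_scalar;

fn main() -> arrow::error::Result<()> {
    let paths = StringArray::from(vec!["/api/v2/write", "/api/v1/write", "/api/v2/read"]);
    // A single vectorized kernel call compares every element against the scalar
    let matches: BooleanArray = eq_utf8_scalar(&paths, "/api/v2/write")?;
    assert!(matches.value(0));
    assert!(!matches.value(1));
    Ok(())
}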
  • 38. Exprs for evaluating arbitrary expressions, e.g. path = '/api/v2/write' OR path IS NULL. The Expr tree is a BinaryExpr op: Or whose left side is a BinaryExpr op: Eq (left: Column path, right: Literal ScalarValue::Utf8 '/api/v2/write') and whose right side is IsNull over Column path. Expression Builder API: col("path").eq(lit("/api/v2/write")).or(col("path").is_null())
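
The same expression as a self-contained snippet, using the col/lit helpers and Expr builder methods shown on the slide (this only constructs and prints the Expr tree, it does not evaluate it):

use datafusion::prelude::*; // brings in col(), lit() and the Expr builder methods

fn main() {
    // path = '/api/v2/write' OR path IS NULL
    let expr = col("path")
        .eq(lit("/api/v2/write"))
        .or(col("path").is_null());
    // Print the Expr tree that the planner would hand to the physical expression evaluator
    println!("{:?}", expr);
}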
  • 40.–45. Expr Vectorized Evaluation (animation). The Column path node yields the input StringArray (/api/v2/write, /api/v1/write, /api/v2/read, /api/v2/write, …, /api/v2/write, /foo/bar). The BinaryExpr op: Eq node compares that array against the Literal ScalarValue::Utf8( Some( "/api/v2/write" ) ) by calling eq_utf8_scalar, producing a BooleanArray (True, False, False, True, …, True, False). The IsNull node evaluates the same StringArray into another BooleanArray, and the BinaryExpr op: Or node combines the two BooleanArrays into the final result.
  • 47. Type Coercion. sqrt(col): col is Int8, but sqrt is implemented for Float32 or Float64 ⇒ Type Coercion adds a typecast so the implementation can be called: sqrt(col) → sqrt(CAST col as Float32). Note: coercion is lossless; if col was Float64, it would not coerce to Float32. Source Code: coercion.rs
  • 49. Plan Execution Overview. Typically called the "execution engine" in database systems. DataFusion features: ● Async: mostly avoids blocking I/O ● Vectorized: process a RecordBatch at a time, configurable batch size ● Eager Pull: data is produced using a pull model, natural backpressure ● Partitioned: each operator produces partitions, in parallel ● Multi-Core* (* uses async tasks; still some unease about this / if we need another thread pool)
  • 50. Plan Execution. create_physical_plan turns a LogicalPlan into an ExecutionPlan; ExecutionPlan nodes allocate resources (buffers, hash tables, files, etc). execute produces a SendableRecordBatchStream per partition, an iterator-style object that yields Arrow RecordBatches, and collect drains those partition streams into RecordBatches.
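
A hedged sketch of driving that lifecycle by hand, assuming the 2021-era API (ExecutionContext::optimize and ::create_physical_plan plus the datafusion::physical_plan::collect helper; later releases move and async-ify these calls):

use datafusion::error::Result;
use datafusion::physical_plan::collect;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_parquet("http_api_requests_total", "http_api_requests_total.parquet")?;

    // LogicalPlan: produced by the SQL planner (the DataFrame API would work equally well)
    let df = ctx.sql("SELECT status, COUNT(1) FROM http_api_requests_total WHERE path = '/api/v2/write' GROUP BY status")?;
    let logical_plan = ctx.optimize(&df.to_logical_plan())?;

    // create_physical_plan: LogicalPlan -> ExecutionPlan (operators allocate their resources here)
    let physical_plan = ctx.create_physical_plan(&logical_plan)?;

    // execute each partition's stream and gather everything into RecordBatches
    let batches = collect(physical_plan).await?;
    println!("{} record batches", batches.len());
    Ok(())
}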
  • 51. create_physical_plan: LogicalPlan → ExecutionPlan.
LogicalPlan:
Aggregate: groupBy=[[#status]], aggr=[[COUNT(UInt8(1))]]
  Filter: #path Eq Utf8("/api/v2/write")
    TableScan: http_api_requests_total projection=None
ExecutionPlan:
HashAggregateExec (1 partition) AggregateMode::Final SUM(1), GROUP BY status
  MergeExec (1 partition)
    HashAggregateExec (2 partitions) AggregateMode::Partial COUNT(1), GROUP BY status
      FilterExec (2 partitions) path = "/api/v2/write"
        ParquetExec (2 partitions) files = file1, file2
  • 52. execute: ExecutionPlan → SendableRecordBatchStream. Calling execute(0) / execute(1) on each node yields a matching stream per partition: ParquetExec (2 partitions, files = file1, file2) → "ParquetStream"* for file1 and file2; FilterExec (2 partitions, path = "/api/v2/write") → FilterExecStream; HashAggregateExec (2 partitions) AggregateMode::Partial COUNT(1), GROUP BY status → GroupHash AggregateStream; MergeExec → MergeStream; HashAggregateExec (1 partition) AggregateMode::Final SUM(1), GROUP BY status → GroupHash AggregateStream. * "ParquetStream" is actually a channel getting results from a different thread, as the parquet reader is not yet async.
  • 53. next(). A Rust Stream is an async iterator that produces record batches; execution of GroupHash starts eagerly (before next() is called on it). Step 0: a new task is spawned and starts computing its input immediately. Step 1: data is read from parquet and returned ("ParquetStream" for file1). Step 2: data is filtered (FilterExecStream). Step 3: data is fed into a hash table (GroupHash AggregateStream). Step 4: hash done, output produced; ready to produce values! 😅 Step 5: output is requested (next().await on the SendableRecordBatchStream). Step 6: the RecordBatch is returned to the caller.
  • 54. next() through the merge. MergeStream eagerly starts on its own task, with back pressure via bounded channels. Step 0: new tasks are spawned and start computing their input immediately (one per partial GroupHash AggregateStream and one for the merge). Step 1: output is requested (next().await). Step 2: eventually a RecordBatch is produced from downstream and returned. Step 3: Merge passes on the RecordBatch. Step 4: data is fed into a hash table (final GroupHash AggregateStream). Step 5: hash done, output produced. Step 6: returned to caller.
  • 55. Get Involved Check out the project Apache Arrow Join the mailing list (links on project page) Test out Arrow (crates.io) and DataFusion (crates.io) in your projects Help out with the docs/code/tickets on GitHub Thank You!!!!