Superpower Your Apache Kafka Applications Development with Complementary Open Source Technologies

© 2024 NetApp, Inc. All rights reserved.
Kafka Summit, Bangalore 2024
Paul Brebner
Instaclustr Technology Evangelist

Focus on complementary technologies – different to Kafka
“Colours seem more brilliant when they are in contrast with their complementary colours.” (Monet)

Complementary Colours
Matisse, Goldfish: red/green complementary colours (Source: Wikimedia)

Contrasting flowers from the Bengaluru market
Bengaluru market flowers (Paul Brebner)

Complementary Kafka Technologies
Cassandra
PostgreSQL
Superset
Camel
Cadence
OpenTelemetry
TensorFlow
RisingWave
LLMs
Guava EventBus
Kubernetes
Prometheus
Grafana
Parallel Consumer
OpenSearch + Dashboard

C.f. analogous Kafka technologies
• Apache Pulsar, Flink, Storm, Spark Streaming, Beam, ActiveMQ, RocketMQ, StreamPark, RisingWave, etc.
• But we will look at RisingWave
Van Gogh, Sunflowers on Yellow Background (Source: Wikimedia)

Approach
Use Cases
Technologies
Superpowers

0. Apache Kafka®

Apache Kafka®
Postal Delivery Service
Railway Post Office:
Mail bags snatched by speeding train
(Source: Wikimedia CCL)

Apache Kafka visual introduction
My first Kafka talk: Visual introduction to a Kafka postal service

Christmas tree lights simulation
Christmas 2017
My first Kafka demo application
100% Kafka
A simple simulation –
to start with

Use case 1: “Kongo” IoT logistics simulation
• Real-time logistics
• IoT transportation and rules checking
• Complex simulation

Design 1: Pure Kafka, many topics
1000s of locations (warehouses, trucks) and millions of goods
Each location has a topic and multiple consumer groups (all goods at that location)
7,000 TPS → SLOW!
Many topics/partitions (without increasing cluster resources) reduced throughput on older versions of Kafka

1. Guava EventBus

Guava EventBus
Telegram messengers
(Source: Wikimedia CCL)

Design 2: One topic + Guava EventBus for notifications
Single topic, one consumer group
Kafka supplemented with Guava EventBus to handle high fan-out notifications
1.2M TPS → FAST!
Uber’s Cadence can be (and has been) used for scalable notifications
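
To make the Design 2 idea concrete, here is a minimal sketch in Python of fanning out records from one topic to many in-process subscribers. It is not Guava (a Java library) and not the Kongo code. The `EventBus` class, the `dispatch` helper, and the routing by a location string are all assumptions made for illustration; Guava's EventBus routes by event type rather than by a key.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Iterable, List, Tuple


class EventBus:
    """Tiny in-process publish/subscribe bus (hypothetical stand-in, not Guava)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def register(self, location: str, handler: Callable[[Any], None]) -> None:
        # One subscriber per goods item currently at this location.
        self._handlers[location].append(handler)

    def post(self, location: str, event: Any) -> int:
        # Deliver the event to every subscriber at the location; return the fan-out.
        handlers = self._handlers.get(location, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


def dispatch(records: Iterable[Tuple[str, Any]], bus: EventBus) -> int:
    """Fan out records consumed from the single topic; returns total deliveries."""
    return sum(bus.post(location, event) for location, event in records)
```

The point of the sketch is that the fan-out happens in memory inside the consumer, so Kafka only needs one topic and one consumer group regardless of how many goods are listening.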

Use case 2: Anomaly detection at scale
One of these things is not like the others…
(Source: Shutterstock)

Streaming anomaly detection
Incoming Event Stream
Run Anomaly Check – Quickly!
Persist new event
Get previous 50 events for key
Run algorithm
Fast writes → Cassandra
Application scaling → Kubernetes
Initially single threaded consumers
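
The per-event steps above (persist, fetch the previous 50 events for the key, run the algorithm) can be sketched as follows. This is an illustration only: an in-memory dictionary stands in for Cassandra, and a simple z-score test stands in for the detector, because the slides do not name the algorithm. `AnomalyChecker` and its `threshold` parameter are hypothetical names.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW = 50  # previous events kept per key, as on the slide


class AnomalyChecker:
    """In-memory stand-in for the Cassandra-backed check (assumption, not the real app)."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold  # hypothetical z-score cut-off
        self.history = defaultdict(lambda: deque(maxlen=WINDOW))

    def check(self, key: str, value: float) -> bool:
        previous = list(self.history[key])   # get previous (up to 50) events for key
        self.history[key].append(value)      # persist the new event
        if len(previous) < 2:
            return False                     # not enough history to judge
        mu, sigma = mean(previous), pstdev(previous)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold
```

Each check costs one write and one bounded read per key, which is why fast writes and cheap per-key reads matter for the data store.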

2. Apache Cassandra®

Apache Cassandra®
Fast Writes
Office typing pool, 1918
(Source: Wikimedia)

Apache Cassandra®
What?
• NoSQL horizontally scalable key-value database
Superpowers
• Fast writes (lots of typewriters)
• Wide column store
• Good for ML feature stores
• Clustering columns
• Good for hierarchical data modeling (e.g. geospatial)
• In-built multi-DC replication

3. Kubernetes

Kubernetes
Greek Triremes ruled the seas
Captained by Helmsmen (Kubernetes)
(Source: Wikimedia)

Kubernetes
What?
• Automates deployment, scaling, and management of containerized applications
Superpowers
• Available on public clouds (e.g. AWS EKS)
• Ephemeral Pods are the unit of concurrency
• Easy to scale applications with more or fewer Pods

But scalability isn’t great

4. Prometheus
5. Grafana

Prometheus + Grafana
Abacus counting
(Source: Wikimedia)

Prometheus + Grafana
What?
• Prometheus: Monitoring and alerting
• Grafana: Graphing
Superpowers
• Instrumentation or agents (exporters) to expose application metrics
• Time series data with counter, gauge, histogram, and summary metrics
• Instaclustr monitoring API supports Prometheus metrics for Apache Kafka clusters
• Integration of Kafka cluster metrics and Kafka application metrics (e.g. from producers and consumers) is powerful
→ Metrics suggested optimizations

Slow Kafka consumers problem
Slow consumers require more partitions/consumers
(Source: Getty Images)
Little’s Law: Concurrency (Partitions=Consumers) = Time x Throughput
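
A worked example of Little's Law as stated above, in Python. The figures are hypothetical and chosen only to show the arithmetic: at 10,000 events/s and 10 ms of processing time per event, about 100 partitions/consumers are needed; halving the processing time halves that number.

```python
import math


def required_concurrency(throughput_per_s: float, time_s: float) -> int:
    """Little's Law: concurrency = time x throughput, rounded up to whole consumers."""
    return math.ceil(round(throughput_per_s * time_s, 9))


def max_throughput(concurrency: int, time_s: float) -> float:
    """The same law rearranged: throughput = concurrency / time."""
    return concurrency / time_s


# Hypothetical figures, not measurements from the talk.
consumers_needed = required_concurrency(10_000, 0.010)  # 10 ms per event
faster_consumers = required_concurrency(10_000, 0.005)  # 5 ms per event
```

This is why slow consumers force more partitions: for a fixed target throughput, the required concurrency grows linearly with per-event processing time.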

2 pool solution
The famous Bondi Ocean Pool in Sydney Australia has 2 pools

Optimize consumer speed/concurrency using 2 stage pipeline
Fewer consumers (around 100) give higher throughput, a surprise!
Hint: fewer partitions
1. Minimize polling time (thread pool 1)
2. Maximize anomaly detector concurrency (thread pool 2)
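
The two-pool idea can be sketched in Python with two thread pools. This is a simulation under assumptions, not the original application: lists stand in for Kafka partitions, `detect` is a hypothetical detector with a simulated 1 ms latency, and the pool sizes are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def detect(event: int):
    """Stage 2 work: hypothetical slow anomaly check (flags values above 100)."""
    time.sleep(0.001)  # simulated detector latency
    return event, event > 100


def run_pipeline(partitions, pollers: int = 2, detectors: int = 8):
    """Stage 1 threads only hand events over; stage 2 threads do the slow checks."""
    with ThreadPoolExecutor(max_workers=pollers) as pool1, \
            ThreadPoolExecutor(max_workers=detectors) as pool2:

        def poll(partition):
            # Stage 1: spend as little time as possible per event, then go back to polling.
            return [pool2.submit(detect, event) for event in partition]

        futures_per_partition = list(pool1.map(poll, partitions))
        return [f.result() for futures in futures_per_partition for f in futures]
```

Because the polling threads never block on the detector, a small number of consumers can keep a much larger detector pool busy, which is the effect the slide attributes to using fewer partitions.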

19 billion checks/day after tuning

6. Kafka Parallel Consumer

Kafka Parallel Consumer
Jacquard Loom, Berlin
Makes multiple ribbons
concurrently
(Source: Paul Brebner)

Kafka Parallel Consumer: Multi-threaded consumer
• Multiple ordering options (c.f. default Kafka, which only guarantees order within partitions!)
PARTITION → KEY → UNORDERED
Increasing concurrency →
• Concurrency from 1 to lots, depending on client resources and partition/key space sizes
• KEY has higher concurrency than PARTITION and is ordered by key: a reasonable compromise
• Higher concurrency for fewer partitions/consumers
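
Why concurrency increases from PARTITION to KEY to UNORDERED can be shown with a toy model in Python. It does not use the Parallel Consumer library. `min_rounds` is a hypothetical helper that computes an idealized lower bound on processing rounds when every record takes one time unit and records in the same ordering group must run one after another. It is not meant to reproduce measured results.

```python
import math
from collections import Counter


def min_rounds(records, mode: str, threads: int) -> int:
    """Lower bound on rounds for (partition, key) records under an ordering mode."""
    if mode == "PARTITION":
        groups = Counter(partition for partition, _ in records)
    elif mode == "KEY":
        groups = Counter(key for _, key in records)
    elif mode == "UNORDERED":
        groups = Counter(range(len(records)))  # every record is its own group
    else:
        raise ValueError(f"unknown ordering mode: {mode}")
    longest_group = max(groups.values(), default=0)
    # Bounded by the longest sequential group and by the available threads.
    return max(longest_group, math.ceil(len(records) / threads))


# Hypothetical workload: 100 keys spread over 10 partitions, 10 records per key.
records = [(key % 10, key) for key in range(100) for _ in range(10)]
rounds = {mode: min_rounds(records, mode, threads=200)
          for mode in ("PARTITION", "KEY", "UNORDERED")}
```

In this model PARTITION ordering is limited by the number of partitions, KEY ordering by the number of distinct keys, and UNORDERED only by the thread count.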

Experimental results
3, 50, and 200 times improvement, unordered best
1 consumer
10 partitions
100 keys
10ms latency

Use case 3: Pipelines
Berlin “Beer” (?) Pipeline
(Source: Paul Brebner)

Kafka® Connect data pipelines
REST Tidal Data to OpenSearch
REST Tidal Data to PostgreSQL + Superset
Alternative sinks
Kafka Connectors

7. OpenSearch
8. Dashboard

Library of Congress
Card Division 1919
(city block long)
(Source: Wikimedia)

What?
• Open source version of Elasticsearch
• Based on Lucene—powerful and scalable text searching
Superpowers
• Ingestion, indexing, and searching of JSON documents
• Complex linguistic and geospatial queries
• Integrated dashboard for visualization

9. PostgreSQL®

PostgreSQL®
Elephant vs. tree
Elephants are powerful
(Source: Adobe Stock)

PostgreSQL®
What?
• Powerful SQL database
Superpowers
• Extensible
• JSONB+GIN indexes (efficient storage and search of JSON)

10. Apache Superset™

Apache Superset™
Superhero Supersets
All superheroes (B) are a superset of those who use weapons (A)

Apache Superset™
What?
• Powerful data visualization tool
Superpowers
• Reads from SQL sources
• Lots of visualization and graph types, including geospatial

11. Apache Camel™

Apache Camel™
Camel train

Apache Camel™
What?
• Apache Camel – integration framework
• Apache Camel Kafka Connectors
Superpowers
• Large number of open source Kafka Connectors—179 sources and sinks
• Auto-generated from Camel components

Use case 4: Drone delivery

12. Uber’s Cadence®

Cadence®
Railway signal“man”
(signalwoman!)
(Source: Wikimedia)

Uber’s Cadence®
What?
• Scalable code-as-workflows engine
Superpowers
• Sequenced, stateful, long-running, scheduled steps
• Scalable and reliable using event-sourcing
o Workflows are failproof: history is replayed up to the point of failure, then the workflow resumes
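
The replay-and-resume behaviour can be illustrated with a small Python sketch. This is a toy model of event sourcing, not the Cadence client API: `run_workflow`, the `history` list, and the `fail_at` parameter are hypothetical and exist only to simulate a worker crash.

```python
def run_workflow(steps, history, fail_at=None):
    """Run named steps in order; steps already recorded in `history` are replayed, not re-run.

    `history` plays the role of the durable event log and survives a crash.
    Returns the names of the steps actually executed in this run.
    """
    executed = []
    for index, (name, step) in enumerate(steps):
        if index < len(history):
            continue  # replay: the result is already in the history
        if fail_at == name:
            raise RuntimeError(f"worker crashed before step {name!r}")
        history.append((name, step()))  # record the result before moving on
        executed.append(name)
    return executed
```

After a crash, calling `run_workflow` again with the same `history` skips the completed steps and continues from the first step that has no recorded result.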

Drone delivery application
Computationally expensive, mission-critical calculations
Kafka microservices integration of fast/slow systems

Drone waypoint flight calculations
Returning to base leg
• Drone flight path is computed in an activity
• Using location, distance, bearing, speed, and charge
• Every 10 seconds
• On failure, the drone won’t crash and will continue flying from the last location
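
One 10-second step of such a calculation might look like the following Python sketch. It uses the standard great-circle destination formula on a spherical Earth. The function name, the Earth radius constant, and the omission of battery charge are assumptions; the talk's actual calculation is not shown in the slides.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def next_waypoint(lat_deg: float, lon_deg: float, bearing_deg: float,
                  speed_m_s: float, dt_s: float = 10.0):
    """Return (lat, lon) in degrees after flying `dt_s` seconds on a fixed bearing."""
    angular_distance = speed_m_s * dt_s / EARTH_RADIUS_M
    lat1, lon1, bearing = map(math.radians, (lat_deg, lon_deg, bearing_deg))
    lat2 = math.asin(math.sin(lat1) * math.cos(angular_distance)
                     + math.cos(lat1) * math.sin(angular_distance) * math.cos(bearing))
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(angular_distance) * math.cos(lat1),
        math.cos(angular_distance) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)
```

Because the function depends only on the last recorded location, a restarted activity can simply call it again with that location, which matches the "continue flying from the last location" behaviour described above.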

Uber’s Cadence + Apache Kafka = similarities
Cadence (Workflows) | Kafka (Streaming Events)
Scalable (event sourcing) | Scalable (partitions, cluster)
Persistent (event sourcing) | Persistent (event replaying)
Reliable workflow execution (event sourcing) | Reliable event delivery
Asynchronous signals | Asynchronous events
Open source | Open source
Available as a managed service | Available as a managed service

Uber’s Cadence =
Orchestration (synchronous/timed sequences)
Different architectural
(musical) styles

Apache Kafka =
Choreography (asynchronous)
Different architectural
(musical) styles

Combined Cadence + Kafka = Ballet!
Integrated in a
new style

Cadence + Kafka = Complementary timescales
Cadence (Slow Workflows) | Kafka (Fast Streaming Events)
Synchronous events | Asynchronous events
Stateful flows | Stateless events
Sequences | One-off events
Slow/long running flows | Fast/instantaneous events
Sleep/schedule events | Real-time processing of events
Complex flow logic | Complex stream processing (Kafka Streams)

Cadence + Kafka =
Integration → Drone Ballet
Drone show, Japan

How many drones can we fly?

Cluster details (vCPUs): Client (8), Cadence (6), Cassandra (18)

Load test:
2,000 drones + 2,000 orders = 4,000 workflows

20 Drones flying
Purple = base
Black = drone
Orange = shop
Red = delivery location
Green = successful delivery

Use case 5: Streaming ML
Shop busy/not busy prediction
Busy! (Source: Getty Images)
Not Busy! (Source: Getty Images)

Drone learning problem
1. Simulation produces streaming spatiotemporal data (drone and order state and locations)
2. Kafka Streams computes aggregated hourly shop and order details → Busy/NotBusy categorization
3. Sent to TensorFlow: train model to predict shop busy/not busy an hour ahead
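
The hourly aggregation and labelling step can be sketched in plain Python. This is an illustration, not Kafka Streams code: the order tuples, the `busy_threshold` value, and the function names are hypothetical, and the only feature is the hourly order count.

```python
from collections import Counter


def hourly_counts(orders):
    """Aggregate (shop, epoch_seconds) order events into per-shop, per-hour counts."""
    return Counter((shop, timestamp // 3600) for shop, timestamp in orders)


def label(count: int, busy_threshold: int = 3) -> str:
    # Hypothetical rule: a shop is busy if it has at least `busy_threshold` orders in the hour.
    return "Busy" if count >= busy_threshold else "NotBusy"


def training_pairs(counts, busy_threshold: int = 3):
    """Pair each hour's count with the label of the following hour (predict one hour ahead)."""
    if not counts:
        return []
    last_hour = max(hour for _, hour in counts)
    pairs = []
    for shop, hour in sorted(counts):
        if hour + 1 <= last_hour:  # the hour after the last observed one is still unknown
            next_count = counts[(shop, hour + 1)]  # Counter gives 0 for hours with no orders
            pairs.append((counts[(shop, hour)], label(next_count, busy_threshold)))
    return pairs
```

Each pair is one training example: the input describes the current hour and the target is the busy/not busy category one hour later.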

13. TensorFlow

TensorFlow
What does the
future hold?

TensorFlow
What?
• Neural network ML library
Superpowers
• Supports incremental ML
• From streaming Kafka data

TensorFlow
Watch out for
• ML over streaming spatiotemporal data with concept drift is tricky
o Time/space bias
- Wild model accuracy oscillation
o Concept shift can result in very low-accuracy models initially
- Train/use multiple models

Use case 6: Santa’s elves’ toy and box packing
Kafka Streams, ChatGPT, RisingWave, and OpenTelemetry
Streaming joins to match toys and boxes
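
A streaming join of this kind can be sketched as a symmetric hash join in Python. This is a simplified illustration, not Kafka Streams or RisingWave code: matching on a `size` attribute is an assumption, `ToyBoxJoiner` is a hypothetical name, and there is no join window, so unmatched items wait indefinitely.

```python
from collections import defaultdict, deque


class ToyBoxJoiner:
    """Buffers unmatched toys and boxes per size and emits a pair when both sides exist."""

    def __init__(self) -> None:
        self.waiting = {"toy": defaultdict(deque), "box": defaultdict(deque)}

    def on_event(self, kind: str, size: str, item_id: str):
        """Handle one event from either stream; returns (toy_id, box_id) or None."""
        other = "box" if kind == "toy" else "toy"
        candidates = self.waiting[other][size]
        if candidates:
            match = candidates.popleft()  # oldest waiting item of the other kind
            return (item_id, match) if kind == "toy" else (match, item_id)
        self.waiting[kind][size].append(item_id)  # no partner yet: wait
        return None
```

A windowed stream-stream join works on the same principle, but also expires buffered items once the join window has passed.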

14. OpenTelemetry

OpenTelemetry
X-ray vision!
(Source: Wikimedia Public Domain)

OpenTelemetry
• OpenTelemetry is the new standard for distributed tracing
• Combines tracing (OpenTracing), metrics, and logs
• Automatic instrumentation
• Lots of open source visualization tools
- Jaeger, SigNoz, Uptrace, etc.
• Used in new client monitoring KIP-714
- Kafka 3.7.0

SigNoz service map for
toy+boxes application

15. RisingWave

RisingWave
Wave processing

RisingWave
What?
• Stream processing database—also as a service
Superpowers
• Stateful stream processing
o SQL syntax
o Using cloud native storage
o Potential replacement for Kafka Streams
• PostgreSQL compatible
o Works with Apache Superset for visualization

16. LLMs

LLMs
The Answer?
(Source: Wikimedia)

LLMs/GenAI
• E.g. ChatGPT
- not open source
+ there may be suitable open source alternatives for code generation
• Worked well to generate
+ Kafka clients
+ Kafka Streams DSL
+ and test-cases
• Not as accurate for RisingWave
- lack of examples?

Bonus Technologies from my Instaclustr colleagues
● Kafka benchmarking
○ Apache JMeter for Kafka benchmarking (Thanks to Anup Shirolkar)
○ OpenMessaging (Thanks to Alastair Daivis)
● Strimzi – a Kafka Operator for Kubernetes, and Debezium (CDC using Kafka Connect)
(Thanks to Felix Alipaz-Dicke)
● Kafka GUIs (Thanks to Ana-Maria Minda)
○ Kafdrop
○ AKHQ
○ UI for Apache Kafka
○ These all work with Kafka + Instaclustr console and provide complementary features

Ballet pattern → Hanoi street intersection pattern
● A working integrated synchronous + asynchronous system

I survived as a pedestrian!

Try us out
• We offer Apache Kafka and these open source technologies as a managed service
• You can use the others with our managed services
• FREE 30-day trial of developer-sized clusters

Paul Brebner | Instaclustr Technology Evangelist
www.Instaclustr.com/paul-brebner → All my blogs
Thank You!
