Decomposing SMACK Stack
Spark & Mesos Internals
Anton Kirillov Apache Spark Meetup
intro by Sebastian Stoll Ooyala, March 2016
Who is this guy?
@antonkirillov
● Staff Engineer in Data Team @ Ooyala
● Scala programmer
● Focused on distributed systems
● Building data platforms with SMACK/Hadoop
● Ph.D. in Computer Science
● blog: datastrophic.io
● github: github.com/datastrophic
Roadmap
● Intro to Ooyala stack
● SMACK stack
○ Overview
○ Architecture design options
● Spark
○ Core concepts & execution workflow
○ Architecture
● Mesos
○ Cluster resource management
○ Architecture and scheduling
○ Frameworks
○ Spark on Mesos
SMACK Stack Overview
components and architecture designs
SMACK Stack
● Spark - a generalized framework for distributed data processing
supporting in-memory data caching and reuse across computations
● Mesos - cluster resource management system that provides efficient
resource isolation and sharing across distributed applications
● Akka - a toolkit and runtime for building highly concurrent, distributed,
and resilient message-driven applications on the JVM
● Cassandra - distributed, highly available database designed to handle
large amounts of data across multiple datacenters
● Kafka - a high-throughput, low-latency distributed messaging system
designed for handling real-time data feeds
Storage Layer: Cassandra
● Pros:
○ optimized for heavy write
loads
○ configurable CA (CAP)
○ linearly scalable
○ XDCR support
○ easy cluster resizing and
inter-DC data migration
● Cons:
○ data model (distributed
nested sorted map)
○ designed for fast serving
but not batch processing
○ not well-suited for ad-hoc
queries against historical
raw data
Fixing NoSQL limitations with Spark
//joining raw events with rolled-up and grouping by type
sqlContext.sql {"""
SELECT
events.campaignId,
events.eventType,
events.value + campaigns.total as total_events
FROM events
JOIN campaigns
ON events.campaignId = campaigns.id AND events.eventType = campaigns.eventType
""".stripMargin
}.registerTempTable("joined")
sqlContext.sql {"""
SELECT campaignId, eventType, sum(total_events) as total
FROM joined
GROUP BY campaignId, eventType
""".stripMargin
}.saveAsCassandraTable("keyspace", "totals")
Architecture of Spark/Cassandra Clusters
Separate Write & Analytics:
● clusters can be scaled
independently
● data is replicated by
Cassandra asynchronously
● Analytics has different
Read/Write load patterns
● Analytics contains additional
data and processing results
● Spark resource impact
limited to only one DC
To fully facilitate Spark-C* connector data locality awareness,
Spark workers should be collocated with Cassandra nodes (gotcha: CL=ONE)
Mesos as Spark cluster manager
● fine-grained resource
sharing between Spark
and other applications
● scalable partitioning
between multiple
instances of Spark
● unified platform for
running various
applications
(frameworks)
● fault-tolerant and
scalable
Stream Processing with Kafka and Spark
● be prepared for failures and broken data
● backup and patching strategies should be designed upfront
● patching or restoring a time interval can be done by replay if the store is idempotent
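The replay point above can be illustrated with a minimal sketch. It assumes a hypothetical `Event` shape with a unique `id` and an in-memory `Store` standing in for a table keyed by that id; none of these names come from the slides. Because writes are upserts, replaying the same interval twice leaves the totals unchanged.

```scala
object IdempotentReplay {
  // Hypothetical event shape: the unique id is what makes writes idempotent.
  final case class Event(id: String, campaignId: String, value: Long)

  // Toy in-memory store standing in for a table keyed by event id.
  final class Store {
    private val rows = scala.collection.mutable.Map.empty[String, Event]

    // Upsert: writing the same event twice leaves a single row.
    def write(e: Event): Unit = rows.update(e.id, e)

    def total(campaignId: String): Long =
      rows.values.filter(_.campaignId == campaignId).map(_.value).sum

    def size: Int = rows.size
  }

  // Replaying a time interval is just writing its events again.
  def replay(store: Store, events: Seq[Event]): Unit = events.foreach(store.write)
}
```

With an append-only counter instead of an upsert, the second replay would double the totals, which is why the store has to be idempotent for this strategy to work.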
Spark Streaming with Kafka
//reuse the existing SparkContext instead of creating a second one from its conf
val streamingContext = new StreamingContext(sc, Seconds(10))
val eventStream = KafkaUtils.createStream(
ssc = streamingContext,
zkQuorum = "zoo01,zoo02,zoo03",
groupId = "spark_consumer",
topics = Map("raw_events" -> 3)
)
eventStream.map(_.toEvent)
.saveToCassandra(keyspace, table)
streamingContext.start()
streamingContext.awaitTermination()
Data Ingestion with Akka
● actor model implementation
for JVM
● message-based and
asynchronous
● easily scalable from one
process to cluster of
machines
● actor hierarchies with
parental supervision
● easily packaged in Docker to
be run on Mesos
Akka Http microservice
val config = new ProducerConfig(KafkaConfig())
lazy val producer = new KafkaProducer[A, A](config)
val routes: Route = {
post{
decodeRequest{
entity(as[String]){ str =>
JsonParser.parse(str).validate[Event] match {
case s: JsSuccess[Event] =>
producer.send(new KeyedMessage(topic, str))
system.actorOf(Props[CassandraWriterActor]) ! s.get
case e: JsError => BadRequest -> JsError.toFlatJson(e).toString()
}
}
}
}
}
object AkkaHttpMicroservice extends App with Service {
Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.port"))
}
Writing to Cassandra with Akka
class CassandraWriterActor extends Actor with ActorLogging {
//for demo purposes, session initialized here
val session = Cluster.builder()
.addContactPoint("cassandra.host")
.build()
.connect()
override def receive: Receive = {
case event: Event =>
val statement = new SimpleStatement(event.createQuery)
.setConsistencyLevel(ConsistencyLevel.QUORUM)
Try(session.execute(statement)) match {
case Failure(ex) => //error handling code
case Success(_) => sender ! WriteSuccessfull
}
}
}
Lambda Architecture with SMACK
● when design meets reality it’s hard to implement canonical architecture
● depending on the use case it’s easy to implement Kappa architecture as well
SMACK stack:
● concise toolbox for wide variety of data processing scenarios
● battle-tested and widely used software with large communities
● easy scalability and replication of data while preserving low latencies
● unified cluster management for heterogeneous loads
● single platform for any kind of applications
● implementation platform for different architecture designs
● really short time-to-market (e.g. for MVP verification)
Apache Spark in Depth
core concepts, architecture & internals
Meet Spark
● Generalized framework for distributed data processing (batch, graph, ML)
● Scala collections functional API for manipulating data at scale
● In-memory data caching and reuse across computations
● Applies set of coarse-grained transformations over partitioned data
● Failure recovery relies on lineage to recompute failed tasks
● Supports majority of input formats and integrates with Mesos / YARN
Spark makes data engineers happy
Backup/restore of Cassandra tables in Parquet
def backup(config: Config) {
sc.cassandraTable(config.keyspace, config.table).map(_.toEvent).toDF()
.write.parquet(config.path)
}
def restore(config: Config) {
sqlContext.read.parquet(config.path)
.map(_.toEvent).saveToCassandra(config.keyspace, config.table)
}
Query different data sources to identify discrepancies
sqlContext.sql {
"""
SELECT count(*)
FROM cassandra_event_rollups
JOIN mongo_event_rollups
ON cassandra_event_rollups.uuid = mongo_event_rollups.uuid
WHERE cassandra_event_rollups.value != mongo_event_rollups.value
""".stripMargin
}
Core Concepts
RDD: Resilient Distributed Dataset
● A fault-tolerant, immutable, parallel data structure
● Provides API for
○ manipulating the collection of elements (transformations and materialization)
○ persisting intermediate results in memory for later reuse
○ controlling partitioning to optimize data placement
● Can be created through deterministic operation
○ from storage (distributed file system, database, plain file)
○ from another RDD
● Stores information about parent RDDs
○ for execution optimization and operations pipelining
○ to recompute the data in case of failure
RDD: a developer’s view
● Distributed immutable data + lazily evaluated operations
○ partitioned data + iterator
○ transformations & actions
● An interface defining 5 main properties
a list of partitions (e.g. splits in Hadoop)
def getPartitions: Array[Partition]
a list of dependencies on other RDDs
def getDependencies: Seq[Dependency[_]]
a function for computing each split
def compute(split: Partition, context: TaskContext): Iterator[T]
(optional) a list of preferred locations to compute each split on
def getPreferredLocations(split: Partition): Seq[String] = Nil
(optional) a partitioner for key-value RDDs
val partitioner: Option[Partitioner] = None
(the first three properties capture lineage; the two optional ones drive execution optimization)
RDDs Example
● HadoopRDD
○ getPartitions = HDFS blocks
○ getDependencies = None
○ compute = load block in memory
○ getPreferredLocations = HDFS block locations
○ partitioner = None
● MapPartitionsRDD
○ getPartitions = same as parent
○ getDependencies = parent RDD
○ compute = compute parent and apply map()
○ getPreferredLocations = same as parent
○ partitioner = None
sparkContext.textFile("hdfs://...")
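The two RDD examples above can be mirrored in a toy model. This is a simplified sketch and not Spark's actual classes: `ToyRDD`, `BlocksRDD` and `MappedRDD` are hypothetical names, the partitioner is omitted, and everything runs in one JVM. It shows how a transformation only records lineage while an action triggers `compute` on every partition.

```scala
object ToyRdd {
  // Simplified stand-ins for Spark's Partition and RDD; not the real API.
  final case class Partition(index: Int)

  abstract class ToyRDD[T] {
    def getPartitions: Seq[Partition]
    def getDependencies: Seq[ToyRDD[_]]
    def compute(split: Partition): Iterator[T]
    def getPreferredLocations(split: Partition): Seq[String] = Nil

    // Transformation: only records lineage, nothing is computed yet.
    def map[U](f: T => U): ToyRDD[U] = new MappedRDD(this, f)

    // Action: walks all partitions and materializes the result.
    def collect(): Seq[T] = getPartitions.flatMap(p => compute(p).toList)
  }

  // Source RDD: each "block" is one partition (compare HadoopRDD).
  final class BlocksRDD[T](blocks: Vector[Vector[T]]) extends ToyRDD[T] {
    def getPartitions: Seq[Partition] = blocks.indices.map(i => Partition(i))
    def getDependencies: Seq[ToyRDD[_]] = Nil
    def compute(split: Partition): Iterator[T] = blocks(split.index).iterator
  }

  // Child RDD with a narrow dependency on its parent (compare MapPartitionsRDD).
  final class MappedRDD[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
    def getPartitions: Seq[Partition] = parent.getPartitions
    def getDependencies: Seq[ToyRDD[_]] = Seq(parent)
    def compute(split: Partition): Iterator[U] = parent.compute(split).map(f)
    override def getPreferredLocations(split: Partition): Seq[String] =
      parent.getPreferredLocations(split)
  }
}
```

Calling `new BlocksRDD(Vector(Vector(1, 2), Vector(3))).map(_ * 10)` evaluates nothing; the mapping function runs only once `collect()` is called.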
RDD Operations
● Transformations
○ apply user function to every element in a partition (or to the whole partition)
○ apply aggregation function to the whole dataset (groupBy, sortBy)
○ introduce dependencies between RDDs to form DAG
○ provide functionality for repartitioning (repartition, partitionBy)
● Actions
○ trigger job execution
○ used to materialize computation results
● Extra: persistence
○ explicitly store RDDs in memory, on disk or off-heap (cache, persist)
○ checkpointing for truncating RDD lineage
Execution workflow
rdd1.join(rdd2).groupBy(...).filter(...)
● DAGScheduler: splits graph into stages of tasks, submits each stage as ready
● TaskScheduler: launches tasks via cluster manager, retries failed or straggling tasks
● Worker: executes tasks, stores and serves blocks
Code sample: joining aggregated and raw data
//aggregate events after specific date for given campaign
val events = sc.cassandraTable("demo", "event")
.map(_.toEvent)
.filter(event => event.campaignId == campaignId && event.time.isAfter(watermark))
.keyBy(_.eventType)
.reduceByKey(_ + _)
.cache()
//aggregate campaigns by type
val campaigns = sc.cassandraTable("demo", "campaign")
.map(_.toCampaign)
.filter(campaign => campaign.id == campaignId && campaign.time.isBefore(watermark))
.keyBy(_.eventType)
.reduceByKey(_ + _)
.cache()
//joined rollups and raw events
val joinedTotals = campaigns.join(events)
.map { case (key, (campaign, event)) => CampaignTotals(campaign, event) }
.collect()
//count totals separately
val eventTotals = events.map{ case (t, e) => s"$t -> ${e.value}" }.collect()
val campaignTotals = campaigns.map{ case (t, e) => s"$t -> ${e.value}" }.collect()
DAG
Dependency types
● Narrow (pipelineable)
○ each partition of the parent RDD is used by at most
one partition of the child RDD
○ allow for pipelined execution on one cluster node
○ failure recovery is more efficient as only lost parent
partitions need to be recomputed
● Wide (shuffle)
○ multiple child partitions may depend on one parent
partition
○ require data from all parent partitions to be available
and to be shuffled across the nodes
○ if some partition is lost from all the ancestors a
complete recomputation is needed
Stages and Tasks
● Stages breakdown strategy
○ check backwards from final RDD
○ add each “narrow” dependency to
the current stage
○ create new stage when there’s a
shuffle dependency
● Tasks
○ ShuffleMapTask partitions its
input for shuffle
○ ResultTask sends its output to
the driver
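The breakdown strategy above can be sketched on a toy lineage graph. This is an illustration under simplifying assumptions, not Spark's DAGScheduler: `Node` is a hypothetical type, a node's dependencies on its parents are either all narrow or all shuffle, and shared ancestors are not deduplicated.

```scala
object StageSplit {
  // Toy lineage node: `shuffle` marks a wide dependency on its parents.
  final case class Node(name: String, parents: Seq[Node] = Nil, shuffle: Boolean = false)

  // Walk backwards from the final node. Narrow parents join the current
  // stage; a shuffle dependency closes it and starts new parent stages.
  // Stages are returned in an order in which they could be executed.
  def stages(last: Node): List[Set[String]] = {
    def visit(node: Node, current: Set[String]): (Set[String], List[Set[String]]) = {
      val withNode = current + node.name
      if (node.shuffle) {
        // Each parent behind a shuffle becomes the root of an earlier stage.
        (withNode, node.parents.toList.flatMap(p => stages(p)))
      } else {
        node.parents.foldLeft((withNode, List.empty[Set[String]])) {
          case ((stage, earlier), parent) =>
            val (s, e) = visit(parent, stage)
            (s, earlier ++ e)
        }
      }
    }
    val (finalStage, earlier) = visit(last, Set.empty)
    earlier :+ finalStage
  }
}
```

For a lineage shaped like `rdd1.join(rdd2).groupBy(...).filter(...)`, with join and groupBy marked as shuffles, the sketch yields four stages: one per input, one for the join output, and a final one where groupBy and filter are pipelined together.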
Shuffle
● Shuffle Write
○ redistributes data among partitions
and writes files to disk
○ each shuffle task creates one file
with regions assigned to reducer
○ sort shuffle uses in-memory sorting
with spillover to disk to get final
result
● Shuffle Read
○ fetches the files and applies
reduce() logic
○ if data ordering is needed then it is
sorted on “reducer” side for any
type of shuffle
Sort Shuffle
● Incoming records are accumulated
and sorted in memory according to
their target partition ids
● Sorted records are written to file
or multiple files if spilled and
then merged
● index file stores offsets of the
data blocks in the data file
● Sorting without deserialization is
possible under certain conditions
(SPARK-7081)
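The data file plus index file layout described above can be sketched in memory. This is a toy model with hypothetical names (`ShuffleOutput`, `write`, `read`): offsets are counted in records rather than bytes, there is no spilling or merging, and hash partitioning stands in for a real partitioner.

```scala
object SortShuffleSketch {
  // Output of one map task: a single data "file" plus an index of offsets.
  final case class ShuffleOutput(data: Vector[(String, Int)], index: Vector[Int])

  // Hash partitioning as a stand-in for a partitioner (non-negative modulo).
  def partitionOf(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def write(records: Seq[(String, Int)], numPartitions: Int): ShuffleOutput = {
    // Sort by target partition id only; keys inside a region stay unordered.
    val sorted = records.sortBy { case (k, _) => partitionOf(k, numPartitions) }.toVector
    val counts = (0 until numPartitions).map { p =>
      sorted.count { case (k, _) => partitionOf(k, numPartitions) == p }
    }
    // index(p) is where region p starts, index(p + 1) is where it ends.
    val index = counts.scanLeft(0)(_ + _).toVector
    ShuffleOutput(sorted, index)
  }

  // A "reducer" reads only its own region, located through the index.
  def read(out: ShuffleOutput, partition: Int): Vector[(String, Int)] =
    out.data.slice(out.index(partition), out.index(partition + 1))
}
```

The index has one more entry than there are partitions, so every region is delimited by two adjacent offsets and a reader never has to scan the data of other partitions.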
Spark Components
Memory Management in Spark 1.6
● Execution Memory
○ storage for data needed during tasks execution
○ shuffle-related data
● Storage Memory
○ storage of cached RDDs and broadcast variables
○ possible to borrow from execution memory
(spill otherwise)
○ safeguard value is 0.5 of Spark Memory when cached
blocks are immune to eviction
● User Memory
○ user data structures and internal metadata in Spark
○ safeguarding against OOM
● Reserved memory
○ memory needed for running executor itself and not
strictly related to Spark
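A rough sizing of these regions can be worked out with simple arithmetic. The sketch below assumes what are, to the best of my knowledge, the Spark 1.6 defaults: 300 MB of reserved memory, `spark.memory.fraction = 0.75` and `spark.memory.storageFraction = 0.5`. The slides only mention the 0.5 safeguard, so the other two values and the names `Regions` and `regions` are assumptions and should be checked against the configuration of the Spark version in use.

```scala
object MemoryRegions {
  final case class Regions(reservedMb: Double, userMb: Double, storageMb: Double, executionMb: Double)

  // Default values are assumed Spark 1.6 settings, not taken from the slides.
  def regions(heapMb: Double,
              memoryFraction: Double = 0.75,
              storageFraction: Double = 0.5,
              reservedMb: Double = 300.0): Regions = {
    val usable = heapMb - reservedMb            // heap left after reserved memory
    val sparkMemory = usable * memoryFraction   // shared by execution and storage
    val storage = sparkMemory * storageFraction // part of storage immune to eviction
    Regions(reservedMb, usable - sparkMemory, storage, sparkMemory - storage)
  }
}
```

For a 4396 MB heap this gives 1024 MB of user memory and 1536 MB each for the storage and execution regions, with the boundary between the last two being soft because of borrowing.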
Execution Modes
● spark-shell --master [ local | spark | yarn-client | mesos]
○ launches REPL connected to specified cluster manager
○ always runs in client mode
● spark-submit --master [ local | spark:// | mesos:// | yarn ] spark-job.jar
○ launches assembly jar on the cluster
● Masters
○ local[k] - run Spark locally with K worker threads
○ spark - launches driver app on Spark Standalone installation
○ mesos - driver will spawn executors on Mesos cluster (deploy-mode: client | cluster)
○ yarn - same idea as with Mesos (deploy-mode: client | cluster)
● Deploy Modes
○ client - driver executed as a separate process on the machine where it has been launched and
spawns executors
○ cluster - driver launched as a container using underlying cluster manager
Apache Mesos
architecture, scheduling, frameworks & Spark
Cluster Resource Managers: Requirements
● Efficiency
○ efficient sharing of resources across applications
○ utilization of cluster resources in the most optimal manner
● Flexibility
○ support of wide array of current and future frameworks
○ dealing with hardware heterogeneity
○ support of resource requests of different types
● Scalability
○ scaling to clusters of tens of thousands of nodes
○ scheduling system’s response times must remain acceptable while
increasing number of machines and applications
● Robustness
○ fault-tolerant guarantees for the system and applications
○ high availability of central scheduler component
Cluster Manager Architectures
source: Omega: flexible, scalable schedulers for large compute clusters
Mesos Architecture
● Master
○ a mediator between slave
resources and frameworks
○ enables fine-grained sharing of
resources by making resource
offers
● Slave
○ manages resources on physical
node and runs executors
● Framework
○ application that solves a specific
use case
○ Scheduler negotiates with master
and handles resource offers
○ Executors consume resources and
run tasks on slaves
Two-Level Scheduling
● Slave nodes report to Master
amount of available resources
● Allocation module starts offering
resources to frameworks
● Framework receives offers
○ if resources do not satisfy its
needs - rejects the offer
○ if resources satisfy its
demands - creates list of
tasks and sends to master
● Master verifies tasks and forwards
to executor (and launches the
executor if it’s not running)
Resource offer
id: { value: "0cb2328a-61c2-4316-91ef-cbbb6ebbf504-O1" }
framework_id: { value: "0cb2328a-61c2-4316-91ef-cbbb6ebbf504-0001" }
slave_id: { value: "0cb2328a-61c2-4316-91ef-cbbb6ebbf504-S0" }
hostname: "mesos-slave"
resources { name: "cpus", type: SCALAR, scalar { value: 6.0 }, role: "*" }
resources { name: "mem", type: SCALAR, scalar { value: 6762.0 }, role: "*" }
resources { name: "disk", type: SCALAR, scalar { value: 13483.0 }, role: "*" }
resources { name: "ports", type: RANGES, ranges { range { begin: 31000, end: 32000 } }, role: "*" }
url {
scheme: "http"
address {
hostname: "mesos-slave"
ip: "172.18.0.5"
port: 5151
}
path: "/slave(1)"
}
Framework Scheduler
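import java.util.{Collections, List => JList}
import org.apache.mesos.{Scheduler, SchedulerDriver}
import org.apache.mesos.Protos.Offer
import scala.collection.JavaConverters._

class SomeMesosScheduler extends Scheduler {
  //the Mesos Java API passes offers as a java.util.List
  override def resourceOffers(driver: SchedulerDriver, offers: JList[Offer]): Unit = {
    for (offer <- offers.asScala) {
      stateLock.synchronized {
        if (isOfferValid(offer)) {
          val executorInfo = buildExecutorInfo(driver, "Executor A")
          //amount of tasks is calculated to fully use resources from the offer
          //(buildTasks is assumed to return a java.util.Collection of TaskInfo)
          val tasks = buildTasks(offer, executorInfo)
          driver.launchTasks(Collections.singletonList(offer.getId), tasks)
        } else {
          driver.declineOffer(offer.getId)
        }
      }
    }
  }
  //rest of the methods implementations go here
}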
Dominant Resource Fairness (DRF)
● Dominant resource
○ a resource of specific type (cpu, ram, etc.) which is most demanded by a framework among
other resources it needs
○ the resource is identified as a share of the total cluster resources of the same type
● Dominant share
○ a share of dominant resource allocated to a framework in the cluster
● Example:
○ Cluster total: 9 CPU & 18 GB RAM
○ Framework A tasks need < 3 CPU, 1 GB > (or < 33% CPU, 6% RAM >)
○ Framework B tasks need < 1 CPU, 4 GB > (or < 11% CPU, 22% RAM >)
● DRF algorithm computes frameworks’ dominant shares and tries to maximize
the smallest dominant share in the system
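The example above can be run through a small sketch of DRF's progressive filling. This is a simplified, single-threaded illustration and not the Mesos allocator: `Demand` and `allocate` are hypothetical names, every task of a framework has the same demand, and allocation stops as soon as the framework with the smallest dominant share cannot fit another task.

```scala
object DrfSketch {
  final case class Demand(cpu: Double, ram: Double)

  // Progressive filling: repeatedly launch one task for the framework with the
  // smallest dominant share, until that framework's next task no longer fits.
  def allocate(totalCpu: Double, totalRam: Double, demands: Map[String, Demand]): Map[String, Int] = {
    var usedCpu = 0.0
    var usedRam = 0.0
    var tasks = demands.keys.map(_ -> 0).toMap

    // Dominant share: the largest fraction of any resource type a framework holds.
    def dominantShare(name: String): Double = {
      val d = demands(name)
      math.max(tasks(name) * d.cpu / totalCpu, tasks(name) * d.ram / totalRam)
    }

    var done = false
    while (!done) {
      // Ties are broken by name to keep the result deterministic.
      val next = demands.keys.toSeq.sorted.minBy(dominantShare)
      val d = demands(next)
      if (usedCpu + d.cpu <= totalCpu && usedRam + d.ram <= totalRam) {
        usedCpu += d.cpu
        usedRam += d.ram
        tasks = tasks.updated(next, tasks(next) + 1)
      } else done = true
    }
    tasks
  }
}
```

With the cluster of 9 CPU and 18 GB from the example, `allocate(9.0, 18.0, Map("A" -> Demand(3.0, 1.0), "B" -> Demand(1.0, 4.0)))` gives Framework A two tasks and Framework B three tasks, so both end up with a dominant share of about 67% (6 of 9 CPUs for A, 12 of 18 GB for B).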
DRF Demo
● 3 frameworks with < 8% CPU, 7.5% RAM > demand each
● Framework A < 33% CPU, 15% RAM >, Framework B < 16% CPU, 30% RAM >
● Framework A < 33% CPU, 15% RAM >, Framework B < 16% CPU, 36% RAM >
DRF properties
● Sharing incentive
○ Each user should be better off sharing the cluster, than exclusively using her own partition of
the cluster. Consider a cluster with identical nodes and n users. Then a user should not be
able to allocate more tasks in a cluster partition consisting of 1/n of all resources.
● Strategy-proofness
○ Users should not be able to benefit by lying about their resource demands. This provides
incentive compatibility, as a user cannot improve her allocation by lying.
● Envy-freeness
○ A user should not prefer the allocation of another user. This property embodies the notion of
fairness.
● Pareto efficiency
○ It should not be possible to increase the allocation of a user without decreasing the allocation
of at least another user. This property is important as it leads to maximizing system utilization
subject to satisfying the other properties.
source: Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
Resource Reservation
● Goals:
○ allocate all single slave resources to one type of framework
○ divide cluster between several framework types or organisations
○ framework groups prioritization and guaranteed allocation
● Static reservation
○ slave node is configured on start (cannot be reserved for another role or unreserved)
--resources="cpus:4;mem:2048;cpus(spark):8;mem(spark):4096"
● Dynamic reservation
○ resources are reserved/unreserved in response to a resource offer
Offer::Operation::Reserve
○ MESOS-2018
● Extras:
○ persistent volumes
○ multiple disk resources
Resource Isolation
● Goals:
○ running tasks isolation and capping of runtime resources
○ programmatic control over task resources
○ use images to allow different environments
● Docker containerizer
○ executed tasks are docker containers (e.g. microservices packed in Docker)
● Mesos containerizer (default)
○ Mesos-native (no dependencies on other technologies)
○ provides fine-grained controls (cgroups/namespaces)
○ provides disk usage limits controls
● Composing
○ allows using multiple containerizers together
○ the first containerizer supporting task configuration will be used to launch it
Ubiquitous frameworks: Marathon
● distributed init.d
● long running tasks
execution
● HA mode with ZooKeeper
● Docker executor
● REST API
Marathon: launching Chronos in Docker
curl -XPOST 'http://marathon:8080/v2/apps' -H 'Content-Type: application/json' -d '{
"id": "chronos",
"container": {
"type": "DOCKER",
"docker": {
"network": "HOST",
"image": "datastrophic/chronos:mesos-0.27.1-chronos-2.5",
"parameters": [
{ "key": "env", "value": "CHRONOS_HTTP_PORT=4400" },
{ "key": "env", "value": "CHRONOS_MASTER=zk://zookeeper:2181/mesos" },
{ "key": "env", "value": "CHRONOS_ZK_HOSTS=zookeeper:2181"}
]
}
},
"ports": [ 4400 ],
"cpus": 1,
"mem": 512,
"instances": 1
}'
Ubiquitous frameworks: Chronos
● distributed cron
● HA mode with ZooKeeper
● supports graphs of jobs
● sensitive to network failures
More Mesos frameworks
● Spark
● Hadoop
● Cassandra
● Kafka
● Myriad: YARN on Mesos
● Storm
● Samza
Spark on Mesos
● Coarse-grained mode (default)
○ Spark Executor is launched one per Slave
and acquires all available cores in cluster
○ Tasks are scheduled by Spark relying on its
RPC mechanism (Akka)
● Fine-grained mode
○ Spark Executor is launched one per Slave
with minimal resources needed (1 core)
○ Spark tasks are executed as Mesos tasks
and use Mesos semantics
Spark on Mesos
● Coarse-grained mode
/opt/spark/bin/spark-submit \
  --class io.datastrophic.demo.SparkJob \
  --master mesos://zk://zookeeper:2181/mesos \
  --conf "spark.cores.max=10" \
  /opt/jobs/spark-jobs-assembly.jar
● Fine-grained mode
/opt/spark/bin/spark-submit \
  --class io.datastrophic.demo.SparkJob \
  --master mesos://zk://zookeeper:2181/mesos \
  --conf "spark.mesos.coarse=false" \
  /opt/jobs/spark-jobs-assembly.jar
Spark on Mesos vs. YARN
● Mesos (coarse-grained)
/opt/spark/bin/spark-submit \
  --class io.datastrophic.demo.SparkJob \
  --master mesos://zk://zookeeper:2181/mesos \
  --conf "spark.cores.max=100" \
  /opt/jobs/spark-jobs-assembly.jar
● YARN
/opt/spark/bin/spark-submit \
  --class io.datastrophic.demo.SparkJob \
  --master yarn \
  --num-executors 25 \
  --executor-cores 4 \
  /opt/jobs/spark-jobs-assembly.jar
Running Spark via Marathon
curl -XPOST 'http://marathon:8080/v2/apps' -H 'Content-Type: application/json' -d '{
"cmd": "/opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos --deploy-mode client /opt/jobs/spark-jobs-assembly.jar",
"id": "spark-pi",
"cpus": 1,
"mem": 1024,
"instances": 1
}'
Running Spark via Chronos
curl -L -H 'Content-Type: application/json' -X POST http://mesos:4400/scheduler/iso8601 -d '{
"name": "Scheduled Spark Submit Job",
"command": "/opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos /opt/jobs/spark-jobs-assembly.jar",
"shell": true,
"async": false,
"cpus": 0.1,
"disk": 256,
"mem": 1024,
"owner": "anton@datastrophic.io",
"description": "Spark Job executed every 3 minutes",
"schedule": "R/2016-03-14T12:35:00.000Z/PT3M"
}'
Spark deployment strategies
● Binaries distribution
○ every node in the cluster must have Spark libraries installed in the same locations
○ pros: easy to start with
○ cons: hard to upgrade, hard to have several Spark versions simultaneously
● Edge nodes
○ use nodes with specific environment setup which are reachable from Mesos cluster and keep
Spark executor jars in accessible location like S3, HTTP or HDFS
○ pros: easy to use multiple Spark versions, minimal dependencies on Mesos
○ cons: hard to maintain in case of multi-tenancy
● Dockerized environment
○ Instead of edge nodes use Docker containers with environment configured for specific needs
(hosts still need to be reachable from Mesos cluster) and use Docker Spark executor
○ pros: highly isolated environments for specific needs, could be upgraded independently, zero
impact on cluster nodes
○ cons: could be hard to set up and configure properly
Mesos Framework Walkthrough
● Throttler
○ a demo framework for load testing Cassandra
○ load intensity is controlled by parameters: total queries, queries per task and
parallelism (how many Mesos tasks to run in parallel)
● Goals
○ take a look at working (simple) Mesos application
○ see how Scheduler, Executor and framework launcher could be implemented
● Sources:
○ source code and dockerized Mesos cluster configuration are available at
github.com/datastrophic/mesos-workshop
○ all the examples (and even more) available as well
Questions
@antonkirillov datastrophic.io

More Related Content

What's hot

Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
Ben Laird
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
Knoldus Inc.
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
rhatr
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
Anant Rustagi
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
Rahul Kumar
 
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningApache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
DataWorks Summit
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark Summit
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Timothy Spann
 

What's hot (20)

Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...





Data processing platforms with SMACK: Spark and Mesos internals

  • 1. Decomposing SMACK Stack: Spark & Mesos Internals. Anton Kirillov, Apache Spark Meetup, intro by Sebastian Stoll, Ooyala, March 2016
  • 2. Who is this guy? @antonkirillov ● Staff Engineer in Data Team @ Ooyala ● Scala programmer ● Focused on distributed systems ● Building data platforms with SMACK/Hadoop ● Ph.D. in Computer Science ● blog: datastrophic.io ● github: github.com/datastrophic
  • 3. Roadmap ● Intro to Ooyala stack ● SMACK stack ○ Overview ○ Architecture design options ● Spark ○ Core concepts & execution workflow ○ Architecture ● Mesos ○ Cluster resource management ○ Architecture and scheduling ○ Frameworks ○ Spark on Mesos
  • 4.
  • 5.
  • 6.
  • 7.
  • 8. SMACK Stack Overview components and architecture designs
  • 9. SMACK Stack ● Spark - a generalized framework for distributed data processing supporting in-memory data caching and reuse across computations ● Mesos - cluster resource management system that provides efficient resource isolation and sharing across distributed applications ● Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM ● Cassandra - distributed, highly available database designed to handle large amounts of data across multiple datacenters ● Kafka - a high-throughput, low-latency distributed messaging system designed for handling real-time data feeds
  • 10. Storage Layer: Cassandra ● Pros: ○ optimized for heavy write loads ○ configurable CA (CAP) ○ linearly scalable ○ XDCR support ○ easy cluster resizing and inter-DC data migration ● Cons: ○ data model (distributed nested sorted map) ○ designed for fast serving but not batch processing ○ not well-suited for ad-hoc queries against historical raw data
  • 11. Fixing NoSQL limitations with Spark
      //joining raw events with rolled-up and grouping by type
      sqlContext.sql {"""
        SELECT
          events.campaignId,
          events.eventType,
          events.value + campaigns.total as total_events
        FROM events
        JOIN campaigns
        ON events.campaignId = campaigns.id AND events.eventType = campaigns.eventType
        """.stripMargin
      }.registerTempTable("joined")
      sqlContext.sql {"""
        SELECT campaignId, eventType, sum(total_events) as total
        FROM joined
        GROUP BY campaignId, eventType
        """.stripMargin
      }.saveAsCassandraTable("keyspace", "totals")
  • 12. Architecture of Spark/Cassandra Clusters Separate Write & Analytics: ● clusters can be scaled independently ● data is replicated by Cassandra asynchronously ● Analytics has different Read/Write load patterns ● Analytics contains additional data and processing results ● Spark resource impact limited to only one DC To fully facilitate Spark-C* connector data locality awareness, Spark workers should be collocated with Cassandra nodes (gotcha: CL=ONE)
  • 13. Mesos as Spark cluster manager ● fine-grained resource sharing between Spark and other applications ● scalable partitioning between multiple instances of Spark ● unified platform for running various applications (frameworks) ● fault-tolerant and scalable
  • 14. Stream Processing with Kafka and Spark ● be prepared for failures and broken data ● backup and patching strategies should be designed upfront ● patching/restoring a time interval can be done by replay if the store is idempotent
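The replay point on slide 14 can be illustrated with a minimal sketch in plain Scala. It assumes the store behaves like a keyed upsert (as Cassandra writes do for a given primary key); `Event`, `write`, and the in-memory `Map` standing in for the table are hypothetical names used only for illustration.

```scala
object ReplaySketch {
  // Hypothetical event shape; `id` plays the role of the primary key.
  final case class Event(id: String, campaignId: String, value: Long)

  // Idempotent write: an upsert keyed by event id.
  // Writing the same batch once or many times leaves the store in the same state.
  def write(store: Map[String, Event], batch: Seq[Event]): Map[String, Event] =
    batch.foldLeft(store)((acc, event) => acc.updated(event.id, event))
}
```

With such a store, replaying a time interval from Kafka after a failure, or replaying a patched version of a broken record, overwrites rows instead of duplicating them.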
  • 15. Spark Streaming with Kafka
      // reuse the existing SparkContext instead of building a second one from its conf
      val streamingContext = new StreamingContext(sc, Seconds(10))
      val eventStream = KafkaUtils.createStream(
        ssc = streamingContext,
        zkQuorum = "zoo01,zoo02,zoo03",
        groupId = "spark_consumer",
        topics = Map("raw_events" -> 3)
      )
      eventStream.map(_.toEvent)
        .saveToCassandra(keyspace, table)
      streamingContext.start()
      streamingContext.awaitTermination()
  • 16. Data Ingestion with Akka ● actor model implementation for JVM ● message-based and asynchronous ● easily scalable from one process to cluster of machines ● actor hierarchies with parental supervision ● easily packaged in Docker to be run on Mesos
  • 17. Akka Http microservice
      val config = new ProducerConfig(KafkaConfig())
      lazy val producer = new KafkaProducer[A, A](config)
      val routes: Route = {
        post {
          decodeRequest {
            entity(as[String]) { str =>
              JsonParser.parse(str).validate[Event] match {
                case s: JsSuccess[String] =>
                  producer.send(new KeyedMessage(topic, str))
                  system.actorOf(Props[CassandraWriter]) ! s.get
                case e: JsError =>
                  BadRequest -> JsError.toFlatJson(e).toString()
              }
            }
          }
        }
      }
      object AkkaHttpMicroservice extends App with Service {
        Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.port"))
      }
  • 18. Writing to Cassandra with Akka
      class CassandraWriterActor extends Actor with ActorLogging {
        //for demo purposes, session initialized here
        val session = Cluster.builder()
          .addContactPoint("cassandra.host")
          .build()
          .connect()
        override def receive: Receive = {
          case event: Event =>
            val statement = new SimpleStatement(event.createQuery)
              .setConsistencyLevel(ConsistencyLevel.QUORUM)
            Try(session.execute(statement)) match {
              case Failure(ex) => //error handling code
              case Success(_) => sender ! WriteSuccessfull
            }
        }
      }
  • 19. Lambda Architecture with SMACK ● when design meets reality it’s hard to implement canonical architecture ● depending on the use case it’s easy to implement Kappa architecture as well
  • 20. SMACK stack: ● concise toolbox for wide variety of data processing scenarios ● battle-tested and widely used software with large communities ● easy scalability and replication of data while preserving low latencies ● unified cluster management for heterogeneous loads ● single platform for any kind of applications ● implementation platform for different architecture designs ● really short time-to-market (e.g. for MVP verification)
  • 21. Apache Spark in Depth core concepts, architecture & internals
  • 22. Meet Spark ● Generalized framework for distributed data processing (batch, graph, ML) ● Scala collections functional API for manipulating data at scale ● In-memory data caching and reuse across computations ● Applies set of coarse-grained transformations over partitioned data ● Failure recovery relies on lineage to recompute failed tasks ● Supports majority of input formats and integrates with Mesos / YARN
  • 23. Spark makes data engineers happy
      Backup/restore of Cassandra tables in Parquet
      def backup(config: Config) {
        sc.cassandraTable(config.keyspace, config.table).map(_.toEvent).toDF()
          .write.parquet(config.path)
      }
      def restore(config: Config) {
        sqlContext.read.parquet(config.path)
          .map(_.toEvent).saveToCassandra(config.keyspace, config.table)
      }
      Query different data sources to identify discrepancies
      sqlContext.sql {
        """
        SELECT count(*)
        FROM cassandra_event_rollups
        JOIN mongo_event_rollups
        ON cassandra_event_rollups.uuid = mongo_event_rollups.uuid
        WHERE cassandra_event_rollups.value != mongo_event_rollups.value
        """.stripMargin
      }
  • 25. RDD: Resilient Distributed Dataset ● A fault-tolerant, immutable, parallel data structure ● Provides API for ○ manipulating the collection of elements (transformations and materialization) ○ persisting intermediate results in memory for later reuse ○ controlling partitioning to optimize data placement ● Can be created through deterministic operation ○ from storage (distributed file system, database, plain file) ○ from another RDD ● Stores information about parent RDDs ○ for execution optimization and operations pipelining ○ to recompute the data in case of failure
  • 26. RDD: a developer’s view ● Distributed immutable data + lazily evaluated operations ○ partitioned data + iterator ○ transformations & actions ● An interface defining 5 main properties (slide annotations: lineage, execution optimization)
      a list of partitions (e.g. splits in Hadoop)
        def getPartitions: Array[Partition]
      a list of dependencies on other RDDs
        def getDependencies: Seq[Dependency[_]]
      a function for computing each split
        def compute(split: Partition, context: TaskContext): Iterator[T]
      (optional) a list of preferred locations to compute each split on
        def getPreferredLocations(split: Partition): Seq[String] = Nil
      (optional) a partitioner for key-value RDDs
        val partitioner: Option[Partitioner] = None
  • 27. RDDs Example: sparkContext.textFile("hdfs://...") ● HadoopRDD ○ getPartitions = HDFS blocks ○ getDependencies = None ○ compute = load block in memory ○ getPreferredLocations = HDFS block locations ○ partitioner = None ● MapPartitionsRDD ○ getPartitions = same as parent ○ getDependencies = parent RDD ○ compute = compute parent and apply map() ○ getPreferredLocations = same as parent ○ partitioner = None
  • 28. RDD Operations ● Transformations ○ apply user function to every element in a partition (or to the whole partition) ○ apply aggregation function to the whole dataset (groupBy, sortBy) ○ introduce dependencies between RDDs to form DAG ○ provide functionality for repartitioning (repartition, partitionBy) ● Actions ○ trigger job execution ○ used to materialize computation results ● Extra: persistence ○ explicitly store RDDs in memory, on disk or off-heap (cache, persist) ○ checkpointing for truncating RDD lineage
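The split between lazy transformations and job-triggering actions on slide 28 can be mimicked with a plain Scala `Iterator`. This is only an analogy and not Spark code: `map` and `filter` on an iterator build a pipeline without touching the data, and `toList` plays the role of an action. The object and method names are hypothetical.

```scala
object LazinessSketch {
  // Returns (elements evaluated before the "action",
  //          elements evaluated after it,
  //          the materialized result).
  def demo(): (Int, Int, List[Int]) = {
    var evaluated = 0
    // "Transformations": nothing runs yet, only the pipeline is described.
    val pipeline = (1 to 4).iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ > 4)
    val before = evaluated
    // "Action": materializing the result pulls every element through the pipeline.
    val result = pipeline.toList
    (before, evaluated, result)
  }
}
```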
  • 29. Execution workflow: rdd1.join(rdd2).groupBy(...).filter(...) ● splits graph into stages of tasks ● submits each stage as ready ● launches tasks via cluster manager ● retries failed or straggling tasks ● executes tasks ● stores and serves blocks
  • 30. Code sample: joining aggregated and raw data
      //aggregate events after specific date for given campaign
      val events = sc.cassandraTable("demo", "event")
        .map(_.toEvent)
        .filter(event => event.campaignId == campaignId && event.time.isAfter(watermark))
        .keyBy(_.eventType)
        .reduceByKey(_ + _)
        .cache()
      //aggregate campaigns by type
      val campaigns = sc.cassandraTable("demo", "campaign")
        .map(_.toCampaign)
        .filter(campaign => campaign.id == campaignId && campaign.time.isBefore(watermark))
        .keyBy(_.eventType)
        .reduceByKey(_ + _)
        .cache()
      //joined rollups and raw events
      val joinedTotals = campaigns.join(events)
        .map { case (key, (campaign, event)) => CampaignTotals(campaign, event) }
        .collect()
      //count totals separately
      val eventTotals = events.map { case (t, e) => s"$t -> ${e.value}" }.collect()
      val campaignTotals = campaigns.map { case (t, e) => s"$t -> ${e.value}" }.collect()
  • 31. DAG
  • 32. Dependency types ● Narrow (pipelineable) ○ each partition of the parent RDD is used by at most one partition of the child RDD ○ allow for pipelined execution on one cluster node ○ failure recovery is more efficient as only lost parent partitions need to be recomputed ● Wide (shuffle) ○ multiple child partitions may depend on one parent partition ○ require data from all parent partitions to be available and to be shuffled across the nodes ○ if some partition is lost from all the ancestors a complete recomputation is needed
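The narrow/wide distinction on slide 32 comes down to how many child partitions read each parent partition. A minimal sketch in plain Scala, assuming a dependency is described by a hypothetical `usage` map from parent partition id to the set of child partition ids that read it (this is not how Spark represents dependencies internally):

```scala
object DependencySketch {
  // Narrow: each parent partition is used by at most one child partition.
  def isNarrow(usage: Map[Int, Set[Int]]): Boolean =
    usage.values.forall(_.size <= 1)

  // Parent partitions that must be available to recompute one lost child partition.
  def parentsOf(child: Int, usage: Map[Int, Set[Int]]): Set[Int] =
    usage.collect { case (parent, children) if children(child) => parent }.toSet
}
```

For a map-like dependency a lost child partition needs a single parent partition, while for a shuffle it needs all of them, which is why recovery across a wide dependency is more expensive.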
  • 33. Stages and Tasks ● Stages breakdown strategy ○ check backwards from final RDD ○ add each “narrow” dependency to the current stage ○ create new stage when there’s a shuffle dependency ● Tasks ○ ShuffleMapTask partitions its input for shuffle ○ ResultTask sends its output to the driver
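The stage breakdown strategy on slide 33 can be sketched over a toy lineage graph. This is a simplified model and not Spark's DAGScheduler: `Node`, `Narrow`, `Shuffle`, and `buildStages` are hypothetical names, and a parent shared by several children would be visited more than once.

```scala
object StageSketch {
  // A dependency on a parent dataset is either narrow (pipelineable) or a shuffle.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep

  // A node of the toy lineage graph: a name plus its dependencies.
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Walk backwards from `last`: narrow parents join the current stage,
  // every shuffle parent becomes the last node of a new, earlier stage.
  def buildStages(last: Node): List[Set[String]] = {
    def collect(node: Node): (Set[String], List[Node]) =
      node.deps.foldLeft((Set(node.name), List.empty[Node])) {
        case ((members, roots), Narrow(p)) =>
          val (m, r) = collect(p)
          (members ++ m, roots ++ r)
        case ((members, roots), Shuffle(p)) =>
          (members, roots :+ p)
      }

    val (members, shuffleParents) = collect(last)
    // Parent stages have to run first, so they come earlier in the result.
    shuffleParents.flatMap(buildStages) :+ members
  }
}
```

For a chain such as read, map, reduceByKey, mapValues, the sketch yields two stages, with the cut placed at the shuffle introduced by reduceByKey.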
  • 34. Shuffle ● Shuffle Write ○ redistributes data among partitions and writes files to disk ○ each shuffle task creates one file with regions assigned to reducer ○ sort shuffle uses in-memory sorting with spillover to disk to get final result ● Shuffle Read ○ fetches the files and applies reduce() logic ○ if data ordering is needed then it is sorted on “reducer” side for any type of shuffle
  • 35. Sort Shuffle ● Incoming records are accumulated and sorted in memory according to their target partition ids ● Sorted records are written to a file, or to multiple files if spilled, which are then merged ● Index file stores offsets of the data blocks in the data file ● Sorting without deserialization is possible under certain conditions (SPARK-7081)
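The data file plus index file layout of the sort shuffle on slide 35 can be sketched in memory. This is a toy model under stated assumptions: vectors stand in for files, offsets count records rather than bytes, there is no spilling or merging, and all names (`partitionOf`, `write`, `read`) are hypothetical.

```scala
object SortShuffleSketch {
  // Non-negative modulo, so negative hash codes still map to a valid partition.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Returns the "data file" (records ordered by target partition id) and the
  // "index file": numPartitions + 1 offsets, where partition p occupies
  // data.slice(index(p), index(p + 1)).
  def write(records: Seq[(String, Int)], numPartitions: Int): (Vector[(String, Int)], Vector[Int]) = {
    val data = records.sortBy { case (key, _) => partitionOf(key, numPartitions) }.toVector
    val counts = Array.fill(numPartitions)(0)
    data.foreach { case (key, _) => counts(partitionOf(key, numPartitions)) += 1 }
    val index = counts.scanLeft(0)(_ + _).toVector
    (data, index)
  }

  // What the "reducer" responsible for partition p would fetch.
  def read(data: Vector[(String, Int)], index: Vector[Int], p: Int): Vector[(String, Int)] =
    data.slice(index(p), index(p + 1))
}
```

The index lets each reducer fetch only its own contiguous region of the single data file, which is the reason one file per map task is enough.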
  • 37. Memory Management in Spark 1.6 ● Execution Memory ○ storage for data needed during tasks execution ○ shuffle-related data ● Storage Memory ○ storage of cached RDDs and broadcast variables ○ possible to borrow from execution memory (spill otherwise) ○ safeguard value is 0.5 of Spark Memory when cached blocks are immune to eviction ● User Memory ○ user data structures and internal metadata in Spark ○ safeguarding against OOM ● Reserved memory ○ memory needed for running executor itself and not strictly related to Spark
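The memory regions on slide 37 can be sized with simple arithmetic. The sketch below assumes what are believed to be the Spark 1.6 defaults (300 MB reserved, `spark.memory.fraction` = 0.75, `spark.memory.storageFraction` = 0.5); these values are assumptions to verify against the configuration documentation of the Spark version in use, and `Regions` and `regions` are hypothetical names.

```scala
object UnifiedMemorySketch {
  final case class Regions(
    reservedMb: Double,         // memory set aside for the executor itself
    userMb: Double,             // user data structures and internal metadata
    sparkMb: Double,            // execution + storage, shared between the two
    storageSafeguardMb: Double  // part of sparkMb where cached blocks are immune to eviction
  )

  // Parameter defaults are assumed Spark 1.6 values, not read from any real configuration.
  def regions(
    heapMb: Double,
    memoryFraction: Double = 0.75,
    storageFraction: Double = 0.5,
    reservedMb: Double = 300.0
  ): Regions = {
    val usable = heapMb - reservedMb
    val spark = usable * memoryFraction
    Regions(reservedMb, usable - spark, spark, spark * storageFraction)
  }
}
```

Under these assumptions a 4096 MB executor heap leaves 2847 MB of Spark Memory, of which 1423.5 MB is the storage safeguard, and 949 MB of User Memory.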
  • 38. Execution Modes ● spark-shell --master [ local | spark | yarn-client | mesos] ○ launches REPL connected to specified cluster manager ○ always runs in client mode ● spark-submit --master [ local | spark:// | mesos:// | yarn ] spark-job.jar ○ launches assembly jar on the cluster ● Masters ○ local[k] - run Spark locally with K worker threads ○ spark - launches driver app on Spark Standalone installation ○ mesos - driver will spawn executors on Mesos cluster (deploy-mode: client | cluster) ○ yarn - same idea as with Mesos (deploy-mode: client | cluster) ● Deploy Modes ○ client - driver executed as a separate process on the machine where it has been launched and spawns executors ○ cluster - driver launched as a container using underlying cluster manager
  • 40. Cluster Resource Managers: Requirements ● Efficiency ○ efficient sharing of resources across applications ○ utilization of cluster resources in the most optimal manner ● Flexibility ○ support of wide array of current and future frameworks ○ dealing with hardware heterogeneity ○ support of resource requests of different types ● Scalability ○ scaling to clusters of tens of thousands of nodes ○ scheduling system’s response times must remain acceptable while increasing number of machines and applications ● Robustness ○ fault-tolerance guarantees for the system and applications ○ high availability of central scheduler component
  • 41. Cluster Manager Architectures source: Omega: flexible, scalable schedulers for large compute clusters
  • 42. Mesos Architecture ● Master ○ a mediator between slave resources and frameworks ○ enables fine-grained sharing of resources by making resource offers ● Slave ○ manages resources on physical node and runs executors ● Framework ○ application that solves a specific use case ○ Scheduler negotiates with master and handles resource offers ○ Executors consume resources and run tasks on slaves
  • 43. Two-Level Scheduling ● Slave nodes report the amount of available resources to the Master ● Allocation module starts offering resources to frameworks ● Framework receives offers ○ if resources do not satisfy its needs - rejects the offer ○ if resources satisfy its demands - creates list of tasks and sends it to the master ● Master verifies tasks and forwards them to the executor (and launches the executor if it’s not running)
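The offer cycle on this slide can be modelled without any Mesos dependency. The sketch below is a toy model, not the Mesos API: `Offer`, `Task`, `Framework`, `FixedSizeFramework` and `offerCycle` are hypothetical names, and frameworks are offered resources in a fixed order, whereas the real allocation module decides the order.

```scala
object TwoLevelSchedulingSketch {
  final case class Offer(slave: String, cpus: Double, memMb: Double)
  final case class Task(name: String, cpus: Double, memMb: Double)

  // Second level: each framework decides what to do with an offer.
  // An empty result means the offer is declined.
  trait Framework {
    def resourceOffer(offer: Offer): Seq[Task]
  }

  // Launches one fixed-size task per offer when the offer is large enough.
  final class FixedSizeFramework(name: String, cpus: Double, memMb: Double)
      extends Framework {
    def resourceOffer(offer: Offer): Seq[Task] =
      if (offer.cpus >= cpus && offer.memMb >= memMb)
        Seq(Task(s"$name@${offer.slave}", cpus, memMb))
      else Nil
  }

  // Master-side check that the accepted tasks really fit into the offer.
  private def fits(tasks: Seq[Task], offer: Offer): Boolean =
    tasks.map(_.cpus).sum <= offer.cpus && tasks.map(_.memMb).sum <= offer.memMb

  // First level: the master offers each slave's resources to frameworks in turn
  // and forwards the tasks of the first framework that accepts.
  def offerCycle(offers: Seq[Offer], frameworks: Seq[Framework]): Seq[Task] =
    offers.flatMap { offer =>
      frameworks.iterator
        .map(_.resourceOffer(offer))
        .find(tasks => tasks.nonEmpty && fits(tasks, offer))
        .getOrElse(Nil)
    }
}
```

The point of the split is that the master only knows how to hand out resources, while the decision of what to run on them stays inside each framework.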
  • 44. Resource offer id: { value: "0cb2328a-61c2-4316-91ef-cbbb6ebbf504-O1" } framework_id: { value: "0cb2328a-61c2-4316-91ef-cbbb6ebbf504-0001" } slave_id: { value: "0cb2328a-61c2-4316-91ef-cbbb6ebbf504-S0" } hostname: "mesos-slave" resources { name: "cpus", type: SCALAR, scalar { value: 6.0 }, role: "*" } resources { name: "mem", type: SCALAR, scalar { value: 6762.0 }, role: "*" } resources { name: "disk", type: SCALAR, scalar { value: 13483.0 }, role: "*" } resources { name: "ports", type: RANGES, ranges { range { begin: 31000, end: 32000 } }, role: "*" } url { scheme: "http" address { hostname: "mesos-slave" ip: "172.18.0.5" port: 5151 } path: "/slave(1)" }
  • 45. Framework Scheduler
    class SomeMesosScheduler extends Scheduler {
      // the Mesos Java API passes java.util collections; Scala/Java conversions are omitted for brevity
      override def resourceOffers(driver: SchedulerDriver, offers: List[Offer]): Unit = {
        for (offer <- offers) {
          stateLock.synchronized {
            if (isOfferValid(offer)) {
              val executorInfo = buildExecutorInfo(driver, "Executor A")
              // amount of tasks is calculated to fully use resources from the offer
              val tasks = buildTasks(offer, executorInfo)
              driver.launchTasks(List(offer.getId), tasks)
            } else {
              driver.declineOffer(offer.getId)
            }
          }
        }
      }
      // rest of the methods implementations go here
    }
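The slide does not show how `buildTasks` sizes the task list. One plausible reading of "fully use resources from the offer" is sketched below, assuming identical tasks with a fixed CPU and memory demand. `OfferSizing` and `tasksPerOffer` are hypothetical names, not part of the Mesos API or of the original scheduler.

```scala
object OfferSizing {
  // How many identical tasks fit into an offer, limited by the scarcer resource.
  def tasksPerOffer(
      offerCpus: Double,
      offerMemMb: Double,
      taskCpus: Double,
      taskMemMb: Double): Int = {
    require(taskCpus > 0 && taskMemMb > 0, "task demands must be positive")
    math.min(
      math.floor(offerCpus / taskCpus),
      math.floor(offerMemMb / taskMemMb)
    ).toInt
  }
}
```

With the offer from the previous slide (6 CPUs, 6762 MB) and tasks of 1 CPU and 1024 MB, six tasks fit. With 1 CPU and 2048 MB per task, memory becomes the limit and only three fit.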
  • 46. Dominant Resource Fairness (DRF) ● Dominant resource ○ a resource of specific type (cpu, ram, etc.) which is most demanded by a framework among other resources it needs ○ the resource is identified as a share of the total cluster resources of the same type ● Dominant share ○ a share of dominant resource allocated to a framework in the cluster ● Example: ○ Cluster total: 9 CPU & 18 GB RAM ○ Framework A tasks need < 3 CPU, 1 GB > (or < 33% CPU, 5.6% RAM >) ○ Framework B tasks need < 1 CPU, 4 GB > (or < 11% CPU, 22% RAM >) ● DRF algorithm computes frameworks’ dominant shares and tries to maximize the smallest dominant share in the system
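The example on this slide can be replayed with a minimal sketch of the allocation loop: repeatedly pick the framework with the smallest dominant share and give it one more task until the next task no longer fits. This follows the basic algorithm from the DRF paper as far as it is described here, not Mesos' actual allocator. `DrfSketch`, `Res` and `allocate` are hypothetical names.

```scala
import scala.annotation.tailrec

object DrfSketch {
  // A resource vector: CPUs and memory in GB.
  final case class Res(cpus: Double, memGb: Double) {
    def +(o: Res): Res = Res(cpus + o.cpus, memGb + o.memGb)
    def fitsIn(cap: Res): Boolean = cpus <= cap.cpus && memGb <= cap.memGb
  }

  // The largest share a framework holds of any single resource type.
  def dominantShare(allocated: Res, total: Res): Double =
    math.max(allocated.cpus / total.cpus, allocated.memGb / total.memGb)

  // Returns the number of tasks launched per framework.
  def allocate(total: Res, demands: Map[String, Res]): Map[String, Int] = {
    require(demands.nonEmpty, "at least one framework is required")

    @tailrec
    def loop(used: Res, allocated: Map[String, Res], tasks: Map[String, Int]): Map[String, Int] = {
      // pick the framework with the smallest dominant share
      val next = allocated.minBy { case (_, res) => dominantShare(res, total) }._1
      val demand = demands(next)
      if ((used + demand).fitsIn(total))
        loop(
          used + demand,
          allocated.updated(next, allocated(next) + demand),
          tasks.updated(next, tasks(next) + 1))
      else tasks // cluster is full for the neediest framework: stop
    }

    val zero = Res(0, 0)
    loop(zero, demands.map { case (n, _) => n -> zero }, demands.map { case (n, _) => n -> 0 })
  }
}
```

For the cluster above, `DrfSketch.allocate(Res(9, 18), Map("A" -> Res(3, 1), "B" -> Res(1, 4)))` ends with two tasks for A (6 of 9 CPUs) and three tasks for B (12 of 18 GB), so both frameworks reach a dominant share of 2/3.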
  • 47. DRF Demo ● 3 frameworks with < 8% CPU, 7.5% RAM > demand each ● Framework A < 33% CPU, 15% RAM >, Framework B < 16% CPU, 30% RAM > ● Framework A < 33% CPU, 15% RAM >, Framework B < 16% CPU, 36% RAM >
  • 48. DRF properties ● Sharing incentive ○ Each user should be better off sharing the cluster, than exclusively using her own partition of the cluster. Consider a cluster with identical nodes and n users. Then a user should not be able to allocate more tasks in a cluster partition consisting of 1/n of all resources. ● Strategy-proofness ○ Users should not be able to benefit by lying about their resource demands. This provides incentive compatibility, as a user cannot improve her allocation by lying. ● Envy-freeness ○ A user should not prefer the allocation of another user. This property embodies the notion of fairness. ● Pareto efficiency ○ It should not be possible to increase the allocation of a user without decreasing the allocation of at least another user. This property is important as it leads to maximizing system utilization subject to satisfying the other properties. source: Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
  • 49. Resource Reservation ● Goals: ○ allocate all resources of a single slave to one type of framework ○ divide cluster between several framework types or organisations ○ framework groups prioritization and guaranteed allocation ● Static reservation ○ slave node is configured on start (cannot be reserved for another role or unreserved) --resources="cpus:4;mem:2048;cpus(spark):8;mem(spark):4096" ● Dynamic reservation ○ resources are reserved/unreserved in response to a resource offer Offer::Operation::Reserve ○ MESOS-2018 ● Extras: ○ persistent volumes ○ multiple disk resources
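To show how the static `--resources` string splits a slave between roles, here is a small parser sketch. It only handles scalar entries of the form `name:value` or `name(role):value` as they appear on the slide; ranges such as ports are not covered. Unreserved entries are mapped to the role `*`, matching the `role: "*"` fields in the resource offer slide. `ResourceSpec` and `Reservation` are hypothetical names, not Mesos code.

```scala
object ResourceSpec {
  final case class Reservation(name: String, role: String, value: Double)

  // name, optional "(role)", then a numeric value
  private val Entry = """(\w+)(?:\((\w+)\))?:([\d.]+)""".r

  def parse(spec: String): Seq[Reservation] =
    spec.split(";").toSeq.map(_.trim).filter(_.nonEmpty).map {
      case Entry(name, role, value) =>
        // an unmatched optional group is null, which means "no role given"
        Reservation(name, Option(role).getOrElse("*"), value.toDouble)
      case other =>
        throw new IllegalArgumentException(s"Unsupported resource entry: $other")
    }

  // Total amount of one resource reserved for one role.
  def total(reservations: Seq[Reservation], role: String, name: String): Double =
    reservations.filter(r => r.role == role && r.name == name).map(_.value).sum
}
```

Parsing the string from the slide yields 4 CPUs and 2048 MB for any role, plus 8 CPUs and 4096 MB that only frameworks registered with the `spark` role receive in offers.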
  • 50. Resource Isolation ● Goals: ○ running tasks isolation and capping of runtime resources ○ programmatic control over task resources ○ use images to allow different environments ● Docker containerizer ○ executed tasks are docker containers (e.g. microservices packed in Docker) ● Mesos containerizer (default) ○ Mesos-native (no dependencies on other technologies) ○ provides fine-grained controls (cgroups/namespaces) ○ provides disk usage limits controls ● Composing ○ allows using multiple containerizers together ○ the first containerizer supporting task configuration will be used to launch it
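The composing rule ("the first containerizer supporting task configuration will be used") amounts to an ordered lookup. The sketch below models it with a deliberately simplified support check: a task with a Docker image goes to the Docker containerizer, anything else to the Mesos one. That rule and all names (`ContainerizerComposition`, `TaskConfig`, `select`) are assumptions for illustration, not the actual Mesos implementation.

```scala
object ContainerizerComposition {
  final case class TaskConfig(name: String, dockerImage: Option[String])

  trait Containerizer {
    def name: String
    def supports(task: TaskConfig): Boolean
  }

  object DockerContainerizer extends Containerizer {
    val name = "docker"
    // simplified rule: only tasks that specify a Docker image
    def supports(task: TaskConfig): Boolean = task.dockerImage.isDefined
  }

  object MesosContainerizer extends Containerizer {
    val name = "mesos"
    // simplified rule: only tasks without a Docker image
    def supports(task: TaskConfig): Boolean = task.dockerImage.isEmpty
  }

  // The first containerizer in the configured order that supports the task wins.
  def select(order: Seq[Containerizer], task: TaskConfig): Option[Containerizer] =
    order.find(_.supports(task))
}
```

The order matters only when more than one containerizer could launch the same task; a task that none of them supports cannot be launched at all, which `select` expresses as `None`.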
  • 51. Ubiquitous frameworks: Marathon ● distributed init.d ● long running tasks execution ● HA mode with ZooKeeper ● Docker executor ● REST API
  • 52. Marathon: launching Chronos in Docker curl -XPOST 'http://marathon:8080/v2/apps' -H 'Content-Type: application/json' -d '{ "id": "chronos", "container": { "type": "DOCKER", "docker": { "network": "HOST", "image": "datastrophic/chronos:mesos-0.27.1-chronos-2.5", "parameters": [ { "key": "env", "value": "CHRONOS_HTTP_PORT=4400" }, { "key": "env", "value": "CHRONOS_MASTER=zk://zookeeper:2181/mesos" }, { "key": "env", "value": "CHRONOS_ZK_HOSTS=zookeeper:2181"} ] } }, "ports": [ 4400 ], "cpus": 1, "mem": 512, "instances": 1 }'
  • 53. Ubiquitous frameworks: Chronos ● distributed cron ● HA mode with ZooKeeper ● supports graphs of jobs ● sensitive to network failures
  • 54. More Mesos frameworks ● Spark ● Hadoop ● Cassandra ● Kafka ● Myriad: YARN on Mesos ● Storm ● Samza
  • 55. Spark on Mesos ● Coarse-grained mode (default) ○ Spark Executor is launched one per Slave and acquires all available cores in cluster ○ Tasks are scheduled by Spark relying on its RPC mechanism (Akka) ● Fine-grained mode ○ Spark Executor is launched one per Slave with minimal resources needed (1 core) ○ Spark tasks are executed as Mesos tasks and use Mesos semantics
  • 56. Spark on Mesos ● Coarse-grained mode: /opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos --conf "spark.cores.max=10" /opt/jobs/spark-jobs-assembly.jar ● Fine-grained mode: /opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos --conf "spark.mesos.coarse=false" /opt/jobs/spark-jobs-assembly.jar
  • 57. Spark on Mesos vs. YARN ● Mesos (coarse-grained): /opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos --conf "spark.cores.max=100" /opt/jobs/spark-jobs-assembly.jar ● YARN: /opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master yarn --num-executors 25 --executor-cores 4 /opt/jobs/spark-jobs-assembly.jar
  • 58. Running Spark via Marathon curl -XPOST 'http://marathon:8080/v2/apps' -H 'Content-Type: application/json' -d '{ "cmd": "/opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos --deploy-mode client /opt/jobs/spark-jobs-assembly.jar", "id": "spark-pi", "cpus": 1, "mem": 1024, "instances": 1 }'
  • 59. Running Spark via Chronos curl -L -H 'Content-Type: application/json' -X POST http://mesos:4400/scheduler/iso8601 -d '{ "name": "Scheduled Spark Submit Job", "command": "/opt/spark/bin/spark-submit --class io.datastrophic.demo.SparkJob --master mesos://zk://zookeeper:2181/mesos /opt/jobs/spark-jobs-assembly.jar", "shell": true, "async": false, "cpus": 0.1, "disk": 256, "mem": 1024, "owner": "anton@datastrophic.io", "description": "Spark Job executed every 3 minutes", "schedule": "R/2016-03-14T12:35:00.000Z/PT3M" }'
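The `schedule` field is an ISO 8601 repeating interval: a repeat marker, a start time and a period. The sketch below uses only `java.time` to list the first few fire times. It is an illustration, not Chronos code: `Iso8601Schedule` and `fireTimes` are hypothetical names, the repeat count after `R` is ignored, and calendar periods such as `P1M` are not handled because `Duration.parse` accepts only day- and time-based durations.

```scala
import java.time.{Duration, Instant}

object Iso8601Schedule {
  // Splits "R[n]/<start>/<period>" and returns the first `count` fire times.
  def fireTimes(schedule: String, count: Int): Seq[Instant] =
    schedule.split("/") match {
      case Array(_, start, period) =>
        val first = Instant.parse(start)
        val step = Duration.parse(period)
        (0 until count).map(i => first.plus(step.multipliedBy(i.toLong)))
      case _ =>
        throw new IllegalArgumentException(s"Unsupported schedule: $schedule")
    }
}
```

For the schedule above, `Iso8601Schedule.fireTimes("R/2016-03-14T12:35:00.000Z/PT3M", 3)` gives 12:35, 12:38 and 12:41 UTC on 2016-03-14, consistent with the job description "executed every 3 minutes".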
  • 60. Spark deployment strategies ● Binaries distribution ○ every node in the cluster must have Spark libraries installed in the same locations ○ pros: easy to start with ○ cons: hard to upgrade, hard to have several Spark versions simultaneously ● Edge nodes ○ use nodes with specific environment setup which are reachable from Mesos cluster and keep Spark executor jars in accessible location like S3, HTTP or HDFS ○ pros: easy to use multiple Spark versions, minimal dependencies on Mesos ○ cons: hard to maintain in case of multi-tenancy ● Dockerized environment ○ instead of edge nodes, use Docker containers with environment configured for specific needs (hosts still need to be reachable from Mesos cluster) and use Docker Spark executor ○ pros: highly isolated environments for specific needs, could be upgraded independently, zero impact on cluster nodes ○ cons: could be hard to set up and configure properly
  • 61. Mesos Framework Walkthrough ● Throttler ○ a demo framework for load testing Cassandra ○ load intensity is controlled by parameters: total queries, queries per task and parallelism (how many Mesos tasks to run in parallel) ● Goals ○ take a look at a working (simple) Mesos application ○ see how Scheduler, Executor and framework launcher could be implemented ● Sources: ○ source code and dockerized Mesos cluster configuration are available at github/datastrophic/mesos-workshop ○ all the examples (and even more) are available there as well