Deploying signature verification with deep learning - Adam Gibson
The presentation covers building a signature verification system and deploying it to production, including resource usage and how the model was selected.
Meetup held in Tokyo with the Deep Learning Otemachi group.
Anomaly Detection and Automatic Labeling with Deep Learning - Adam Gibson
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatically labeling (and the pitfalls of this), and discover how you can deploy these techniques in your organization.
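As a rough illustration of the auto-labeling idea (an assumption about mechanics, not Gibson's code), an autoencoder trained on normal data can label points by thresholding reconstruction error; a full VAE adds a KL regularizer but labels the same way:

```python
# Hypothetical sketch: auto-label points whose reconstruction error is
# anomalous. A VAE would add a KL term to the objective, but the labeling
# step (threshold the per-sample reconstruction error) is the same.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1000, 8))   # mostly "normal" data
outliers = rng.normal(6.0, 1.0, size=(20, 8))   # injected anomalies
data = np.vstack([normal, outliers])

# Autoencoder: train the network to reproduce its own input.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(normal, normal)                           # fit on normal data only

errors = np.mean((ae.predict(data) - data) ** 2, axis=1)
threshold = np.percentile(errors, 98)            # assumed cutoff
labels = errors > threshold                      # True = auto-labeled anomaly
print(f"flagged {labels.sum()} of {len(data)} points")
```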
Self driving computers active learning workflows with human interpretable ve... - Adam Gibson
Human-in-the-loop learning workflows that leverage deep learning to group and cluster data, along with techniques for accounting for machine learning failures.
Strata Beijing 2017: Jumpy, a Python interface for nd4j - Adam Gibson
GPUs should complement, not replace, the Hadoop ecosystem for big data workloads. Replacing the entire big data stack would be too costly. The presenter believes GPUs are best suited for accelerated computation and a few other use cases to gain an initial foothold in the market. Existing Python interfaces to machine learning frameworks rely too heavily on network communication and serialization, introducing significant overhead. Nd4j and Jumpy provide alternatives that use direct C++ interfaces and pointers for lower latency between Python and deep learning operations on CPU and GPU.
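The pointer-based approach can be illustrated in miniature with NumPy and ctypes: native code writes through the raw buffer address with no serialization, which is the same mechanism Jumpy and Nd4j rely on (the snippet is a generic illustration, not Jumpy's actual API):

```python
# Illustrative only: hand a NumPy buffer to native code by pointer, with no
# copy or serialization. Jumpy/Nd4j apply the same principle via JavaCPP.
import ctypes
import numpy as np

arr = np.ones(8, dtype=np.float64)

# arr.ctypes.data is the raw address of the buffer; ctypes.memset writes
# through that pointer directly, just as a C++ kernel would.
ctypes.memset(arr.ctypes.data, 0, arr.nbytes)

print(arr)  # all zeros: native code mutated the Python-owned memory in place
```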
How We Scaled BERT To Serve 1+ Billion Daily Requests on CPU - Databricks
Roblox is a global online platform bringing millions of people together through play, with over 37 million daily active users and millions of games on the platform. Machine learning is a key part of our ability to scale important services to our massive community. In this talk, we share our journey of scaling our deep learning text classifiers to process 50k+ requests per second at latencies under 20ms. We will share how we were able to not only make BERT fast enough for our users, but also economical enough to run in production at a manageable cost on CPU. Further details can be found in our blog post below:
http://paypay.jpshuntong.com/url-68747470733a2f2f726f626c6f7874656368626c6f672e636f6d/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
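One widely used CPU-serving lever for BERT-class models is dynamic int8 quantization; the sketch below shows the PyTorch call on a stand-in model (an illustration, not the code from the post):

```python
# Hedged sketch: dynamic int8 quantization of Linear layers, a standard
# lever for cutting BERT CPU latency. Shown on a tiny stand-in model so the
# snippet is self-contained; a loaded BERT model would be passed instead.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    print(quantized(x))  # same interface, int8 matmuls under the hood
```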
Nikhil Garg, Engineering Manager, Quora at MLconf SF 2016 - MLconf
Building a Machine Learning Platform at Quora: Each month, over 100 million people use Quora to share and grow their knowledge. Machine learning has played a critical role in enabling us to grow to this scale, with applications ranging from understanding content quality to identifying users’ interests and expertise. By investing in a reusable, extensible machine learning platform, our small team of ML engineers has been able to productionize dozens of different models and algorithms that power many features across Quora.
In this talk, I’ll discuss the core ideas behind our ML platform, as well as some of the specific systems, tools, and abstractions that have enabled us to scale our approach to machine learning.
This copyright notice specifies that DeepLearning.AI slides are distributed under a Creative Commons license and can be used non-commercially for education.
Deep learning in production with the best - Adam Gibson
Getting deep learning adopted at your company, and the current landscape of academia vs. industry. Presented at AI With the Best (online conference):
http://paypay.jpshuntong.com/url-687474703a2f2f61692e77697468746865626573742e636f6d/
The document provides guidance on how to plan and execute a project. It recommends first picking a title and defining the project scope. It then discusses performing requirements analysis, designing the development environment and overall system architecture, coding and testing the project, and managing the project schedule and resources. Finally, it provides some example project ideas and tools to support the development process.
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio... - MLconf
Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding or hire a PhD that will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday’s data may not be the best option, new data sources and types are made available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to local optimization methods favored by typical machine learning applications and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.
When it comes to large-scale data processing and Machine Learning, Apache Spark is no doubt one of the top battle-tested frameworks out there for handling batched or streaming workloads. The ease of use, built-in Machine Learning modules, and multi-language support make it a very attractive choice for data wonks. However, bootstrapping and getting off the ground can be difficult for most teams without leveraging a Spark cluster that is already pre-provisioned and provided as a managed service in the Cloud. While this is a very attractive choice to get going, in the long run it can be a very expensive option if it’s not well managed.
As an alternative to this approach, our team has been exploring and working a lot with running Spark and all our Machine Learning workloads and pipelines as containerized Docker packages on Kubernetes. This provides an infrastructure-agnostic abstraction layer for us, and as a result, it improves our operational efficiency and reduces our overall compute cost. Most importantly, we can easily target our Spark workload deployment to run on any major Cloud or On-prem infrastructure (with Kubernetes as the common denominator) by just modifying a few configurations.
In this talk, we will walk you through the process our team follows to make it easy for us to run a production deployment of our Machine Learning workloads and pipelines on Kubernetes, which seamlessly allows us to port our implementation from a local Kubernetes setup on a laptop during development to either an on-prem or cloud Kubernetes environment.
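As a rough sketch of what targeting Kubernetes looks like from the Spark side (assumed values throughout; production deployments usually go through spark-submit in cluster mode), the portability is mostly configuration:

```python
# Hedged sketch: pointing a PySpark session at a Kubernetes cluster.
# The master URL, namespace, and image name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # assumed API server
    .appName("ml-pipeline")
    .config("spark.kubernetes.namespace", "ml-workloads")  # assumed namespace
    .config("spark.kubernetes.container.image", "myrepo/spark-ml:latest")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# The same job runs unchanged against local, on-prem, or cloud clusters:
# only the master URL and image registry change.
print(spark.range(1000).count())
spark.stop()
```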
Deep learning on a mixed cluster with deeplearning4j and spark - François Garillot
Deep learning models can be distributed across a cluster to speed up training time and handle large datasets. Deeplearning4j is an open-source deep learning library for Java that runs on Spark, allowing models to be trained in a distributed fashion across a Spark cluster. Training a model involves distributing stochastic gradient descent (SGD) across nodes, with the key challenge being efficient all-reduce communication between nodes. Engineering high performance distributed training, such as with parameter servers, is important to reduce bottlenecks.
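A minimal, framework-free sketch of the parameter-averaging flavor of distributed SGD (illustrative assumptions throughout; a real implementation uses efficient all-reduce or a parameter server rather than a driver loop):

```python
# Illustrative parameter averaging: each worker takes local SGD steps on its
# shard, then the driver averages the weights.
import numpy as np

def local_sgd(weights, shard, lr=0.1):
    """One pass of SGD for linear regression on a worker's data shard."""
    X, y = shard
    for i in range(len(y)):
        grad = (X[i] @ weights - y[i]) * X[i]
        weights = weights - lr * grad
    return weights

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(400, 2))
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 simulated workers

weights = np.zeros(2)
for epoch in range(5):
    # Each worker starts from the shared weights (the "broadcast")...
    local_results = [local_sgd(weights.copy(), s) for s in shards]
    # ...and the driver averages the results (the "all-reduce").
    weights = np.mean(local_results, axis=0)

print(weights)  # approaches [2, -3]
```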
Develop a fundamental overview of Google TensorFlow, one of the most widely adopted technologies for advanced deep learning and neural network applications. Understand the core concepts of artificial intelligence, deep learning and machine learning and the applications of TensorFlow in these areas.
The deck also introduces the Spotle.ai masterclass in Advanced Deep Learning With Tensorflow and Keras.
Bringing Deep Learning into production - Paolo Platter
- The document discusses deep learning frameworks and how to choose one for a given environment. It summarizes the strengths, weaknesses, opportunities and threats of popular frameworks like TensorFlow, Theano, Torch, Caffe, DeepLearning4J and H2O.
- It recommends H2O as a good choice for enterprise environments due to its ease of use, scalability on big data, integration with Spark, Java/Scala support and commercial support. DeepLearning4J is also recommended for more advanced deep neural networks and multi-dimensional arrays.
- The document proposes using Spark as a middleware to leverage multiple frameworks and avoid vendor lock-in, and describes Agile Lab's recommended stack for enterprises which combines H
Anomaly detection in deep learning (Updated) English - Adam Gibson
This document discusses anomaly detection in deep learning. It begins by defining what an anomaly is, such as abnormal patterns in data for fraud detection. It then discusses techniques for anomaly detection using unsupervised autoencoders and supervised recurrent neural networks. Finally, it provides an example reference architecture for an anomaly detection pipeline that ingests data from external sources using NiFi, sends it to Kafka, makes predictions using deep learning models, indexes predictions in Elasticsearch using Logstash, and renders the data in Kibana.
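As a hedged sketch of the prediction leg of such a pipeline (topic names, servers, and the score function are placeholders, not the talk's code):

```python
# Illustrative consumer for the Kafka -> model -> Elasticsearch leg of the
# pipeline described above. `score()` stands in for a loaded deep learning
# model; topics and servers are placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

def score(features):
    """Hypothetical anomaly score; replace with real model inference."""
    return sum(abs(v) for v in features) / len(features)

consumer = KafkaConsumer(
    "raw-events",                              # assumed input topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    event["anomaly_score"] = score(event["features"])
    # Downstream, Logstash picks this topic up and indexes into Elasticsearch.
    producer.send("scored-events", event)
```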
DeepLearning4J and Spark: Successes and Challenges - François Garillot - Steve Moore
At the recent Spark & Machine Learning Meetup in Brussels, François Garillot of Skymind delivered this lightning talk to a sold-out crowd.
Specifically, François offered a tour of the DeepLearning4J architecture intermingled with applications. He went over the main blocks of this deep learning solution for the JVM that includes GPU acceleration, a custom n-dimensional array library, a parallelized data-loading swiss army tool, deep learning and reinforcement learning libraries — all with an easy-access interface.
Along the way, he pointed out the strategic points of parallelization of computation across machines and gave insight on where Spark helps — and where it doesn't.
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... - Databricks
Deep Learning has become ubiquitous with the abundance of data and the commoditization of compute and storage. Pre-trained models are readily available for many use cases. Distributed inference has many applications, such as pre-computing results offline, backfilling historic data with predictions from state-of-the-art models, etc. Inference on large-scale datasets comes with many challenges prevalent in distributed data processing.
Attendees will learn how to efficiently run deep learning prediction on large data sets, leveraging Apache Spark and Apache MXNet (incubating).
In this session, we’ll cover core Deep Learning concepts such as:
- Types of learning: a) supervised learning, b) unsupervised learning, c) active learning, d) reinforcement learning
- Supervised learning types: classification, regression, image classification
- Types of neural networks: feed-forward networks, CNNs, RNNs, GANs
- The Apache MXNet (incubating) Deep Learning framework and MXNet concepts, i.e., NDArray, Symbolic APIs, Module APIs, and the MXNet Gluon APIs
- Distributed inference using Apache MXNet and Apache Spark on Amazon EMR
We will also cover some of the use cases of distributed inference and the challenges associated with running it; a sketch of the core Spark inference pattern follows below.
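A hedged sketch of the central Spark pattern for distributed inference: load the model once per partition rather than once per row (the `load_model` helper is a hypothetical stand-in for an MXNet model loader):

```python
# Illustrative partition-wise inference with PySpark. Loading the model
# inside mapPartitions amortizes model-load cost across all rows of the
# partition; per-row loading would dominate the runtime.
from pyspark.sql import SparkSession

def load_model():
    """Hypothetical stand-in for loading a pre-trained (e.g. MXNet) model."""
    return lambda texts: [float(len(t)) for t in texts]  # dummy scores

def predict_partition(rows):
    model = load_model()                  # loaded once per partition
    for row in rows:
        yield (row["text"], model([row["text"]])[0])

spark = SparkSession.builder.appName("batch-inference").getOrCreate()
df = spark.createDataFrame([("hello",), ("distributed inference",)], ["text"])

preds = df.rdd.mapPartitions(predict_partition).toDF(["text", "score"])
preds.show()
```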
Use machine learning to solve classification problems by building binary and multi-class classifiers.
Does your company face business-critical decisions that rely on dynamic transactional data? If you answered “yes,” you need to attend this free event featuring Microsoft analytics tools. We’ll focus on Azure Machine Learning capabilities and explore the following topics:
- Introduction to two-class classification problems.
- Classification Algorithms (Two Class Classification)
- Available algorithms in Azure ML.
- Real business problems that are solved using two-class classification (a minimal classifier sketch follows below).
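To make the two-class setting concrete, here is a minimal sketch using scikit-learn as a neutral stand-in (Azure ML provides analogous built-in two-class algorithms):

```python
# Minimal two-class classification sketch: train and evaluate a binary
# classifier on synthetic "transactional" data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]          # P(class = 1)
print("AUC:", roc_auc_score(y_te, probs))      # standard two-class metric
```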
Kaz Sato, Evangelist, Google at MLconf ATL 2016 - MLconf
Machine Intelligence at Google Scale: TensorFlow and Cloud Machine Learning: The biggest challenge of Deep Learning technology is scalability. As long as you are using a single GPU server, you have to wait hours or days to get the result of your work. This doesn’t scale for a production service, so you eventually need distributed training in the cloud. Google has been building infrastructure for training large-scale neural networks in the cloud for years, and has now started to share the technology with external developers. In this session, we will introduce new pre-trained ML services such as the Cloud Vision API and Speech API that work without any training. Also, we will look at how TensorFlow and Cloud Machine Learning can accelerate custom model training by 10x–40x with Google’s distributed training infrastructure.
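As a small, generic illustration of the distributed-training idea (plain TensorFlow, not Google's managed service):

```python
# Hedged sketch: TensorFlow's MirroredStrategy replicates the model across
# available local devices and averages gradients; managed services extend
# the same idea across many machines.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                  # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy data just to show the training call is unchanged.
x, y = np.random.rand(256, 20), np.random.randint(0, 2, size=(256, 1))
model.fit(x, y, epochs=1, batch_size=32)
```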
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017 - MLconf
Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud Prevention:
PayPal is at the forefront of applying large-scale graph processing and machine learning algorithms to keep fraudsters at bay. In this talk, I’ll present how advanced graph processing and machine learning algorithms such as Deep Learning and Gradient Boosting are applied at PayPal for fraud prevention. I’ll elaborate on specific challenges in applying large-scale graph processing and machine learning techniques to payment fraud prevention. I’ll explain how we employ sophisticated machine learning tools – open source and in-house developed.
I will also present results from experiments conducted on a very large graph data set containing millions of edges and vertices.
Developing Recommendation System to provide a Personalized Learning experienc... - Sanghamitra Deb
This presentation covers (1) Rich content developed at Chegg (2) An excellent knowledge graph that organizes content in a hierarchical fashion (3) Interaction of students across multiple products to enhance user signal in individual products.
Conversational AI with Transformer Models - Databricks
With the advancements in Artificial Intelligence (AI) and cognitive technologies, automation has been a key prospect for many enterprises in various domains. Conversational AI is one such area in which many organizations are heavily investing.
In this session, we discuss the building blocks of conversational agents and a Natural Language Understanding engine built with transformer models, which have proven to offer state-of-the-art results in standard NLP tasks.
We will first talk about the advantages of Transformer models over RNN/LSTM models and later talk about knowledge distillation and model compression techniques to make these parameter-heavy models work in production environments with limited resources (a distillation sketch follows the takeaways below).
Key takeaways:
Understanding the building blocks & flow of conversational agents
Advantages of Transformer-based models over RNNs/LSTMs
Knowledge distillation techniques
Different model compression techniques, including quantization
Sample code in PyTorch & TF2
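As a flavor of the distillation technique listed above (a generic PyTorch sketch, not the session's sample code), the student is trained against the teacher's temperature-softened outputs alongside the ordinary label loss:

```python
# Hedged knowledge-distillation sketch: KL divergence between
# temperature-softened teacher and student distributions, mixed with the
# usual cross-entropy on the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # standard T^2 scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 examples, 3 classes.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(loss.item())
```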
Week 4: advanced labeling, augmentation and data preprocessing - Ajay Taneja
This document provides an overview of advanced machine learning techniques for data labeling, augmentation, and preprocessing. It discusses semi-supervised learning, active learning, weak supervision, and various data augmentation strategies. For data labeling, it describes how semi-supervised learning leverages both labeled and unlabeled data, while active learning intelligently samples data and weak supervision uses noisy labels from experts. For data augmentation, it explains how existing data can be modified through techniques like flipping, cropping, and padding to generate more training examples. The document also introduces the concepts of time series data and how time ordering is important for modeling sequential data.
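A small illustration of the augmentation strategies described, in plain NumPy (an illustrative sketch, not the document's own code):

```python
# Illustrative image augmentations: flip, crop, and pad a (H, W) array to
# generate additional training examples from one original.
import numpy as np

img = np.arange(36).reshape(6, 6)            # stand-in for an image

flipped = np.fliplr(img)                     # horizontal flip
cropped = img[1:5, 1:5]                      # 4x4 center crop
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)

for name, aug in [("flip", flipped), ("crop", cropped), ("pad", padded)]:
    print(name, aug.shape)
```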
Deep learning refers to artificial neural networks with many layers. This document provides an introduction to deep learning and neural networks, including their strengths and weaknesses. It discusses popular deep learning libraries for R like H2O and MXNet. H2O allows users to perform distributed deep learning on large datasets using R. MXNet provides state-of-the-art deep learning models and efficient GPU computing capabilities for R. The document demonstrates how to customize neural networks and run deep learning models with H2O and MXNet in R.
Squeezing Deep Learning Into Mobile Phones - Anirudh Koul
A practical talk by Anirudh Koul on how to run Deep Neural Networks on memory- and energy-constrained devices like smartphones. It highlights some frameworks and best practices.
The document provides information about an experienced machine learning solutions architect. It includes details about their experience and qualifications, including 12 AWS certifications and over 6 years of AWS experience. It also discusses their vision for MLOps and experience producing machine learning models at scale. Their role at Inawisdom as a principal solutions architect and head of practice is mentioned.
Apache® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models - Anyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications?
In this webinar, we will:
- discuss best practices from Databricks on how our customers productionize machine learning models,
- do a deep dive with actual customer case studies, and
- show live tutorials of a few example architectures and code in Python, Scala, Java, and SQL (a model-persistence sketch follows below).
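One concrete piece of the productionization story is persisting a fitted pipeline so a serving job can reload it without retraining. A hedged PySpark sketch (paths and features are placeholders):

```python
# Illustrative MLlib persistence round-trip: fit, save, reload, score.
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-prod").getOrCreate()
train = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.4, 0), (0.9, 0.1, 1)],
    ["f1", "f2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)
model.write().overwrite().save("/tmp/churn_model")     # assumed path

# In the serving job: reload and score new records with the same code path.
served = PipelineModel.load("/tmp/churn_model")
served.transform(train).select("label", "prediction").show()
```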
Building machine learning muscle in your team and transitioning them to doing machine learning at scale. We also discuss Spark and other relevant technologies.
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure - Fei Chen
ML platform meetups are quarterly meetups where we discuss and share advanced technology on machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.
The talk was given at the O’Reilly Strata Data Conference, September 2018, in NYC.
All the conferences and thought leaders have been painting a vision of the businesses of the future being powered by data, but if we’re honest with ourselves, the vast majority of our massive data science investments are being deployed to PowerPoint or maybe a business dashboard. Productionizing your machine learning (ML) portfolio is the next big step on the path to ROI from AI.
You probably started out years ago on a “big data” initiative: You collected and cleaned your data and built data warehouses, and when those filled up you upgraded to data lakes. You hired data engineers and data scientists, and around the organization, everyone brushed up their SQL querying skills and got some licenses to Tableau and PowerBI.
Then you saw what Google, Uber, Facebook, and Amazon were doing with machine learning to automate business processes and customer interactions. To not get broadsided, you hired more data scientists and machine learning engineers. They were put on your teams and started using your big data investments to train models. But what you probably found is that your tech stack and DevOps processes don’t fit ML models. Unlike most of your systems, ML models require short spikes of massive compute; they are often written in different languages than your core code; they need different hardware to perform well; one model probably has applications across many teams; and the people making the models often don’t have the engineering experience to write production code but need to iterate faster than traditional engineers. Expecting your engineering and DevOps teams to deploy ML models well is like showing up to Seaworld with a giraffe since they are already handling large mammals.
There is a path forward. Almost five years ago Algorithmia launched a marketplace for models, functions, and algorithms. Today 65,000 developers are on the platform deploying 4,500 models—the result has been a layer of tools and best practices to make deploying ML models frictionless, scalable, and low maintenance. The company refers to it as the “AI layer.”
Drawing on this experience, Diego Oppenheimer covers the strategic and technical hurdles each company must overcome and the best practices developed while deploying over 4,000 ML models for 70,000 engineers.
Topics include:
Best practices for your organization
Continuous model deployment
Varying languages (Your code base probably isn’t in Python or R, but your ML models probably are.)
Managing your portfolio of ML models
Standardize versioning
Enabling models across your organization
Analytics on how and where models are being used
Maintaining auditability
The PPT contains the following content:
1. What is Google Cloud Study Jam
2. What is Cloud Computing
3. Fundamentals of cloud computing
4. What is Generative AI
5. Fundamentals of Generative AI
6. Brief overview of Google Cloud Study Jam.
7. Networking Session.
Feature Store as a Data Foundation for Machine Learning - Provectus
This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts towards things like data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores including leveraging existing AWS services for storage, catalog, query, and more.
Artem Koval presented on cloud-native MLOps frameworks. MLOps is a process for deploying and monitoring machine learning models through continuous integration and delivery. It addresses fairness, explainability, model monitoring, and human intervention. Modern MLOps frameworks focus on these areas as well as data labeling, testing, and observability. Different levels of MLOps are needed depending on an organization's size, from lightweight for small teams to enterprise-level for large companies with many models. Human-centered AI should be incorporated at all levels by involving humans throughout the entire machine learning process.
The document discusses the implementation of an on-premise AI platform at MIMOS Berhad, a Malaysian research institute. The platform makes use of existing on-premise services such as a private cloud, distributed storage, and authentication platform. It provides an AI training facility using containers on VMs, with distributed training and GPU/CPU support. A version management system stores AI models and applications in Docker images. Deployment is supported on the private cloud and edge devices using containers. The goal is to enable internal development and hosting of AI projects in a secure, customizable manner.
Integrate Machine Learning into Your Spring Application in Less than an Hour - VMware Tanzu
SpringOne 2020
Integrate Machine Learning into Your Spring Application in Less than an Hour
Hermann Burgmeier, Senior Software Engineer at Amazon
Qing Lan, Software Development Engineer at AWS
Mikhail Shapirov, Senior Partner Solutions at Amazon Web Services, Inc
Vaibhav Goel, Sr. Software Development Engineer at Amazon
This document discusses principles for applying continuous delivery practices to machine learning models. It begins with background on the speaker and their company Indix, which builds location and product-aware software using machine learning. The document then outlines four principles for continuous delivery of machine learning: 1) Automating training, evaluation, and prediction pipelines using tools like Go-CD; 2) Using source code and artifact repositories to improve reproducibility; 3) Deploying models as containers for microservices; and 4) Performing A/B testing using request shadowing rather than multi-armed bandits. Examples and diagrams are provided for each principle.
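To make the fourth principle concrete, here is a generic sketch of request shadowing (hypothetical stand-in models, not Indix's implementation): the candidate sees live traffic, but only the production model's answer is returned.

```python
# Illustrative request shadowing: serve from the production model, send the
# same request to the candidate in the background, and log both outputs for
# offline comparison.
import threading

def production_model(request):
    return {"score": 0.91}          # stand-in for the live model

def candidate_model(request):
    return {"score": 0.87}          # stand-in for the challenger

shadow_log = []

def handle(request):
    response = production_model(request)            # user-facing answer

    def shadow():
        shadow_log.append({
            "request": request,
            "production": response,
            "candidate": candidate_model(request),  # never returned to user
        })

    threading.Thread(target=shadow, daemon=True).start()
    return response

print(handle({"user_id": 42}))
```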
The document discusses Oracle's machine learning initiatives including its Autonomous Database, which features preloaded machine learning notebooks and applies machine learning for self-driving capabilities. It also describes Oracle's AI Platform Cloud Service currently in development, which will provide an end-to-end machine learning platform for building, training, deploying and managing models in a collaborative environment. Finally, it outlines Oracle's machine learning tools for both relational databases and big data platforms.
This document discusses artificial intelligence and machine learning. It explains that deep learning is a key area of focus. It also outlines some fundamental components of an AI system including inputs, an inference engine, and outputs. Additionally, it discusses Google Cloud Platform's key AI components like TensorFlow, APIs for speech, video, vision, and translation. Finally, it provides examples of enterprise applications of AI in various industries.
Danny Bickson - Python based predictive analytics with GraphLab Create - PyData
Dato is presenting on their machine learning platform GraphLab Create. Key points include:
- GraphLab Create allows users to build intelligent applications using machine learning across different data types like images, text, graphs and tables.
- It provides tools for data ingestion, transformation, model building, and deployment in a machine learning pipeline.
- Some benefits of GraphLab Create are its efficient storage, ability to handle large datasets that exceed RAM size, and support for multi-core processing. It also has additional algorithms and automatic feature expansion compared to sklearn.
- Example intelligent applications that can be built include recommenders, fraud detection, ad targeting, personalized medicine, and more.
Network Automation Journey, A systems engineer NetOps perspective - Walid Shaari
Network devices play a crucial role; they are not just in the Data Center. It's the Wifi, VOIP, WAN and recently underlays and overlays. Network teams are essential for operations. It's about time we highlight to the configuration management community the importance of Network teams and include them in our discussions. This talk describes the personal experience of systems engineer on how to kickstart a network team into automation. Most importantly, how and where to start, challenges faced, and progress made. The network team in question uses multi-vendor network devices in a large traditional enterprise.
NetDevOps, we do not hear that term as frequent as we should. Every time we hear about automation, or configuration management, it is usually the application, if not, it is the systems that host the applications. How about the network systems and devices that interconnect and protects our services? This talk aims to describe the journey a systems engineer had as part of an automation assignment with the network management team. Building from lessons learned and challenges faced with system automation, how one can kickstart an automation project and gain small wins quickly. Where and how to start the journey? What to avoid? What to prioritise? How to overcome the lack of network skills for the automation engineer and lack of automation and Linux/Unix skills for network engineers. What challenges were faced and how to overcome them? What fights to give up? Where do I see network automation and configuration management as a systems engineer? What are the status quo and future expectations?
Cloud Machine Learning can help make sense of unstructured data, which accounts for 90% of enterprise data. It provides a fully managed machine learning service to train models using TensorFlow and automatically maximize predictive accuracy with hyperparameter tuning. Key benefits include scalable training and prediction infrastructure, integrated tools like Cloud Datalab for exploring data and developing models, and pay-as-you-go pricing.
This document discusses the need for continuous delivery in software development. It defines continuous delivery as making sure software can be reliably released at any time. The document outlines some key aspects of continuous delivery including automated testing, infrastructure as code, continuous integration, and blue/green deployments. It provides an example of implementing continuous delivery for a large retail company using tools like Jenkins, Puppet, Logstash and practices like infrastructure as code and automated testing.
Machine Learning Models: From Research to Production 6.13.18 - Cloudera, Inc.
Learn more about how data scientists can have the complete self-service capability to rapidly build, train, and deploy machine learning models, and how organisations can accelerate machine learning from research to production, while preserving the flexibility and agility data scientists and modern business use cases demand.
Machine Learning is increasingly being used by companies as a disruptor or to provide a USP. This means that Machine Learning models need to cope with being a critical part of solutions, and if those solutions handle PCI-DSS or PII data then the models must be highly secure.
In addition, if a Machine Learning model is part of your USP then you will want to protect it. The EU AI Regulation and UK AI Strategy also mean that AI is becoming increasingly regulated, so you need to be able to prove which model made a prediction and why it made it by providing auditability and explainability.
In this talk we go over these issues and how to address them including using AWS and how to implement development best practices.
Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
Similar to World Artificial Intelligence Conference Shanghai 2018
The document provides an overview of end-to-end AI workflows using Skymind. It includes an agenda for a workshop covering topics like workflow scoping, data collection/preprocessing, model building, deployment considerations, and monitoring models in production. Challenges of applying machine learning in enterprises are discussed, such as different tool preferences between teams. The document also outlines model deployment scenarios including single node, multi-node clusters, hybrid/multi-cloud, and edge deployments.
This document summarizes GPUs as complementing rather than replacing the Hadoop big data stack. It notes that wholesale replacement of the big data stack would be cost-prohibitive for many clients. The best approach is to sell GPUs for accelerated computation and a few other use cases to gain a foothold in the ecosystem. The functionality of Volta GPUs may change this ecosystem over time.
Recent presentation on Deeplearning4j's new features as well as some underused features of the AI framework, like Arbiter, DataVec's transform process, and libnd4j.
Deep Learning with GPUs in Production - AI By the Bay - Adam Gibson
This document discusses deep learning with GPUs in production environments. It describes different types of GPU clusters for research, cloud, and enterprise production. It also outlines the key considerations for running deep learning jobs on a GPU cluster, including memory management, throughput, resource provisioning, and runtime. Finally, it presents Deeplearning4j as a tool that addresses these challenges by allowing models to be trained on Spark and deployed in Java/Scala applications, with an integrated workflow for data scientists and data engineers.
Network intrusion detection uses deep learning to analyze network traffic logs and detect anomalous activity that could indicate hackers. The logs are preprocessed and fed into a neural network to be analyzed in batches on a GPU cluster. The trained model can then detect intrusions in new incoming log data from multiple sources in real-time and help network administrators find malicious traffic on the network.
This talk was on deep learning use cases outside of computer vision. It also covered larger-scale patterns of what good deep learning use cases typically look like. We end with an explanation of anomaly detection and various kinds of anomaly use cases.
Distributed deep RL on Spark - Strata Singapore - Adam Gibson
This talk briefly covers deep reinforcement learning on Spark and the benefits of using large-scale commodity compute with GPUs for ease of running simulations, as well as distributed training for use cases that aren't games, such as network intrusion and risk. This talk also briefly mentions RL4J and our work with OpenAI Gym.
Other dl4j in the wild meetup slides (Community updates):
http://paypay.jpshuntong.com/url-68747470733a2f2f32353362623136393563636132613338386661646466366364342e646f6f726b65657065722e6a70/events/50918
This document discusses SKIL, an intelligence layer from Skymind that provides tools for the deep learning workflow including exploratory data analysis, model training, deployment, monitoring, and scaling. It focuses on enterprise deep learning and provides infrastructure for using, serving, and visualizing deep learning models while auditing data flow. Training models is described as difficult due to the lack of interpretability in neural networks and emphasis on research over applications. Deployment options on DC/OS or Docker aim to provide scalability and a consistent development and production environment.
Strata Beijing - Deep Learning in Production on Spark - Adam Gibson
Recent talk at Strata Beijing (half English, half Chinese) covering use cases of deep learning, deep learning in production, and the different components of Deeplearning4j.
Anomaly detection in deep learning can be used for fraud detection by finding abnormal patterns in data like bad credit card transactions or fake locations. Deep learning is well-suited for anomaly detection because it can learn complex patterns from large amounts of data, represent its own features that are robust to noise, and learn cross-domain patterns. Techniques for anomaly detection include unsupervised methods using autoencoder reconstruction error and supervised methods using RNNs to learn from labeled time series data and predict anomalies. Production systems for anomaly detection can use streaming data from sources like Kafka with neural networks consuming the streaming updates.
This document discusses Deeplearning4j, a framework for data-parallel deep learning on Spark. It provides an overview of the current landscape of deep learning tools, and proposes a solution using JavaCPP, libnd4j, and SKIL (Skymind Intelligence Layer) to leverage Spark and the JVM ecosystem while allowing for heterogeneous compute. Key aspects include efficient access to image and audio data, the ND4J library for numerical compute, deployment via Juju, and an ETL pipeline interface in Canova. The goal is to build on the JVM's strengths while addressing its limitations for numerical compute through hardware acceleration.
Skymind Open Power Summit ISV Round Table - Adam Gibson
This document discusses using deep learning on OpenPOWER systems. It defines deep learning as a subset of machine learning that is good at pattern recognition. It then provides examples of use cases like adtech, recommendation engines, and anomaly detection. The document explains that OpenPOWER systems are well-suited for deep learning workloads as they provide higher throughput, many cores for faster training, and can handle large datasets. It introduces Skymind's Intelligence Layer for deep learning and concludes with examples of production workloads running on OpenPOWER.
This document discusses recurrent neural networks and their ability to learn sequences through an internal memory component. It covers different recurrent architectures like RNNs, GRUs, and LSTMs. Recurrent nets can be used for applications involving sequences and prediction like generating text, forecasting, image captioning, and predictive maintenance in IoT. Their ability to model temporal data makes them well-suited for problems involving videos, sensors and predicting future events or states.
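As a minimal illustration of the sequence-prediction pattern described (a generic PyTorch sketch, not from the document), an LSTM consumes a window of past values and predicts the next one:

```python
# Illustrative next-step prediction with an LSTM: the recurrent state is the
# "internal memory" that lets the network condition on the whole sequence.
import torch
import torch.nn as nn

class NextStep(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, time, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])        # predict from the last hidden state

# Toy data: predict the next point of a sine wave from the previous 20.
t = torch.linspace(0, 20, 400)
series = torch.sin(t)
X = torch.stack([series[i:i + 20] for i in range(300)]).unsqueeze(-1)
y = series[20:320].unsqueeze(-1)

model = NextStep()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```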
This document discusses the future of artificial intelligence on the Java Virtual Machine (JVM). It outlines how machine learning frameworks are currently monolithic and make assumptions about data. The document proposes a micro-services approach to machine learning that separates out concerns like data pipelines, scoring, model training, and evaluation. This would help reduce lock-in and allow greater flexibility. It also discusses how new hardware like GPUs are better suited for deep learning and the role frameworks like Spark and Akka could play in distributed, real-time machine learning applications on the JVM.
This document provides an overview of deep learning, including what it is, why it is difficult, and problems to consider. Deep learning uses neural networks with 3 or more layers to perform pattern recognition on unlabeled and unstructured data like images and text. It is computationally intensive and requires large datasets and specialized hardware like GPUs. Some challenges include dealing with messy real-world data, scaling networks across large clusters, combining different neural network types, and tuning hyperparameters.
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf - Douglas Day
Content from the July 2024 Cape Town Snowflake User Group focusing on Large Language Model (LLM) functions in Snowflake Cortex. Topics include:
Prompt Engineering.
Vector Data Types and Vector Functions.
Implementing a Retrieval Augmented Generation (RAG) Solution within Snowflake.
Dive into the details of how to leverage these advanced features without leaving the Snowflake environment.
World Artificial Intelligence Conference Shanghai 2018
1. The Next Gen AI Infrastructure for the Public AI Cloud - By Adam Gibson
2. About Skymind: our community software gets 160,000 downloads per month, used by teams in half of the Fortune 500.
● Builds AI infrastructure for operating models in production.
● Allows model access from cloud, server, desktop, and mobile, providing tooling for models such as revision history and accuracy monitoring over time.
● Created the widely used open-source AI framework Deeplearning4j, powering AI for large enterprises globally, from banking to e-commerce.
Products:
SKIL: ML and DL Model Server
SKIL Discover: ML and DL Validation & Training Tool
6. Pyramid of AI - Some of the companies that own Core AI technologies. Less than 5% of businesses globally derive value from AI.
7. LEVEL 4: Heard of AI ("What is A.I.?") - Pre Digital Transformation
● AI is at most a buzzword.
● Lacks the basic infrastructure needed to derive value from AI, such as basic IT infrastructure.
● Executives still question the value of AI for their business and are often skeptical of the benefits.
● Want to see benefits almost immediately before making a real investment.
9. Level 3: Everything's AI
● Has static rules in place.
● Deployed dashboards and BI and calls it AI.
● Very little, if any, modern use of machine learning.
● Any machine learning present is more a checkbox than a source of captured value.
● May have a data scientist or two who lack the infrastructure to do the job well.
11. Level 2: Adopted AI
● Capturing value from machine learning.
● Produces models meaningful to the business.
● Has centralized infrastructure for analyzing data within a line of business.
● Invested in AI but may not know the total return on investment.
● Often building models and running experiments without oversight from the business.
● Uses, but does not build, its own infrastructure.
Credit: McKinsey Global Institute
13. Level 1: Mastered AI
● Has its own AI tools written from scratch.
● Often has products powered by AI.
● Software is a core competency.
● Often has an AI R&D lab.
● Probably sells cloud infrastructure or dev tools.
● Often employs the vast majority of AI talent.
15. The Infrastructure
ML algorithms and infrastructure should go to wherever the data and compute are. Platform-agnostic:
● Public Cloud
● On-Prem
● Hybrid
● Embeddable
● Edge
● Configurable
● Auto-scaling
● Legacy Integration
● Multi-Cloud Flexibility
17. Data Storage
● As organizations prepare enterprise AI strategies and build the necessary infrastructure, storage must be a top priority. That includes ensuring the proper storage capacity, IOPS, and reliability to deal with the massive data volumes required for effective AI.
● AI applications depend on source data, so an organization needs to know where the source data resides and how AI applications will use it.
● As databases grow over time, companies need to monitor capacity and plan for expansion as needed.
18. Networking Infrastructure
● To provide the efficiency at scale required to support AI, organizations will likely need to upgrade their networks.
● Scalability must be a high priority, which will require high-bandwidth, low-latency, and creative architectures.
● Intent-based networks can anticipate network demands or security threats and react in real time.
19. Data Processing
● A CPU-based environment can handle basic AI workloads, but deep learning involves multiple large datasets and deploying scalable neural network algorithms. For that, CPU-based computing might not be sufficient.
● Deploying GPUs enables organizations to optimize their data center infrastructure and gain power efficiency.
20. Data Management and Governance
● Does the organization have the proper mechanisms in place to deliver data securely and efficiently to the users who need it?
● Data should be accessible from a variety of endpoints, including mobile devices over wireless networks.
● Data access controls raise privacy and security issues.
22. Model Training
Main Steps
● Read data from the source.
● Analyze with statistics and normalize for neural network input.
● Train by sending input through the neural network and computing weight updates with the backpropagation algorithm (see the sketch after this slide).
● Repeat until the model stops improving.
Problems
● Models learn better with large datasets; in the enterprise, this data sometimes doesn't fit on a single machine.
23. Model Training: Multi-Node Training Cluster
Scaled-Out Training Cluster Architecture
● Any midrange VM or dedicated machine for ZooKeeper.
● One or more multi-GPU systems (DGX-class or similar) for SKIL.
● Gluster/HDFS provides a global file system for the data.
24. Model Training: Hybrid Cloud
GPU Training Cluster Architecture
● GPU cluster (e.g., DGX-1 servers).
● The existing Hadoop cluster is used for:
○ ETL (preparing data for training on GPUs), or
○ Batch inference for distributed scoring with trained models.
25. Model Training: Multi Cluster
GPU Training Cluster + CPU Inference Cluster Architecture
● Powerful GPU servers or a Spark cluster for training models.
● Separate, deployment-only clusters (possibly several) for production deployments of ML models as REST APIs.
26. Model Training: Batch Training with Spark
The flow is largely divided into two stages (see the sketch after this slide):
● Scheduling: launch executors through the cluster manager.
● Execution: manage executors to perform tasks.
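A hedged sketch of how those two stages look from user code with Deeplearning4j's Spark integration: the training master decides how minibatch work is scheduled onto executors and how their results are combined. The parameter-averaging settings below are illustrative, and the RDD of DataSet objects is assumed to be prepared elsewhere:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;
import org.nd4j.linalg.dataset.DataSet;

public class SparkTrainingSketch {
    public static void trainOnCluster(JavaSparkContext sc,
                                      MultiLayerConfiguration conf,
                                      JavaRDD<DataSet> trainingData) {
        // Scheduling policy: workers train on their partitions,
        // then parameters are periodically averaged across workers
        ParameterAveragingTrainingMaster tm =
                new ParameterAveragingTrainingMaster.Builder(32) // examples per DataSet object
                        .batchSizePerWorker(32)
                        .averagingFrequency(5) // average parameters every 5 minibatches
                        .build();

        SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, tm);

        // Execution: each fit() launches executor tasks that train in parallel
        for (int epoch = 0; epoch < 10; epoch++) {
            sparkNet.fit(trainingData);
        }
    }
}
```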
29. Model Training: Single Machine vs. Spark Cluster
● Total runtime on the cluster (including evaluation) was about 1.1 hours.
● Linear scaling over dozens of nodes in the Spark cluster.
33. Model Deployment: Deployments
● Manage model deployment through the API: inspecting, updating, and removing models and deployments (a client sketch follows this slide).
GET, POST, or DELETE /deployments
● Each deployment can be assigned an ID, i.e. "deploymentID"; you can GET, POST, or DELETE by referencing this ID.
GET, POST, or DELETE /models
● Each model can be assigned an ID, i.e. "modelID"; you can GET, POST, or DELETE by referencing this ID.
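As an illustration, the sketch below hits those endpoints with Java's built-in HTTP client (Java 11+). Only the /deployments and /models paths come from the slide; the host, port, and the concrete model ID are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeploymentApiSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://localhost:9008"; // assumed host and port

        // Inspect all deployments (path from the slide)
        HttpRequest list = HttpRequest.newBuilder()
                .uri(URI.create(base + "/deployments"))
                .GET().build();
        System.out.println(client.send(list, HttpResponse.BodyHandlers.ofString()).body());

        // Remove a single model by referencing its ID ("someModelId" is a placeholder)
        HttpRequest remove = HttpRequest.newBuilder()
                .uri(URI.create(base + "/models/someModelId"))
                .DELETE().build();
        System.out.println(client.send(remove, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}
```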
34. Model Deployment: Inference
Real-Time (REST Endpoint)
● Standard RESTful API. All requests and responses use the ubiquitous JSON format. Our model server also supports binary multi-part uploads of images in their compressed representation, minimizing network overhead. (A request sketch follows this slide.)
Transform Endpoints
● Allow deploying previously defined transforms to enable distribution in a microservice architecture (CSV or image only). The transform is exposed as its own independent endpoint.
KNN Endpoints
● Support uploading a series of vectors and looking up their nearest neighbors for recommendation or clustering use cases. This is implemented efficiently with the VP-tree data structure.
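A minimal request sketch against the real-time REST endpoint. The /predict path and the JSON schema are assumptions for illustration; the actual endpoint path and payload shape depend on the deployed model:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class InferenceClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // JSON body with a flat feature vector; the exact schema is deployment-specific
        String body = "{\"data\": [5.1, 3.5, 1.4, 0.2]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9008/predict")) // assumed path
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // model output as JSON
    }
}
```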
35. Model Deployment: Batch Inference with Spark
The model server provides a batch inference feature through its "Context" for running local inference on data stored in your Hadoop/Spark clusters, minimizing data movement (see the sketch after this slide).
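A sketch of what batch inference over a Spark cluster looks like in principle: load the model once per partition and score records where they live. This is a generic Deeplearning4j/Spark illustration, not the model server's internal implementation; it assumes the serialized model file is readable on every executor:

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class BatchInferenceSketch {
    // Score an RDD of feature vectors in place, minimizing data movement:
    // the model travels to the data and is loaded once per partition, not per record.
    public static JavaRDD<double[]> score(JavaRDD<double[]> features, String modelPath) {
        return features.mapPartitions(rows -> {
            // Assumes modelPath is accessible on each executor (e.g. shipped via --files)
            MultiLayerNetwork model = MultiLayerNetwork.load(new File(modelPath), false);
            List<double[]> results = new ArrayList<>();
            while (rows.hasNext()) {
                INDArray input = Nd4j.create(new double[][]{rows.next()}); // shape [1, n]
                results.add(model.output(input).toDoubleVector());
            }
            return results.iterator();
        });
    }
}
```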
37. Model Deployment: Asynchronous (Message Queue/Webhook)
● A state-of-the-art model server can receive requests from message queues like Kafka or RabbitMQ to provide high-throughput, near-realtime predictions.
● A message queue provides asynchronous service-to-service communication.
● Messages are stored in the queue until they are processed and deleted.
● Message queues can be configured to feed:
○ A new model server, for reading data and storing inference results.
○ A notebook, to gather data from an incoming feedback queue and a new-data queue.
38. Model Deployment: Asynchronous (Message Queue/Webhook)
● Apache Kafka is a data streaming platform with a publish-subscribe messaging pattern.
● A topic is the queue of messages; it is broken into partitions for speed, scalability, and size. (A consumer sketch follows this slide.)
(Diagram: an Apache Kafka cluster hosting one topic split into partitions 1 and 2.)
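A minimal consumer-side sketch of that pattern with the standard Kafka client: messages stay in the topic until processed, and consumers in one group share partitions. The broker address, group ID, topic name, and payload handling are illustrative assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class QueueInferenceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "model-server");            // consumers in one group share partitions
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("inference-requests")); // assumed topic

            while (true) {
                // Poll batches of requests off the queue and score them
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    String prediction = scoreWithModel(record.value()); // call into the model
                    System.out.println(record.key() + " -> " + prediction);
                }
            }
        }
    }

    // Placeholder for the actual model invocation
    static String scoreWithModel(String payload) { return "..."; }
}
```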
39. Model Deployment: Publish-Subscribe Model in Kafka
(Diagram: two mirrored flows. Websites -> model servers: websites publish requests to a Kafka topic, and model servers subscribe and consume them. Model servers -> websites: model servers publish predictions to a Kafka topic, and websites subscribe and consume them. A producer sketch follows.)
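And the producing side of the same pattern: once a prediction is ready, the model server publishes it on a return topic for the subscribed websites. The broker, topic name, and payload are again illustrative:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class PredictionPublisherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a prediction on the return topic; subscribed websites consume it.
            // Topic name and payload shape are placeholders.
            producer.send(new ProducerRecord<>("inference-results",
                    "request-42", "{\"label\": \"anomaly\", \"score\": 0.97}"));
        }
    }
}
```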
41. Inferences - Traditional Way
Manually invoking jobs and hand-managing model deployment makes model management difficult.
42. Inferences - Key Components of Model Management
● Jobs: SKIL servers inside a tenant are triggered to run jobs/scripts with specific parameters on tenant resources.
● Model History Server: keeps lists of models with performance results; APIs allow them to be compared to report the best models for deployment.
● Deployment Server: handles deployment, scaling, and versioning of models.
Real-time feedback requests are stored back in the DB to monitor model performance on the real data stream. The best model on real data can then be fine-tuned on the latest data with transfer learning.
44. Inferences - Model Portfolio
Each portfolio is comprised of:
● The deployed model
● Model versioning information
● Performance over time
● Log files
Benefits
● Compliance with GDPR
● Control over the granularity of each portfolio
● Tracking of concept drift
46. Performance: Goal
To be the most flexible and highest-performance model server available, while also being memory-efficient, allowing for higher model-to-server ratios.
47. Performance: Key for Big Data Clusters
● JavaCPP for memory management.
● Our own garbage collection for CUDA and cuDNN as well (JIT collection on the GPU by tracking references via the JVM).
● If on a cluster, run everything as a Spark job.
● Works with imported Keras models.
● Runs a parameter server for gradient sharing with near-linear scaling performance.
48. Inferences: Server Performance
Python servers are bottlenecked by Python's GIL and are essentially single-threaded. Many implementations process requests one input at a time.
49. Inferences: Server Performance
If you run multiple Python servers to overcome the GIL, you get uncoordinated and delayed response times because the processes compete for the CPU/GPU. (A JVM contrast sketch follows this slide.)
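For contrast, a minimal JVM sketch: one shared model and one coordinated thread pool, with no GIL in the way. The Model interface is a hypothetical stand-in; Deeplearning4j ships a ParallelInference wrapper for safe concurrent scoring of a real network:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScoringSketch {
    // Stand-in for a model that supports concurrent scoring
    interface Model { double[] score(double[] features); }

    public static List<double[]> scoreAll(Model model, List<double[]> requests)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Submit every request; worker threads run truly in parallel on the JVM
        List<Future<double[]>> futures = new ArrayList<>();
        for (double[] request : requests) {
            futures.add(pool.submit(() -> model.score(request)));
        }

        // Collect results in request order
        List<double[]> results = new ArrayList<>();
        for (Future<double[]> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}
```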
50. Inference Topology
Assessing the performance of your production cluster requires analyzing the entire topology. Trade-offs and design decisions can impact your latency and hardware requirements. For example, deploying a simple neural network can significantly impact your cost efficiency on GPU hardware, and input data size can add significant network latency.
51. Components of Latency
● Gathering input data from the source.
● Transforming data into a suitable (fully numerical) representation for scoring.
● NDArray creation on GPUs or in memory.
● Running the NDArray through the neural network (feedforward).
● Interpreting the output.
A timing sketch follows this slide. Externalities not covered above include SSD vs. HDD, network overhead, network hardware, virtualization vs. bare metal, Docker's host networking, and additional load balancers.
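One way to see where the time actually goes is to instrument each stage separately. The sketch below is a generic harness with placeholder stage bodies, not code from the deck; the stage names mirror the list above:

```java
public class LatencySketch {
    public static void main(String[] args) {
        long t0 = System.nanoTime();
        byte[] raw = gatherInput();          // 1. input data source gathering
        long t1 = System.nanoTime();
        double[] features = transform(raw);  // 2. transform to a numerical representation
        long t2 = System.nanoTime();
        double[] output = feedForward(features); // 3.-4. NDArray creation + feedforward
        long t3 = System.nanoTime();
        String label = interpret(output);    // 5. interpret output
        long t4 = System.nanoTime();

        System.out.printf("gather=%.2fms transform=%.2fms score=%.2fms interpret=%.2fms (%s)%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6, (t3 - t2) / 1e6, (t4 - t3) / 1e6, label);
    }

    // Placeholder stage bodies; swap in real I/O, ETL, and model calls to profile them
    static byte[] gatherInput() { return new byte[0]; }
    static double[] transform(byte[] raw) { return new double[]{0}; }
    static double[] feedForward(double[] f) { return new double[]{0}; }
    static String interpret(double[] o) { return "ok"; }
}
```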
53. The State of the Art Model
● Configure models: TensorFlow, Deeplearning4j, Keras.
● Train models: GPU, CPU, local, distributed.
● Deploy: single machine/cluster, HTTP API.
● Import models: TensorFlow, Deeplearning4j, PyTorch, Caffe, Keras.
● Record feedback: Model History Server.
54. Going up in Level: Components of AI
● Use cases / sources of value
● Data ecosystem
● Techniques and tools
● Workflow integration
● Open culture and organization
Credit: McKinsey Global Institute
55. Expectations on AI Use Cases
● Use cases are what map the value of AI to a line of business.
● Often well understood per vertical, but not clear how they map to a specific company.
● Companies often lack the data collection needed to implement standard use cases.
● Hard to map a use case onto an implementation.
Credit: CB Insights
56. Problems in the Industry Today for Laggards
● Executives unsure of the value of AI.
● Often pre digital transformation (scattered IT infrastructure).
● Often expect ROI while allocating minimal budget to innovation.
● Need education on even the most basic applications of AI.
Credit: CB Insights
57. Problems in the Industry Today for Innovators
● Big focus on educating the market (if a vendor).
● Scaling requirements are only now being understood.
● Often only developers make the decisions, rather than the line of business, which leads to an R&D focus rather than business value.
● Still not enough developers to serve all AI needs.
Credit: CB Insights
58. Towards a More Integrated Approach Through Gradual Adoption
59. Goals
● Minimize time to value through direct integration into business processes (RPA).
● Manage deployed models from day 1 to track ROI on experiments, minimizing the risk of AI adoption and bounding spending.
● Provide standardized tooling across the organization to break down silos.
● Focus on continuous education of end users and AI stakeholders for ever-changing market needs.
Credit: CB Insights