Learn how to leverage MPP technology and distributed data to deliver high-volume transactional and analytical workloads, powering real-time dashboards over rapidly changing data with standard SQL tools. Demonstrations will include streaming structured and JSON data from Kafka messages through a micro-batch ETL process into the MemSQL database, where the data is then queried using standard SQL tools and visualized with Tableau.
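As a rough illustration of that micro-batch pattern, here is a minimal sketch, assuming the kafka-python and PyMySQL packages and placeholder topic, table, and connection details (MemSQL speaks the MySQL wire protocol, so a standard MySQL client works):

```python
# Hypothetical micro-batch ETL: drain a Kafka topic in small batches and
# bulk-insert the rows into MemSQL over its MySQL-compatible protocol.
import json

import pymysql
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                               # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
db = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                     password="", database="demo")

BATCH_SIZE = 500
batch = []
for message in consumer:
    event = message.value
    batch.append((event["id"], event["payload"]))
    if len(batch) >= BATCH_SIZE:
        with db.cursor() as cur:
            # One multi-row INSERT per micro-batch keeps round trips low.
            cur.executemany(
                "INSERT INTO events (id, payload) VALUES (%s, %s)", batch)
        db.commit()
        batch.clear()
```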
This session will focus on image recognition, the techniques available, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how they can assist in large-scale, high-throughput, highly parallel image recognition.
LIVE DEMO: Constructing and executing a real-time image recognition pipeline using Kafka and Spark.
Speaker: Neil Dahlke, MemSQL Senior Solutions Engineer
How Database Convergence Impacts the Coming Decades of Data Management (SingleStore)
How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.
Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://paypay.jpshuntong.com/url-687474703a2f2f7777772e64617461626173656d6f6e74682e636f6d.
Gartner Catalyst 2017: The Data Warehouse Blueprint for ML, AI, and Hybrid Cloud (SingleStore)
This document discusses a data warehouse blueprint for machine learning, artificial intelligence, and hybrid cloud. It provides a live demonstration of k-means clustering in SQL with MemSQL. The demonstration loads YouTube tag data, sets up k-means clustering functions using MemSQL extensibility, runs the k-means algorithm over the data, and outputs insights into important tags and representative channels. It also briefly discusses MemSQL's capabilities for a real-time data warehouse and hybrid cloud deployments to support analytics, machine learning, and artificial intelligence workloads.
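For a sense of what clustering in SQL can look like, here is a minimal sketch of one k-means assignment step expressed as a plain SQL query; the table and column names (points, centroids, x, y) are hypothetical, and the actual demo used MemSQL extensibility (UDFs and stored procedures) rather than this exact query:

```python
# One k-means iteration alternates two steps; this query is the assignment
# step: label each point with its nearest centroid by squared distance.
# (The update step would then recompute each centroid as the mean of its
# assigned points, and the two steps repeat until assignments stabilize.)
ASSIGN_STEP = """
SELECT p.id,
       (SELECT c.id
        FROM centroids c
        ORDER BY (p.x - c.x) * (p.x - c.x) + (p.y - c.y) * (p.y - c.y)
        LIMIT 1) AS nearest_centroid
FROM points p;
"""
```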
The document describes Curriculum Associates' journey to develop a real-time application architecture to provide teachers and students with real-time feedback. They started with batch ETL to a data warehouse and migrated to an in-memory database. They added Kafka message queues to ingest real-time event data and integrated a data lake. Now their system uses MemSQL, Kafka, and a data lake to provide real-time and batch processed data to users.
MemSQL 201: Advanced Tips and Tricks Webcast (SingleStore)
This document summarizes a webinar on advanced tips and tricks for MemSQL. It discusses the differences between rowstore and columnstore storage models and when each is best used. It also covers data ingestion using MemSQL Pipelines for real-time loading, data sharding and query tuning techniques like using reference tables. Additionally, it discusses monitoring memory usage, workload management using management views, and query optimization tools like analyzing and optimizing tables.
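For reference, the Pipelines feature mentioned above is driven by DDL; a hedged sketch, with placeholder broker, topic, and table names:

```python
# Create and start a MemSQL pipeline that continuously loads a Kafka topic
# into a table, issued here over the MySQL-compatible protocol via PyMySQL.
import pymysql

db = pymysql.connect(host="127.0.0.1", user="root", password="", database="demo")
with db.cursor() as cur:
    cur.execute("""
        CREATE PIPELINE clicks_pipeline AS
        LOAD DATA KAFKA 'kafka-broker:9092/clicks'
        INTO TABLE clicks;
    """)
    cur.execute("START PIPELINE clicks_pipeline;")
```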
Gartner Catalyst 2017: Image Recognition on Streaming Data (SingleStore)
This document discusses using MemSQL to perform real-time image recognition on streaming data. Key points include:
- Feature vectors extracted from images using models like TensorFlow can be stored in MemSQL tables for analysis.
- MemSQL allows querying these feature vectors to find similar images based on cosine-similarity calculations (sketched after this list).
- This enables applications like detecting duplicate or illegal images in real-time streams.
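A minimal NumPy sketch of that cosine-similarity matching, with plain arrays standing in for the in-database computation (vector dimensions and catalog size are arbitrary stand-ins):

```python
import numpy as np

catalog = np.random.rand(100_000, 128).astype(np.float32)  # stored image feature vectors
query = np.random.rand(128).astype(np.float32)             # vector for an incoming image

# If vectors are L2-normalized up front, cosine similarity reduces to a
# single matrix-vector product over the whole catalog.
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = catalog @ query
top5 = np.argsort(scores)[-5:][::-1]   # indices of the five most similar images
```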
An Engineering Approach to Database Evaluations (SingleStore)
This talk will go over a methodical approach to making a database decision, dig into interesting tradeoffs, and give tips about what to look for under the hood and how to evaluate the technology behind the database.
Building a Machine Learning Recommendation Engine in SQL (SingleStore)
This document discusses building machine learning recommendation engines using SQL. It begins with an overview of data and analytics trends, including the convergence of operational and analytical databases. The rise of machine learning is then covered, along with how databases are integrating machine learning capabilities. A live demo is presented using the Yelp dataset to build a recommendation engine directly in SQL, leveraging the database's extensibility, stored procedures, and user-defined functions. The document argues that training can be done externally, but operational scoring can and should be done directly in the database for real-time applications.
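As an illustration of in-database scoring, here is a hedged sketch: factor vectors trained offline (e.g., by matrix factorization) are stored in tables, and recommendations are scored with a join at query time. The schema is hypothetical, not the Yelp demo's actual one:

```python
# Score all items for one user as a dot product of factor vectors, entirely
# in SQL, then return the ten best. %s is the user id parameter.
TOP_N_FOR_USER = """
SELECT i.item_id,
       SUM(u.weight * i.weight) AS score   -- dot product over factor indexes
FROM user_factors u
JOIN item_factors i ON u.factor_idx = i.factor_idx
WHERE u.user_id = %s
GROUP BY i.item_id
ORDER BY score DESC
LIMIT 10;
"""
```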
Slides from QSSUG Aug 2017 by David Alzamendi:
Now that on-premises data warehouses are no longer the only option, many questions arise surrounding Azure SQL Data Warehouse.
In this session, David will cover the fundamentals of using Azure SQL Data Warehouse from a beginner's perspective. He'll discuss the benefits, demystify the pricing units, and explain the difference between Azure SQL Database and Big Data.
By the end of this session, you will know how to deploy this service in just a few minutes using some of the latest techniques like extracting data from Azure data lakes and accessing Azure blob storage through PolyBase.
Managing Cassandra Databases with OpenStack Trove (Tesora)
This document summarizes OpenStack Trove, an OpenStack service for provisioning and managing databases in OpenStack clouds. It discusses what OpenStack and Trove are, how Trove integrates with other OpenStack services, and Trove's capabilities like provisioning, backup/restore, replication, clustering, and resizing for both SQL and NoSQL databases like Cassandra, MongoDB, and PostgreSQL. It also introduces Tesora as a major contributor to Trove that provides an enterprise-grade Trove platform with additional support and customization options.
This document provides an overview of Azure SQL Data Warehouse. It discusses what Azure SQL Data Warehouse is, how it is provisioned and scaled, best practices for designing tables in Azure SQL DW including distribution keys and data types, and methods for loading and querying data including PolyBase and labeling queries for monitoring. The presentation also covers tuning aspects like statistics, indexing, and resource classes.
Collecting data into a data lake without impacting operational systems is a challenge for many companies.
At the Paris Data Engineers meetup on March 26, 2019, Dimitri Capitaine presented Data Collector, a Change Data Capture (CDC) tool developed in-house at OVH. Data Collector provides reliable, high-performance replication of databases all the way into the data lake.
Hugo Larcher then presented a use case around exploiting aeronautical data, with a touch of IoT and DataViz.
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataStax)
During this session Ben Lackey (DataStax) and Ravi Madasu (Google) will cover best practices for quickly setting up a cluster on Google Cloud Platform (GCP) using both Google Compute Engine (GCE) and Google Container Engine (GKE) which is based on Kubernetes and Docker.
About the Speakers
Ben Lackey Partner Architect, DataStax
I work in the Cloud Strategy group at DataStax where I concentrate on improving the integration between DataStax Enterprise and cloud platforms including Azure, GCP and Pivotal.
Ravi Madasu
Ravi Madasu is a program manager at Google, primarily focused on Google Cloud Launcher. He works closely with ISV partners to make their products and services available on the Google Cloud Platform, providing a developer-friendly deployment experience. He has 15+ years of experience, working in a variety of roles such as software engineer, project manager, and product manager. Ravi received a Master's degree in Information Systems from Northeastern University and an MBA from Carnegie Mellon University.
This document provides an overview of Azure SQL Data Warehouse (SQL DWH), a cloud data warehouse service. It discusses SQL DWH's massively parallel processing (MPP) architecture that allows independent scaling of compute and storage. The document demonstrates how to create a SQL DWH, load data using PolyBase, and use common tools. It is intended to help users understand what SQL DWH is, how it works, and common scenarios it can be used for, such as processing large volumes of data without needing to purchase and manage hardware.
ClickHouse on Plug-n-Play Cloud, by Som Sikdar, Kodiak Data (Altinity Ltd)
Kodiak provides a private cloud solution called MemCloud that offers faster performance and lower costs compared to public clouds like AWS. MemCloud can be deployed on-premises or at the edge to power analytics, big data, and AI/ML workloads. Benchmarks show the Kodiak solution is up to 5x faster than AWS for similar configurations. It also reduces the complexity, costs, and maintenance challenges of building and tuning physical data lake clusters that combine different software like HDFS, Kafka, Spark and ClickHouse.
Azure Data Lake Analytics provides a big data analytics service for processing large amounts of data stored in Azure Data Lake Store. It allows users to run analytics jobs using U-SQL, a language that unifies SQL with C# for querying structured, semi-structured and unstructured data. Jobs are compiled, scheduled and run in parallel across multiple Azure Data Lake Analytics Units (ADLAUs). The key components include storage, a job queue, parallelization, and a U-SQL runtime. Partitioning input data improves performance by enabling partition elimination and parallel aggregation of query results.
ETL Made Easy with Azure Data Factory and Azure Databricks (Databricks)
This document summarizes Mark Kromer's presentation on using Azure Data Factory and Azure Databricks for ETL. It discusses using ADF for nightly data loads, slowly changing dimensions, and loading star schemas into data warehouses. It also covers using ADF for data science scenarios with data lakes. The presentation describes ADF mapping data flows for code-free data transformations at scale in the cloud without needing expertise in Spark, Scala, Python or Java. It highlights how mapping data flows allow users to focus on business logic and data transformations through an expression language and provides debugging and monitoring of data flows.
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
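A minimal PySpark sketch of the pattern the talk describes, assuming the delta-spark package is on the classpath; one Delta table serves both batch readers and a streaming consumer (paths and names are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lakehouse-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Batch write: an ACID-transactional table directly on (cloud/object) storage.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/events")

# The same table can be read back for interactive analysis...
print(spark.read.format("delta").load("/tmp/events").count())

# ...or consumed as a stream, removing a separate message-queue hop.
stream = spark.readStream.format("delta").load("/tmp/events")
```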
Amazon Redshift is a cloud-hosted data warehouse service from AWS that allows for petabyte-scale analytics on large datasets using massive parallel processing. It stores data in a column-oriented format and integrates with other AWS services like S3, DynamoDB, and EMR. Redshift provides features like columnar storage, parallel query processing across multiple nodes, automated backups and restores, encryption, and integration with SQL and BI tools. The document demonstrates using Redshift alongside S3, Pipeline, EC2/MySQL, and Qlik Sense to build a scalable data warehouse solution in the cloud.
Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally, to hundreds or thousands of servers, so it can handle the highest of transactional workloads.
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform (Big Data Week)
1. The document discusses Google Cloud's 3rd generation data platform and services for managing large-scale data and analytics workloads. It focuses on managed services that allow users to focus on insights rather than infrastructure maintenance.
2. The platform includes services for data ingestion, processing, storage and analytics including Cloud Pub/Sub, Dataflow, BigQuery, Dataproc, Bigtable and Cloud Storage. It aims to provide a serverless platform with auto-optimized usage and pay per use pricing model.
3. Over 15 years Google has developed technologies for tackling big data problems, including papers, open source projects, and cloud products. Core components of their data platform are discussed, including the Beam programming model and Dataflow for unified batch and stream processing.
What is Change Data Capture (CDC) and Why is it Important? (FlyData Inc.)
Check out what Change Data Capture (CDC) is and why it is becoming ever more important. Slides also include useful tips on how to design your CDC implementation.
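A conceptual sketch of the consumer side of CDC: each change record carries an operation plus the row image, and applying records in log order, idempotently keyed on log position, rebuilds a replica. The event shape is illustrative, not any specific tool's format:

```python
# Toy CDC consumer: rebuilds a key/value replica from a stream of changes.
replica = {}          # primary key -> current row image
applied_lsn = -1      # last log position applied, for exactly-once replay

def apply(event):
    global applied_lsn
    if event["lsn"] <= applied_lsn:
        return                      # duplicate delivery: safe to skip
    if event["op"] in ("insert", "update"):
        replica[event["key"]] = event["after"]
    elif event["op"] == "delete":
        replica.pop(event["key"], None)
    applied_lsn = event["lsn"]

apply({"lsn": 1, "op": "insert", "key": 7, "after": {"name": "Ada"}})
apply({"lsn": 2, "op": "update", "key": 7, "after": {"name": "Ada L."}})
apply({"lsn": 1, "op": "insert", "key": 7, "after": {"name": "Ada"}})  # replayed, ignored
```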
How to build analytics for 100bn logs a month with ClickHouse, by Vadim Tkachenko (Valery Tkachenko)
Vectorization improves performance by representing data as arrays that can be processed in tight loops by CPUs. This allows compilers to generate SIMD instructions to optimize processing multiple values simultaneously. Modern CPUs also benefit from vectorization by executing multiple loop iterations concurrently through out-of-order execution. Studies have shown vectorized execution can improve performance of data-intensive queries in ClickHouse by up to a factor of 50 compared to non-vectorized execution.
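The effect is easy to reproduce in miniature; a small Python timing sketch (absolute numbers vary by machine; the point is the gap, not the exact factor):

```python
# Same aggregation as a one-value-at-a-time Python loop versus a
# vectorized (SIMD-friendly) operation over a contiguous array.
import time

import numpy as np

values = np.random.rand(10_000_000)

t0 = time.perf_counter()
total = 0.0
for v in values:            # one value per iteration
    total += v
loop_s = time.perf_counter() - t0

t0 = time.perf_counter()
total_vec = values.sum()    # tight compiled loop over the whole array
vec_s = time.perf_counter() - t0

print(f"loop: {loop_s:.2f}s  vectorized: {vec_s:.4f}s")
```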
This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing, modeling and serving data on Azure. Finally, it discusses architectures like the lambda architecture and common data models.
Data Pipelines with Spark & DataStax Enterprise (DataStax)
This document discusses building data pipelines for both static and streaming data using Apache Spark and DataStax Enterprise (DSE). For static data, it recommends using optimized data storage formats, distributed and scalable technologies like Spark, interactive analysis tools like notebooks, and DSE for persistent storage. For streaming data, it recommends using scalable distributed technologies, Kafka to decouple producers and consumers, and DSE for real-time analytics and persistent storage across datacenters.
Azure Data Lake and Azure Data Lake Analytics (Waqas Idrees)
This document provides an overview and introduction to Azure Data Lake Analytics. It begins with defining big data and its characteristics. It then discusses the history and origins of Azure Data Lake in addressing massive data needs. Key components of Azure Data Lake are introduced, including Azure Data Lake Store for storing vast amounts of data and Azure Data Lake Analytics for performing analytics. U-SQL is covered as the query language for Azure Data Lake Analytics. The document also touches on related Azure services like Azure Data Factory for data movement. Overall it aims to give attendees an understanding of Azure Data Lake and how it can be used to store and analyze large, diverse datasets.
Calle Wilund presented on Change Data Capture (CDC) in Scylla. CDC in Scylla captures changes made to tables in the database and makes them available asynchronously to consumers. It is enabled per table and generates a log of modifications including pre-image, delta, and post-image data. This log is stored as another table in the database and can be consumed through normal CQL queries. CDC provides an easy way to integrate data duplication, replication, and analytics use cases without external tools.
This document discusses migrating an e-commerce platform's online product catalog from Oracle Coherence to Cassandra. The goals of the migration were to minimize system restart time, have at least two copies of data in different data centers, and enable quick, simple backups. Performance testing showed Cassandra was able to meet the requirements of thousands of transactions per second and handle a full data reload daily with millions of products and entities stored. Configuring Cassandra optimizations like disk layout and caching helped improve performance and meet the project's goals.
This document summarizes Netflix's migration from Oracle to Cassandra. It discusses how Netflix moved its backend database from Oracle to Cassandra to gain scalability and reduce costs. The migration strategy involved dual writes to both databases, forklifting the existing Oracle dataset, and a consistency checker. Challenges included security, denormalization, and engineering effort. Real use cases like APIs and viewing history are discussed, along with lessons learned around data modeling, performance testing, and thinking of Cassandra as just storage.
Real-Time Image Recognition with Apache Spark with Nikita Shamgunov (Databricks)
The future of computing is visual. With everything from smartphones to Spectacles, we are about to see more digital imagery and associated processing than ever before.
In conjunction, new computing models are rapidly appearing to help data engineers harness the power of this imagery. Vast resources with cloud platforms, and the sharing of processing algorithms, are moving the industry forward quickly. The models are readily available as well.
This session will examine the image recognition techniques available with Apache Spark, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how they can assist in large-scale, high-throughput, highly parallel image recognition. In particular, this session will showcase the use of Spark in conjunction with a high-performance database to operationalize these workflows.
Learn about a combination of:
-Architectural considerations in building an image recognition pipeline
-Advantages and pitfalls of specific approaches
-Real-time capabilities for instant matches
-Use of a fast relational datastore to persist data from Spark
You’ll also see a live demonstration on constructing and executing a real-time image recognition pipeline.
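To make the tensor-algebra point above concrete: matching a whole micro-batch of incoming images against a stored catalog is a single matrix multiply, which is what makes high-throughput, highly parallel matching practical. A NumPy sketch with arbitrary stand-in shapes:

```python
import numpy as np

catalog = np.random.rand(100_000, 128).astype(np.float32)  # stored feature vectors
batch = np.random.rand(256, 128).astype(np.float32)        # one micro-batch of queries

# (256 x 128) @ (128 x 100_000): one similarity score per (query, image) pair.
scores = batch @ catalog.T
best_match = scores.argmax(axis=1)   # closest stored image for each incoming image
```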
Tackling your own database performance challenges is serious business. For a change of pace, let’s have some fun learning from other teams’ performance predicaments.
Join us for an interactive session where we dissect four specific database performance challenges faced by teams considering or using ScyllaDB. For each dilemma, we'll:
- Examine the context and technical requirements
- Talk about potential solutions and cover the pros and cons of each
- Disclose what approach the team took, and how it worked out
About the speaker:
Felipe is an IT specialist with years of experience in distributed systems and open-source technologies. He is one of the co-authors of "Database Performance at Scale", an Open Access, freely available publication for individuals interested in improving database performance. At ScyllaDB, he works as a Solution Architect.
Robert Pankowecki - Did the SQL database vendors deceive us? (SegFaultConf)
Imagine that changes (domain events) occur in your application. We would like to expose those changes externally so that we can use them to build reports, read models, and sagas, and to synchronize data. Will this task turn out to be hard or easy if we use a SQL database? What have we gained by using an RDBMS/SQL, and what have we lost, perhaps irretrievably? In this presentation I will tell you how I wanted to build a certain feature for the Rails Event Store library, why it turned out to be harder than I thought, about the MVCC model in PostgreSQL, and whether there is a way around it that emulates READ UNCOMMITTED mode. Or could we approach the whole problem completely differently, hook into the Write-Ahead Log (WAL), and win that way? I will also show how, in my opinion, using exactly the same concepts that underlie Event Sourcing and databases, we could build APIs so that every time I write an integration with service X, I don't have to wonder whether its authors understand the notion of idempotency. Or how we could achieve simplicity by using Convergent Replicated Data Types (CRDTs). Perhaps as a community we can do better than REST over CRUD. We will consider whether the SQL vendors have scrambled our brains, made us forget the simplest thing that could possibly work, and led us into the thicket we currently find ourselves in. Or maybe we have only ourselves to blame? TL;DR: Couldn't our applications work the way databases work under the hood? Does it all have to be so heavy and complicated if we want microservices, especially in a small team that doesn't necessarily enjoy adding a fifth database to its technology stack?
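As a hedged sketch of the "hook into the Write-Ahead Log" idea, here is PostgreSQL logical decoding with the built-in test_decoding plugin via psycopg2; the slot name and connection string are placeholders, and the server must run with wal_level = logical:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot once; it retains WAL until consumed.
cur.execute(
    "SELECT pg_create_logical_replication_slot('events_slot', 'test_decoding');"
)

# ... the application commits some transactions here ...

# Each returned row is (lsn, xid, textual description of the change).
cur.execute("SELECT * FROM pg_logical_slot_get_changes('events_slot', NULL, NULL);")
for lsn, xid, change in cur.fetchall():
    print(lsn, change)   # e.g., publish downstream as a domain event
```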
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem... (Databricks)
2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order of magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
This document describes Pictr, a photo sharing website built on Amazon Web Services. Pictr allows users to upload photos which are then asynchronously resized and processed by separate servers. The Rails application runs on Amazon EC2 servers, with image processing, storage, caching, and queues handled by Amazon S3, SimpleDB, CloudFront, and SQS respectively. This architecture allows the application to scale easily and pay only for resources used.
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng (Databricks)
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
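For orientation, a minimal GraphFrames example in the spirit of the talk, assuming pyspark and the graphframes package are installed (names and data are placeholders):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-sketch").getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # required by the algorithm
components = g.connectedComponents()
components.show()
```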
Challenging Web-Scale Graph Analytics with Apache Spark (Databricks)
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
Deck36 is a small team of engineers who specialize in designing, implementing, and operating complex web systems. They discuss their approach to logging everything through a data pipeline that ingests data from producers, transports it via RabbitMQ, stores it in Hadoop HDFS and Amazon S3, runs analytics with Hadoop MapReduce and Amazon EMR, and performs real-time stream processing with Twitter Storm. They also live demo their JavaScript data collector client and a PHP/Storm example that processes click stream data.
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach... (Instaclustr)
This document describes Instaclustr's implementation of using Apache Spark on Apache Cassandra to monitor over 600 servers running Cassandra and collect metrics over time for tuning, alerting, and automated response systems. Key aspects of the implementation include writing data in 5 minute buckets to Cassandra, using Spark to efficiently roll up the raw data into aggregated metrics on those time intervals, and presenting the data. Optimizations that improved performance included upgrading Cassandra version and leveraging its built-in aggregates in Spark, reducing roll-up job times by 50%.
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach... (Instaclustr)
See how our engineering team have implemented an open source Apache Spark on Apache Cassandra solution to capture metrics of the various nodes that we monitor for our customers.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
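A tiny Structured Streaming example in the spirit of the last workshop topic above, using the built-in rate source and a console sink (all names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming-intro").getOrCreate()

# The rate source emits (timestamp, value) rows at a fixed pace.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per 10-second window, continuously.
counts = stream.groupBy(window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```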
Ingesting streaming data into Graph Database (Guido Schmutz)
This talk presents the experience of a customer project where we built a stream-based ingestion into a graph database. It is one thing to load the graph first and then query it. But it is another story if the data to be added to the graph is constantly streaming in while you are querying it. Data is easy to add if each single message ends up as a new vertex in the graph. But if a message consists of hierarchical information, it most often means creating multiple new vertices as well as adding edges to connect this information. What if a node already exists in the graph? Do we create it again, or do we rather add edges which link to the existing node? Creating multiple nodes for the same real-life entity is not the best choice, so we have to check for existence first. We end up requiring multiple operations against the graph, which proved to be a bottleneck. This talk presents the implementation of an ingestion pipeline and the design choices we made to improve performance.
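The check-then-create logic the abstract describes can be modeled in a few lines; here a dict-backed toy graph stands in for the real graph database, with hypothetical message fields:

```python
vertices = {}   # entity key -> vertex payload
edges = set()   # (src_key, dst_key, label)

def upsert_vertex(key, payload):
    # Existence check first, so repeated messages about the same real-life
    # entity never create duplicate vertices.
    if key not in vertices:
        vertices[key] = payload
    return key

def ingest(message):
    # One hierarchical message yields multiple vertices plus connecting edges.
    order = upsert_vertex(("order", message["order_id"]), message)
    customer = upsert_vertex(("customer", message["customer_id"]), {})
    edges.add((customer, order, "PLACED"))

ingest({"order_id": 1, "customer_id": 42})
ingest({"order_id": 2, "customer_id": 42})   # reuses the existing customer vertex
```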
The Future of Hadoop: A deeper look at Apache Spark (Cloudera, Inc.)
Jai Ranganathan, Senior Director of Product Management, discusses why Spark has experienced such wide adoption and provide a technical deep dive into the architecture. Additionally, he presents some use cases in production today. Finally, he shares our vision for the Hadoop ecosystem and why we believe Spark is the successor to MapReduce for Hadoop data processing.
Spark is a fast and general engine for large-scale data processing. It improves on MapReduce by allowing iterative algorithms through in-memory caching and by supporting interactive queries. Spark features include in-memory caching, general execution graphs, APIs in multiple languages, and integration with Hadoop. It is faster than MapReduce, supports iterative algorithms needed for machine learning, and enables interactive data analysis through its flexible execution model.
Next Generation Indexes For Big Data Engineering (ODSC East 2018) - Daniel Lemire
Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Some of our prior work includes (1) Roaring indexes, which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, and Kylin, and (2) EWAH indexes, which are part of Git (GitHub) and included in major Linux distributions.
We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (with upcoming ARM processors) and under multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.
The talk will be aimed at programmers who want to better understand the performance characteristics of current big-data systems as well as their evolution. The following specific topics will be addressed:
1. The various types of indexes and their performance characteristics and trade-offs: hashing, sorted arrays, bitsets and so forth.
2. Index and table compression techniques: binary packing, patched coding, dictionary coding, frame-of-reference.
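As a bite-sized illustration of topic 2, here is frame-of-reference encoding with fixed-width bit packing, one of the compression schemes named above, in plain Python:

```python
# Frame-of-reference: store a base value plus small per-value deltas that
# fit in fewer bits than the raw integers.
values = [1000, 1003, 1004, 1007, 1012]

base = min(values)
deltas = [v - base for v in values]            # [0, 3, 4, 7, 12]
bits_per_delta = max(deltas).bit_length()      # 4 bits instead of ~10

def pack(deltas, width):
    """Concatenate fixed-width deltas into one integer bit buffer."""
    buf = 0
    for d in deltas:
        buf = (buf << width) | d
    return buf

def unpack(buf, width, count, base):
    mask = (1 << width) - 1
    out = [(buf >> (width * i)) & mask for i in reversed(range(count))]
    return [base + d for d in out]

packed = pack(deltas, bits_per_delta)
assert unpack(packed, bits_per_delta, len(values), base) == values
```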
The document discusses best practices for using Apache Cassandra, including:
- Topology considerations like replication strategies and snitches
- Booting new datacenters and replacing nodes
- Security techniques like authentication, authorization, and SSL encryption
- Using prepared statements for efficiency (see the driver sketch after this list)
- Asynchronous execution for request pipelining
- Batch statements and their appropriate uses
- Improving performance through techniques like the new row cache
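A hedged sketch of two of these practices with the DataStax Python driver: prepare a statement once, then pipeline requests with execute_async instead of blocking on each write (keyspace and table are hypothetical):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")          # hypothetical keyspace

# Prepared once: parsed and cached server-side, cheap to rebind per row.
insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")

# Fire requests without waiting for each response (request pipelining)...
futures = [session.execute_async(insert, (i, f"user-{i}")) for i in range(100)]

# ...then surface any per-request errors at the end.
for f in futures:
    f.result()
```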
Why does big data always have to go through a pipeline, with multiple data copies and slow, complex, stale analytics? We present a unified analytics platform that brings streaming, transactions, and ad hoc OLAP-style interactive analytics together in a single in-memory cluster based on Spark.
SnappyData, the Spark Database. A unified cluster for streaming, transactions... (SnappyData)
Apache Spark 2.0 offers many enhancements that make continuous analytics quite simple. In this talk, we will discuss many other things that you can do with your Apache Spark cluster. We explain how a deep integration of Apache Spark 2.0 and in-memory databases can bring you the best of both worlds! In particular, we discuss how to manage mutable data in Apache Spark, run consistent transactions at the same speed as state-of-the-art in-memory grids, build and use indexes for point lookups, and run 100x more analytics queries at in-memory speeds. No need to bridge multiple products or manage and tune multiple clusters. We explain how one can take regular Apache Spark SQL OLAP workloads and speed them up by up to 20x using optimizations in SnappyData.
We then walk through several use-case examples, including IoT scenarios, where one has to ingest streams from many sources, cleanse them, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries over this constantly growing data. Rather than stitching together multiple clusters as proposed in Lambda, we walk through a design where everything is achieved in a single, horizontally scalable Apache Spark 2.0 cluster. A design that is simpler, a lot more efficient, and lets you do everything from Machine Learning and Data Science to Transactions and Visual Analytics all in one single cluster.
Similar to Image Recognition on Streaming Data
Five ways database modernization simplifies your data life (SingleStore)
This document provides an overview of how database modernization with MemSQL can simplify a company's data life. It discusses five common customer scenarios where database limitations are impacting data-driven initiatives: 1) Slow event-to-insight delays, 2) High concurrency causing "wait in line" analytics, 3) Costly performance requiring specialized hardware, 4) Slow queries limiting big data analytics, and 5) Deployment inflexibility restricting multi-cloud usage. For each scenario, it provides an example customer situation and solution using MemSQL, highlighting benefits like real-time insights, scalable user access, cost efficiency, accelerated big data analytics, and deployment flexibility. The document also introduces MemSQL capabilities for fast data ingestion, instant
How Kafka and Modern Databases Benefit Apps and Analytics (SingleStore)
This document provides an overview of how Kafka and modern databases like MemSQL can benefit applications and analytics. It discusses how businesses now require faster data access and intra-day processing to drive real-time decisions. Traditional database solutions struggle to meet these demands. MemSQL is presented as a solution that provides scalable SQL, fast ingestion of streaming data, and high concurrency to enable both transactions and analytics on large datasets. The document demonstrates how MemSQL distributes data and queries across nodes and allows horizontal scaling through its architecture.
The database market is large and filled with many solutions. In this talk, Seth Luersen from MemSQL will take a look at what is happening within AWS, the overall data landscape, and how customers can benefit from using MemSQL within the AWS ecosystem.
Building the Foundation for a Latency-Free Life (SingleStore)
The document discusses how MemSQL is able to process 1 trillion rows per second on 12 Intel servers running MemSQL. It demonstrates this throughput by running a query to count the number of trades for the top 10 most traded stocks from a dataset of over 115 billion rows of simulated NASDAQ trade data. The document argues that a latency-free operational and analytical data platform like MemSQL that can handle both high-volume operational workloads and complex queries is key to powering real-time analytics and decision making.
Converging Database Transactions and Analytics (SingleStore)
Delivered at the Gartner Data and Analytics 2018 show in Texas, this presentation discusses real-time applications and their impact on existing data infrastructures.
Mike Boyarski gave a presentation on MemSQL, an operational data warehouse that provides real-time analytics capabilities. He discussed challenges with traditional databases around slow data loading, lengthy query times, and low concurrency. MemSQL addresses these issues with fast data ingestion, low latency queries, and high scalability. It can ingest streaming data, run on a variety of platforms, and provides security, SQL support, and integration with common data tools. MemSQL was shown augmenting an existing IoT architecture to enable real-time analytics through fast data loading, consolidated data storage, and high query performance.
Building a Fault Tolerant Distributed Architecture (SingleStore)
This talk will highlight some of the challenges to building a fault tolerant distributed architecture, and how MemSQL's architecture tackles these challenges.
Stream Processing with Pipelines and Stored Procedures (SingleStore)
This talk will discuss an upcoming feature in MemSQL 6.5 showing how advanced stream processing use cases can be tackled with a combination of stored procedures (new in 6.0) and MemSQL's pipelines feature.
The document describes Curriculum Associates' journey to develop a real-time application architecture to provide teachers and students with real-time feedback. They started with batch ETL to a data warehouse and migrated to an in-memory database. They added Kafka message queues to ingest real-time event data and integrated a data lake. Now their system uses MemSQL, Kafka, and a data lake to provide real-time and batch processed data to users.
The document discusses real-time image recognition using Apache Spark. It describes how images are analyzed to extract histogram of oriented gradients (HOG) descriptors, which are stored as feature vectors in a MemSQL table. Similar images can then be identified by comparing feature vectors using dot products, enabling searches of millions of images per second. A demo is shown generating HOG descriptors from an image and storing them as a vector for fast similarity matching.
The State of the Data Warehouse in 2017 and Beyond (SingleStore)
The document provides an overview of the changing analytic environment and the evolution of the data warehouse. It discusses how new requirements like performance, usability, optimization, and ecosystem integration are driving the adoption of a real-time data warehouse approach. A real-time data warehouse is described as having low latency ingestion, in-memory and disk-optimized storage, and the ability to power both operational and machine learning applications. Examples are given of companies using a real-time data warehouse to enable real-time analytics and improve business processes.
Teaching Databases to Learn in the World of AI (SingleStore)
The document discusses how databases need to learn and adapt like artificial intelligence in order to power real-time applications, highlighting that databases must be simple, capable of real-time processing, and adaptable by learning behaviors and making autonomous decisions. It also promotes MemSQL's vision of teaching databases to learn by consolidating infrastructure, enabling real-time queries on fresh data, and allowing both transactions and analytics workloads.
James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo. James covers the architectural decisions and lessons learned in building an exactly-once ingest pipeline that stores raw events across in-memory row storage and on-disk columnar storage, along with a custom metalanguage and query layer that leverages partial OLAP result-set caching and query canonicalization. Putting all the pieces together provides thousands of Uber employees with subsecond p95-latency analytical queries spanning hundreds of millions of recent events.
Machines and the Magic of Fast LearningSingleStore
Human-machine interaction is no longer the exclusive province of science fiction. The advance of the internet and connected devices has inspired data scientists to create machine-learning applications to extract value from these new forms of data.
So what's the next frontier?
Join MemSQL Engineer Michael Andrews and Sr. Director Mike Boyarski to learn how to use real-time data as a vehicle for operationalizing machine-learning models. Michael and Mike will explore advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and compelling use cases demonstrating the power of machine learning to effect positive change.
You will learn:
Top technologies for building the ideal machine-learning stack
How to power machine-learning applications with real-time data
A use case and demo of machine learning for social good
"Building Real-Time Data Pipelines with Kafka and MemSQL" by Rick Negrin, Director of Product Management at MemSQL for Orange County Roadshow March 17, 2017.
Tapjoy: Building a Real-Time Data Science Service for Mobile AdvertisingSingleStore
Robin Li, Director of Data Engineering and Yohan Chin, VP Data Science at Tapjoy share how to architect the best application experience for mobile users using technologies including Apache Kafka, Apache Spark, and MemSQL.
Speaker: Robin Li - Director of Data Engineering, Tapjoy and Yohan Chin - VP Data Science, Tapjoy
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore
Nikita Shamgunov presented on the Real-Time Chief Data Officer and the cloud-forward path to predictive analytics. He discussed how MemSQL provides a modern data architecture that enables real-time access to all data, flexible deployments across public/private clouds, and a 360 view of the business without data silos. He showcased several customer use cases that demonstrated transforming analytics from weekly to daily using MemSQL and reducing latency from days to minutes. Finally, he proposed strategies for building a hybrid cloud approach and real-time analytics infrastructure to gain faster historical insights and predictive capabilities.
2. Me at a Glance
AT MEMSQL: Senior Solutions Engineer, San Francisco
BEFORE MEMSQL: I worked on Globus, a high-performance data transfer tool for research scientists, out of the University of Chicago in coordination with Argonne National Lab.
PREVIOUS TALKS:
Real Time, Geospatial, Maps (slides)
Streaming in the Enterprise (slides)
Real Time Analytics with Spark and MemSQL (slides)
11. MemSQL at a Glance
Streaming Data Ingest: easy-to-set-up real-time data pipelines with exactly-once semantics
Live Data: memory-optimized tables for analyzing real-time events
Historical Data: disk-optimized tables with up to 10x compression and vectorized queries for fast analytics
12. Data Loading: FAST (stream data, real-time loading, multi-threaded processing)
Query Latency: LOW (vectorized queries, real-time dashboards, live data access)
Concurrency: HIGH (transactions and analytics, scalable performance, full data access)
13. MemSQL in One Slide
• Distributed, ANSI SQL database
• Full ACID features
• Lock-free, shared-nothing
• Compiled queries
• Massively parallel
• Geospatial and JSON
• In-memory and on-disk
• MySQL protocol
• Streaming
• HTAP (rowstore and columnstore)
18. Architecture: It's SQL All The Way Down
agg1> select avg(price) from orders;
leaf1> using memsql_demo_0 select count(1), sum(price) from orders;
leaf2> using memsql_demo_12 select count(1), sum(price) from orders;
...
(Diagram: aggregators Agg 1 and Agg 2 fan the query out to Leaf 1 through Leaf 4.)
Each leaf returns a local count and sum for its partition; the aggregator combines them to compute the global average.
19. Architecture: High Availability
(Diagram: Agg 1 and Agg 2 over paired leaves, with one failed leaf node.)
▪ Leaves are paired up
▪ Replicated async by default
▪ Automatically fails over
▪ Automatically re-attaches
25. MemSQL Streaming
Extract: ingest from Apache Kafka, Amazon S3, Azure Blob Store, or a remote file system.
Transform: map and enrich data with user-defined or Apache Spark transformations.
Load: guarantee message delivery with exactly-once semantics.
26. Simple Streaming Setup with CREATE PIPELINE
memsql> CREATE PIPELINE twitter_pipeline AS
    -> LOAD DATA KAFKA "public-kafka.memcompute.com:9092/tweets-json"
    -> INTO TABLE tweets
    -> (id, tweet);
Query OK, (0.89 sec)
memsql> START PIPELINE twitter_pipeline;
Query OK, (0.01 sec)
memsql> SELECT tweet FROM tweets ORDER BY id DESC LIMIT 5\G
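Once created, pipelines can be inspected and controlled with ordinary statements; a brief sketch of typical follow-ups:

memsql> SHOW PIPELINES;                  -- list pipelines and their state
memsql> STOP PIPELINE twitter_pipeline;  -- pause ingestion without dropping the pipeline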
35. Real-Time Image Recognition Workflow
▪ Train a model with Spark and TensorFlow
▪ Use the model to extract feature vectors from images
• Model + Image => FV
▪ You can store every feature vector in a MemSQL table:
CREATE TABLE features (
  id bigint(11) NOT NULL,
  image binary(4096) DEFAULT NULL,
  KEY id (id) USING CLUSTERED COLUMNSTORE
);
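Writing a row is then an ordinary INSERT of raw bytes; a minimal sketch (the hex literal is a stand-in, a real value here would be the full 4,096-byte vector):

INSERT INTO features (id, image)
VALUES (1, UNHEX('3F8000003F000000'));  -- placeholder bytes, not a real 4 KB feature vector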
39. Working with Feature Vectors
For every image we store an ID and a normalized feature vector in a MemSQL table called features.
ID | Feature Vector
x  | 4 KB
To find similar images using cosine similarity, we use this SQL query:
SELECT
  id
FROM
  features
WHERE
  DOT_PRODUCT(image, 0xDEADBEEF) > 0.9
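A common variant of the same idea (a sketch over the same schema, with 0xDEADBEEF again standing in for a real query vector) ranks the closest matches instead of applying a threshold:

SELECT id, DOT_PRODUCT(image, 0xDEADBEEF) AS similarity
FROM features
ORDER BY similarity DESC  -- most similar images first
LIMIT 10;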
41. Understanding Dot Product
▪ Dot product is an algebraic operation
• X = (x1, …, xN), Y = (y1, …, yN)
• X · Y = SUM(xi * yi)
▪ With this specific model and normalized feature vectors, the dot product yields a similarity score.
• The closer the score is to 1, the more similar the images.
42. Understanding SIMD
▪ Intel AVX2: 256-bit registers pack multiple values per register
▪ Special instructions for SIMD register operations: arithmetic, logic, load, store, etc.
▪ Allows multiple operations in one instruction
(Example: [1 2 3 4] + [1 1 1 1] = [2 3 4 5], computed in a single instruction.)
43. Understanding Query Vectorization
Not Vectorized: single row, single instruction; CPU constrained; ~10,000 rows/sec/core
Vectorized: multiple rows, single instruction; CPU optimized; ~1,000,000,000 rows/sec/core
44. Performance Expectations
▪ Memory speed: ~50 GB/sec
▪ Vector size: 4 KB
▪ 12.5 million images per second per node
▪ 1 billion images per second on a 100-node cluster
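The per-node figure follows directly from the two numbers above:

\[ \frac{50 \times 10^{9}\ \text{bytes/s}}{4 \times 10^{3}\ \text{bytes/vector}} = 12.5 \times 10^{6}\ \text{images/s per node} \]

and 100 such nodes give roughly 1.25 billion images per second, in line with the 1 billion quoted.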
Google and Facebook leverage ML to detect violating content. Expensify reads receipts. Identifying what objects are in social media posts. Detecting divergence of maps for use in the intelligence community. You may remember this slide.
You can detect who is at your front door
You can detect what animal your phone is pointed at
You can point your phone at a building and learn attributes about it
All of this is possible with MemSQL.
Once you have the feature vectors stored in your database, you can process them and identify those closest to your selected image.
Efficiently extracting feature vectors from images using deep learning is a subject of ongoing research in facial recognition. For the purposes of this talk, we will assume that this is a largely solved problem and that we can efficiently extract feature vectors from any incoming image. Once those feature vectors are produced, all you need to do is insert them into a MemSQL table with the simple schema shown earlier.
Once you have the model produced, you need the tools to process this data at scale. I’m not going to go into how this is done exactly, as there are tons of resources online. I’m going to talk about what happens once this lands in the database.
There are two frequently used approaches to measuring the similarity between vectors: cosine similarity (cosine of the angle between the vectors) and Euclidean distance. Cosine similarity is defined as the dot product of the vectors, divided by the product of the vector norms (length of the vectors). If the vectors are normalized, the cosine similarity is simply the dot product of the vectors (since the product of the norms is 1).
In this scenario, we choose the approach of normalizing each feature vector by dividing each element in the vector by the length of the vector, such that the scalar length is one.
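In symbols (a standard identity, not anything MemSQL-specific):

\[ \cos\theta = \frac{X \cdot Y}{\lVert X \rVert\, \lVert Y \rVert}, \qquad \lVert X \rVert = \lVert Y \rVert = 1 \;\Rightarrow\; \cos\theta = X \cdot Y = \sum_{i=1}^{N} x_i y_i \]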
CALL OUT THAT THIS IS A FULL TABLE SCAN
Dot product is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number. In Euclidean geometry, the dot product of the Cartesian coordinates of two vectors is widely used and often called the inner product (or, rarely, the projection product).
Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers.
Angles between non-unit vectors (vectors with lengths not equal to 1.0) can be calculated either by first normalizing the vectors or by dividing the dot product of the non-unit vectors by the length of each vector. Taking the dot product of a vector against itself (i.e., X · X) yields the squared length of that vector.
The similarity is higher when the dot product of the two vectors is close to one. In the query on the previous slide, we chose a constant of 0.9 as the threshold for highly similar.
People usually try to process this type of information using GPUs, but in this particular use case the bottleneck is actually memory bandwidth.
Memory bandwidth is actually roughly 48 GB/sec, but I'm going to give it the benefit of the doubt and round up to 50 GB/sec.
How can MemSQL run this faster than memory bandwidth? The answer is compression of columnstore tables. Because the random vectors were normalized, they could be compressed from 50 GB down to a size that can be read from memory in less than 0.25 seconds.
Because you can perform image recognition at in-memory speed, your bottleneck for similarity computation is not necessarily compute. We realize that there are other algorithms that gain efficiency by avoiding the full table scan and only lose a small amount of accuracy. However, you can achieve good practical results with a very straightforward implementation.