Scaling Training Data for AI Applications

Ron Schmelzer and Kathleen Walch
Principal Analysts, Cognilytica
Kristin Simonini
VP of Product, Applause

Today’s Speakers
Ron Schmelzer
Principal Analyst, Cognilytica
Kathleen Walch
Principal Analyst, Cognilytica
Kristin Simonini
VP of Product, Applause

Today’s Agenda
• Making AI a Reality
• The Seven Patterns of AI, and What Requires Training Data
• Leveraging a Global Community to Source Training Data
• Real Example of Overcoming the Challenges of a Training Data
Sourcing Project

About Cognilytica
• Cognilytica is an AI and cognitive technology-focused research and
advisory firm.
• Produces market research, advisory, and guidance on AI, ML, and
cognitive technology
• Produces the popular AI Today podcast, in addition to an infographic
series, whitepapers, webinars, newsletters, and other popular
content.
• Focused on enterprise and public sector adoption of AI, ML, and cognitive
technology
• Kathleen Walch and Ron Schmelzer are Principal Analysts and Managing
Partners of Cognilytica
• Contributing writers to Forbes, TechTarget (SearchEnterpriseAI), Cognitive
World, and CTOVision

Identifying Data Sets for ML: Data Collection
• Data is the heart, soul, and juju of AI
• The specific data you need depends on the business problem you’re
solving and the kinds of predictive or goal outcomes you’re looking for
• Activities for data collection:
  • Identifying the required data on which to train
  • Identifying all the dimensions of that data needed for significant
  predictive value
  • Identifying the features that are required
  • Identifying the sources of data
  • Identifying the means to aggregate that data
• There is no exact answer to the question “How much data is needed?”

Making AI a Reality
The Seven Patterns of AI

The Conversation & Human Interaction Pattern
• Machines and humans interacting with each other in natural language,
using conversational forms of interaction across a variety of channels,
including voice, text, written, and image forms.
• The objective of this pattern is for machines to interact with humans
the way humans interact with each other.

The Recognition Pattern
• Using ML to identify and understand images, sound, items,
handwriting, faces, and gestures.
• The objective of this pattern is to have machines identify and
understand the real world and unstructured data.

Predictive Analytics & Decision Support
• Using machine learning and other cognitive approaches to learn from
past or existing behavior, interactions, and data in order to predict
future outcomes, or to help humans make decisions about those outcomes.
• The objective of this pattern is helping humans make better decisions.

Pattern & Anomaly Detection
• Machine learning (esp. deep learning) is good at recognizing patterns
• If you can train it, you can detect it
• If you can train it, you can detect patterns… or things that don’t
fit patterns

Autonomous Systems
• Physical and virtual (software) systems that are able to
accomplish a task, achieve a goal, interact with their
surroundings, and perform their objective with minimal
or no human involvement.
• The objective of this pattern is minimizing human labor.

Data Labeling: The Achilles Heel of AI
• For supervised learning approaches to work, they must be fed clean,
well-labeled data that the system can use to learn from example.
• But how do you get labeled data?
  • Do it yourself
  • Find a source of already labeled data
  • Get your users to do it
  • Hire a contractor workforce
  • Contract with third-party data labeling firms

The Data Preparation & Engineering Pipeline
Data Acquisition / Ingest / Capture
• ETL
• Cloud-based data
Merging
• Combining data sources
Cleaning
• Deduping, removing extraneous or bad data
Labeling
• Adding machine learning labels and annotations for training
purposes
Enhancing
• Adding necessary additional data for models
Filtering
• Eliminating bias
Feature Engineering
• Assisting with enhancement (see future on multiplying
data sets)
Retraining Pipelines
• Creation of pipelines to deal with model iteration
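
As a rough illustration of the Merging, Cleaning, and Labeling stages above, here is a minimal Python sketch. The `Record` fields, the sample data, and `toy_labeler` are hypothetical and stand in for whatever sources and labeling workforce a real project uses; the other stages (enhancing, filtering, feature engineering, retraining) are not shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """One raw training example; the field names are hypothetical."""
    source: str
    text: str
    label: Optional[str] = None


def merge(*sources: Iterable[Record]) -> List[Record]:
    """Merging: combine several data sources into one list."""
    return [record for source in sources for record in source]


def clean(records: Iterable[Record]) -> List[Record]:
    """Cleaning: drop empty records and duplicates (compared on normalized text)."""
    seen = set()
    kept = []
    for record in records:
        key = " ".join(record.text.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


def label(records: Iterable[Record], labeler: Callable[[str], str]) -> List[Record]:
    """Labeling: attach a label to every record that does not have one yet."""
    return [
        r if r.label is not None else Record(r.source, r.text, labeler(r.text))
        for r in records
    ]


def toy_labeler(text: str) -> str:
    """Stand-in for a human or third-party labeling step (an assumption)."""
    return "question" if text.strip().endswith("?") else "statement"


def run_pipeline(*sources: Iterable[Record]) -> List[Record]:
    return label(clean(merge(*sources)), toy_labeler)


crm = [Record("crm", "Where is my order?"), Record("crm", "  ")]
chat = [Record("chat", "where is my   order?"), Record("chat", "Thanks, all good.")]
prepared = run_pipeline(crm, chat)
```

Keeping each stage as its own function makes it easier to rerun only part of the pipeline when the model is retrained.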

World’s Largest Community Of Vetted Digital Professionals
Available in real-time and selected to represent your customers.
Custom, Vetted Testing
& Feedback Teams
Any demographic, device, and region
to achieve your specific needs

Applause for AI: An End-to-End Solution
[Diagram labels: Training Data; Machine-Learning Algorithm; Testing Output]
Data types shown: Speech, Video, Text, Questions, Handwriting, Images
Questions asked when testing the output:
• Did it understand me?
• Did I see or hear what I expected?
• Did it respond accurately?
• Were the recommendations relevant?
• Was the information captured correctly?
• Was it easy to use?

The Challenge: Sourcing Data for AI
• 81% of executives said training AI with data is more difficult than expected
• Main challenges included biased or erroneous data, not enough data, or
inability to label data.
• 60% of decision makers at firms adopting AI cite data quality as either
“challenging” or “very challenging.” (IDC)
• “Regardless of your beginner or expert AI status, data is the Debbie Downer of
any AI project.” (Forrester)

What we see in the Enterprise:
• You need LOTS of training data: Thousands to tens of
thousands of artifacts: images, videos, documents,
voices/dialects
• You need QUALITY data, not just volume: Poor data
results in costly delays to the product development
lifecycle
• You need a DIVERSE, global community of testers:
Gender, age, race, and language are must-haves for today’s
AI applications. You can’t have one individual provide
hundreds of artifacts; you need hundreds of testers who each
provide a single artifact
• You need to be able to rapidly EVOLVE: As product
teams train the algorithm, they often need to change
their sourcing requirements if they are not getting the
expected output.
Three dimensions: Quantity, Diversity, Quality
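
One way to monitor the diversity and "many testers, single artifacts" points during a collection is sketched below in Python. The `(contributor_id, group)` tuple layout, the quota numbers, and both function names are illustrative assumptions, not part of the Applause platform.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Artifact = Tuple[str, str]  # (anonymized contributor ID, group such as a language)


def coverage_gaps(artifacts: List[Artifact], quotas: Dict[str, int]) -> Dict[str, int]:
    """Return how many more distinct contributors each group still needs."""
    contributors: Dict[str, Set[str]] = {}
    for contributor_id, group in artifacts:
        contributors.setdefault(group, set()).add(contributor_id)
    gaps = {}
    for group, minimum in quotas.items():
        have = len(contributors.get(group, set()))
        if have < minimum:
            gaps[group] = minimum - have
    return gaps


def max_contributor_share(artifacts: List[Artifact]) -> float:
    """Fraction of all artifacts that come from the single busiest contributor."""
    if not artifacts:
        return 0.0
    counts = Counter(contributor_id for contributor_id, _ in artifacts)
    return max(counts.values()) / len(artifacts)


collected = [("t1", "en"), ("t1", "en"), ("t2", "en"), ("t3", "es")]
quotas = {"en": 2, "es": 2, "fr": 1}  # assumed minimum distinct contributors per group
gaps = coverage_gaps(collected, quotas)
share = max_contributor_share(collected)
```

A high `share` signals that one individual is supplying too much of the data set, and a non-empty `gaps` shows which groups still need recruiting.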

How Applause Solves: Sourcing Quality Data at Scale
A vetted community of over 400,000 testers in 200+ countries enables
Applause to deliver a seamless sourcing solution that includes:
• Quality vs. Volume: We build agreements focused on usable data vs.
simple data collection
• Managed Service: End-to-end program that includes recruitment, quality
control, delivery, and tester training
• Privacy and Security: Seamlessly manages the complex privacy
landscape, including PII, HIPAA, GDPR, and any company-specific
confidentiality requirements
• Elastic and Scalable: Unique business model that enables companies to
rapidly support evolving product and business requirements

HOW DOES THIS LOOK IN
PRACTICE?

How to Source Training Data: Use Case
Requirement:
Source thousands of real-world handwritten documents
• Blind collection with no PII data
• No one individual could submit more than a single document
• Minimum density required: Words per page

Challenge: Recruit a High Number of Diverse Participants
• Training data required thousands of pages of real handwriting across a variety of
documents and personal artifacts, including (but not limited to):
  • Prescriptions/doctors’ notes
  • Purchase orders
  • Credit applications
  • Personal essays and letters
  • Driver’s licenses and birth certificates from all 50 states
  • Tax forms
• Each handwriting sample had to be unique and could not be replicated across
types or groups
• The Applause service and platform are built to recruit and incentivize thousands of
testers to deliver documents with specific requirements, such as word density and
redaction of all personal information

Challenge: Extremely Specific Requirements
• On top of unique testers, there was a requirement for unique forms with specific
requirements
  • Tax forms required a diversity of different types: W-2, pay stubs, IRS
  1098-T, IRS 1099-R, IRS 1099-DIV, and others
• Each document had specifications
  • No more than one folded margin in the middle
  • No deformations on the page
  • Minimum number of words per page
• Each document needed to be authentic, but with minimal redactions
• Automation only gets you so far. You need a proven QA and validation process
that is staffed by an experienced team to check multiple requirements and
dependencies
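
The automatable part of these checks could look roughly like the Python sketch below. It is an assumption-laden illustration, not the Applause QA process: the `Submission` fields, the 50-word threshold, and the rule that only accepted documents count against the one-per-contributor limit are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Submission:
    contributor_id: str  # anonymized ID, no PII (hypothetical field)
    doc_type: str
    word_count: int
    fold_count: int
    deformed: bool


MIN_WORDS_PER_PAGE = 50  # assumed value; the deck does not state the real minimum
MAX_FOLDS = 1            # "no more than one folded margin"


def review(submissions: List[Submission]) -> Tuple[List[Submission], List[Tuple[int, List[str]]]]:
    """Split a batch into accepted documents and (index, reasons) rejections."""
    accepted_contributors = set()
    accepted: List[Submission] = []
    rejected: List[Tuple[int, List[str]]] = []
    for index, sub in enumerate(submissions):
        reasons = []
        if sub.contributor_id in accepted_contributors:
            reasons.append("duplicate contributor")
        if sub.word_count < MIN_WORDS_PER_PAGE:
            reasons.append("too few words")
        if sub.fold_count > MAX_FOLDS:
            reasons.append("too many folds")
        if sub.deformed:
            reasons.append("page deformed")
        if reasons:
            rejected.append((index, reasons))
        else:
            # Only an accepted document uses up the contributor's single slot.
            accepted_contributors.add(sub.contributor_id)
            accepted.append(sub)
    return accepted, rejected


batch = [
    Submission("t-001", "W-2", 120, 1, False),
    Submission("t-001", "1099-R", 200, 0, False),
    Submission("t-002", "letter", 12, 2, False),
    Submission("t-003", "essay", 300, 0, True),
]
accepted, rejected = review(batch)
```

Checks like authenticity or whether redactions are "minimal" cannot be reduced to fields like these, which is where the staffed validation team comes in.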

Challenge: Meeting Privacy and Confidentiality Requirements
• AI applications that need sourced training data are typically still in
“development,” so the collection process needs to meet stringent confidentiality
requirements
• Privacy laws and policies need to be accounted for across different states,
countries, and regulatory regimes
• The Applause process and service ensure that sourcing can be blind to the testers
to protect confidentiality, while also ensuring documents are redacted to account
for all relevant laws and requirements, such as GDPR, HIPAA, and PII protection.
This includes replacing sensitive data with “dummy” data as needed.
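
To make the "dummy data" idea concrete, here is a minimal Python sketch that swaps two kinds of sensitive values in transcribed text for placeholders. The patterns, the placeholder values, and the `replace_sensitive` name are assumptions for illustration; real handwritten documents are images, and pattern matching alone would not satisfy GDPR or HIPAA without human review.

```python
import re
from typing import List, Tuple

# (pattern, dummy replacement) pairs; both rules are illustrative assumptions.
DUMMY_RULES: List[Tuple[re.Pattern, str]] = [
    # US Social Security number written as 123-45-6789
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "000-00-0000"),
    # Simple email address shape
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "user@example.com"),
]


def replace_sensitive(text: str) -> str:
    """Replace every match of each rule with its dummy value."""
    for pattern, dummy in DUMMY_RULES:
        text = pattern.sub(dummy, text)
    return text


sample = "SSN 123-45-6789, contact jane.doe@mail.example.org for the W-2."
redacted = replace_sensitive(sample)
```

Using realistic-looking dummy values rather than blacked-out gaps keeps the document's layout and token shapes intact for training.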

Things to Consider
• Diversity of testers
• Privacy concerns
• Recruit and train participants
• Ensure quality data
• Execute this at scale
• Evolve as your needs change
