Scaling Training Data
for AI Applications
Ron Schmelzer and Kathleen Walch
Principal Analysts, Cognilytica
Kristin Simonini
VP of Product, Applause
Today’s Speakers
Ron Schmelzer
Principal Analyst, Cognilytica
Kathleen Walch
Principal Analyst, Cognilytica
Kristin Simonini
VP of Product, Applause
Today’s Agenda
• Making AI a Reality
• The Seven Patterns of AI, and What Requires Training Data
• Leveraging a Global Community to Source Training Data
• Real Example of Overcoming the Challenges of a Training Data Sourcing
Project
• Cognilytica is an AI & Cognitive Technology-focused research and
advisory firm.
• Produces market research, advisory, and guidance on AI, ML, and
Cognitive Technology
• Produces the popular AI Today podcast, in addition to an infographic
series, whitepapers, webinars, newsletters, and other popular
content.
• Focused on enterprise and public sector adoption of AI, ML, and Cognitive
Technology
• Kathleen Walch and Ron Schmelzer are Principal Analysts and Managing
Partners of Cognilytica
• Contributing writers to Forbes, TechTarget (SearchEnterpriseAI), Cognitive
World, and CTOVision
About Cognilytica
• Data is the heart, soul, juju, of AI
• The specific data you need depends on the business problem you’re
solving and the kinds of predictive or goal outcomes you’re looking for
• Activities for data collection:
• Identifying the required data on which to train
• Identifying all the dimensions of that data needed for significant
predictive value
• Identifying the features that are required
• Identifying the sources of data
• Identifying the means to aggregate that data
• There is no exact answer to the question “How much data is needed?”
Identifying Data Sets for ML: Data Collection
Making AI a Reality
The Seven Patterns of AI
• Machines and humans interacting with each other through natural-language,
conversational interaction across a variety of forms of
communication, including voice, text, handwriting, and images.
• The objective of this pattern is machines interacting with humans the
way humans interact with each other.
The Conversation & Human Interaction Pattern
• Using ML to identify and understand images, sound, items,
handwriting, faces, and gestures.
• The objective of this pattern is to have machines identify and
understand the real world and unstructured data.
The Recognition Pattern
• Using machine learning and other cognitive
approaches to understand how to take past / existing
behavior and predict future outcomes or help
humans make decisions about future outcomes
using insight learned from past behavior /
interactions / data.
• The objective of this pattern is helping humans make
better decisions
Predictive Analytics & Decision Support
• Machine learning (esp. Deep Learning) is good at
recognizing patterns
• If you can train it, you can detect it
• If you can train it, you can detect patterns… or
things that don’t fit patterns
Pattern & Anomaly Detection
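
To make "things that don't fit patterns" concrete, here is a minimal sketch. It deliberately swaps the deep learning the slide mentions for a simple statistical stand-in: the "pattern" is the mean and standard deviation of known-normal training values, and anything too far from it is flagged. The function names, the sample values, and the 3-sigma threshold are illustrative assumptions, not something from the talk.

```python
from statistics import mean, stdev

def fit(values):
    """Learn the 'pattern': mean and standard deviation of normal training data."""
    return mean(values), stdev(values)

def is_anomaly(value, model, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean.

    The threshold of 3.0 is an assumed default for this sketch.
    """
    mu, sigma = model
    if sigma == 0:
        # All training values were identical; anything different is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical sensor readings considered normal.
training = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7]
model = fit(training)

print(is_anomaly(10.1, model))  # close to the learned pattern
print(is_anomaly(14.0, model))  # far from the learned pattern
```

The same dependency on training data applies here as in the slide: the detector is only as good as the examples of "normal" it was given.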
• Physical and virtual (software) systems that are able to
accomplish a task, achieve a goal, interact with their
surroundings, and perform their objective with minimal
or no human involvement.
• The objective of this pattern is minimizing human labor
Autonomous Systems
• In order for Supervised Learning approaches to work, they must
be fed clean, well-labeled data that the system can use to
learn from example.
• But how do you get Labeled Data?
• Do it yourself
• Find a source of already labeled data
• Get your Users to Do it
• Hire a Contractor Workforce
• Contract with Third Party Data Labeling Firms
Data Labeling: The Achilles Heel of AI
The Data Preparation & Engineering Pipeline
Data Acquisition / Ingest / Capture
• ETL
• Cloud-based data
Merging
• Combining data sources
Cleaning
• Deduping, removing extraneous, bad data
Labeling
• Adding machine learning labels and annotations for training
purposes
Enhancing
• Adding necessary additional data for models
Filtering
• Eliminating bias
Feature Engineering
• Assisting with enhancement (see future on multiplying
data sets)
Retraining Pipelines
• Creation of pipelines to deal with model iteration
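
The first few stages above (merging, cleaning, labeling) can be sketched as small composable functions. This is only an illustration under stated assumptions: the record shape (`{"text": ...}`), the function names, and the toy keyword labeler are all hypothetical, and real labeling would normally come from human annotators rather than a rule.

```python
def merge(*sources):
    """Merging: combine records from several sources into one list."""
    return [record for source in sources for record in source]

def clean(records):
    """Cleaning: drop empty text and case-insensitive exact duplicates."""
    seen, kept = set(), []
    for record in records:
        key = record["text"].strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append({**record, "text": record["text"].strip()})
    return kept

def label(records, labeler):
    """Labeling: attach the training label produced by `labeler`."""
    return [{**record, "label": labeler(record["text"])} for record in records]

def run_pipeline(sources, labeler):
    """Run merge -> clean -> label in order."""
    return label(clean(merge(*sources)), labeler)

# Hypothetical inputs from two systems.
crm = [{"text": "Refund please"}, {"text": "  "}]
tickets = [{"text": "refund please"}, {"text": "Love the app"}]

def toy_labeler(text):
    # Stand-in for human annotation, for illustration only.
    return "complaint" if "refund" in text.lower() else "praise"

dataset = run_pipeline([crm, tickets], toy_labeler)
print(dataset)
```

Keeping each stage a separate function makes it easier to rerun the whole chain when the model is retrained, which is the point of the retraining-pipeline stage.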
World’s Largest Community Of Vetted Digital Professionals
Available in real-time and selected to represent your customers.
Custom, Vetted Testing
& Feedback Teams
Any demographic, device, and region
to achieve your specific needs
Applause for AI: An End-to-End Solution
[Diagram: Training Data (Speech, Video, Text, Handwriting, Images) and a MACHINE-LEARNING ALGORITHM, with Testing Output evaluated through these Questions:]
• Did it understand me?
• Did I see or hear what I expected?
• Did it respond accurately?
• Were the recommendations relevant?
• Was the information captured correctly?
• Was it easy to use?
The Challenge: Sourcing Data for AI
• 81% of executives said training AI with data is more difficult than expected
• Main challenges included biased or erroneous data, not enough data, or
inability to label data.
• 60% of decision makers at firms adopting AI cite data quality as either
“challenging” or “very challenging.” (IDC)
• “Regardless of your beginner or expert AI status, data is the Debbie Downer of
any AI project.” (Forrester)
What we see in the Enterprise:
• You need LOTS of training data: Thousands to tens of
thousands of artifacts: Images, Videos, Documents,
Voice/Dialects
• You need QUALITY data, not just volume: Poor data
results in costly delays to the Product Development
Lifecycle
• You need a DIVERSE, global community of testers:
Gender, Age, Race, and Language are must-haves for today’s
AI applications. You can’t have one individual provide
100s of artifacts; you need 100s of testers to each provide
a single artifact
• You need to be able to rapidly EVOLVE: As product
teams train the algorithm, they often need to change
their sourcing requirements if they are not getting the
expected output.
The Challenge: Sourcing Data for AI
Quantity
Diversity
Quality
How Applause Solves
Sourcing Quality Data at Scale
Applause leverages a vetted community of over 400,000 testers in 200+ countries
to deliver a seamless sourcing solution that includes:
• Quality vs. Volume: We build agreements focused on usable data vs.
simple data collection
• Managed Service: End-to-end program that includes recruitment, quality
control, delivery, and tester training
• Privacy and Security: Seamlessly manages the complex privacy
landscape, including PII, HIPAA, GDPR, and any unique company
confidentiality requirements
• Elastic and Scalable: Unique business model enabling companies to
rapidly support evolving product and business requirements
HOW DOES THIS LOOK IN
PRACTICE?
How to
Source
Training Data
Use Case
Requirement:
Source thousands of real-world handwritten documents
• Blind collection with no PII data
• No one individual could submit more than a single document
• Minimum density required: Words per page
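
Two of the requirements above, one document per individual and a minimum word density, lend themselves to an automated first-pass check. The sketch below is an assumption-laden illustration: the field names (`doc_id`, `contributor_id`, `pages`, `text`), the 50-words-per-page threshold, and the idea that transcribed text is available are all hypothetical, and `contributor_id` is meant to be an opaque identifier so that collection stays blind.

```python
def validate_submissions(submissions, min_words_per_page=50):
    """Split submissions into accepted doc ids and (doc id, reason) rejections.

    Rules checked, in order:
      1. no contributor may have more than one accepted document
      2. each document must meet the minimum words-per-page density
    """
    accepted, rejected = [], []
    seen_contributors = set()
    for sub in submissions:
        words_per_page = len(sub["text"].split()) / max(sub["pages"], 1)
        if sub["contributor_id"] in seen_contributors:
            rejected.append((sub["doc_id"], "duplicate contributor"))
        elif words_per_page < min_words_per_page:
            rejected.append((sub["doc_id"], "below minimum word density"))
        else:
            seen_contributors.add(sub["contributor_id"])
            accepted.append(sub["doc_id"])
    return accepted, rejected

# Hypothetical submissions; contributor ids are opaque, not names.
subs = [
    {"doc_id": "d1", "contributor_id": "c1", "pages": 1, "text": "word " * 60},
    {"doc_id": "d2", "contributor_id": "c1", "pages": 1, "text": "word " * 80},
    {"doc_id": "d3", "contributor_id": "c2", "pages": 2, "text": "word " * 60},
]
accepted, rejected = validate_submissions(subs)
print(accepted)
print(rejected)
```

A check like this only screens out obvious failures; it does not replace human review of the documents themselves.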
Challenge: Recruit a High Number of Diverse
Participants
• Training Data required thousands of pages of real handwriting across a variety of
documents and personal artifacts, including (but not limited to):
• Prescriptions/doctors notes
• Purchase orders
• Credit applications
• Personal essays and letters
• Driver’s licenses and birth certificates from all 50 states
• Tax Forms
• Each handwriting sample had to be unique and could not be replicated across
types or groups
• The Applause service and platform is built to recruit and incentivize thousands of
testers to deliver documents with specific requirements, such as word density and
redaction of all personal information
Challenge: Extremely specific requirements
• On top of unique testers, there was a requirement for unique forms with specific
requirements
• Tax Forms required a diversity of different types: W-2, pay stubs, IRS
1098-T, IRS 1099-R, IRS 1099-DIV, and others
• Each document had specifications
• No more than 1 single folded margin in the middle
• No deformations on the page
• Minimum number of words per page
• Each document needed to be authentic, but with minimal redactions
• Automation only gets you so far. You need a proven QA and Validation process
that is staffed by an experienced team to check multiple requirements and
dependencies
Challenge: Meeting Privacy and Confidentiality
requirements
• AI applications that need sourced training data are typically still in
“development,” so the collection process needs to meet stringent confidentiality
requirements
• Privacy laws and policies need to be accounted for across different states,
countries, and regulatory regimes
• The Applause process and service ensures that sourcing can be blind to the testers
to protect confidentiality, while also ensuring documents are redacted to account
for all relevant requirements, such as GDPR, HIPAA, and PII rules. This includes
replacing sensitive data with “dummy” data as needed.
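
Replacing sensitive data with "dummy" data can be illustrated with a small pattern-based sketch. This is not how Applause does it and is not a compliance solution: the three patterns (US-style SSN, email, phone), the placeholder values, and the function name are assumptions chosen for the example, and regular expressions will miss many kinds of personal information.

```python
import re

# Hypothetical patterns and dummy replacements, for illustration only.
PATTERNS = {
    "SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "000-00-0000"),
    "EMAIL": (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "user@example.com"),
    "PHONE": (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "555-555-0100"),
}

def redact(text):
    """Replace each matched value with its dummy and count the replacements."""
    counts = {}
    for name, (pattern, dummy) in PATTERNS.items():
        text, n = pattern.subn(dummy, text)
        counts[name] = n
    return text, counts

sample = "SSN 123-45-6789, mail jane.doe@corp.com, call 617-555-1234"
redacted, counts = redact(sample)
print(redacted)
print(counts)
```

Returning the counts alongside the text gives reviewers a quick signal of which documents contained sensitive values and may deserve a closer manual look.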
Things to
Consider
• Diversity of testers
• Privacy concerns
• Recruit and train participants
• Ensure quality data
• Execute this at scale
• Evolve as your needs change
Q&A
The Essentials to Successful User-Centric DevelopmentThe Essentials to Successful User-Centric Development
The Essentials to Successful User-Centric Development
 
Lessons Learned from the Holiday Shopping Season
Lessons Learned from the Holiday Shopping SeasonLessons Learned from the Holiday Shopping Season
Lessons Learned from the Holiday Shopping Season
 
6 Secrets to Omnichannel and Digital Success
6 Secrets to Omnichannel and Digital Success6 Secrets to Omnichannel and Digital Success
6 Secrets to Omnichannel and Digital Success
 
Solving the Top Problems of Voice Testing
Solving the Top Problems of Voice TestingSolving the Top Problems of Voice Testing
Solving the Top Problems of Voice Testing
 
NRF 2019: 5 Secrets to Omnichannel and Retail Success
NRF 2019: 5 Secrets to Omnichannel and Retail SuccessNRF 2019: 5 Secrets to Omnichannel and Retail Success
NRF 2019: 5 Secrets to Omnichannel and Retail Success
 
Testing Essentials for the Age of Agile
Testing Essentials for the Age of AgileTesting Essentials for the Age of Agile
Testing Essentials for the Age of Agile
 
From Padlocks to IoT: Master Lock's Keys to Digital Transformation
From Padlocks to IoT: Master Lock's Keys to Digital TransformationFrom Padlocks to IoT: Master Lock's Keys to Digital Transformation
From Padlocks to IoT: Master Lock's Keys to Digital Transformation
 

Recently uploaded

Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
ScyllaDB
 
From NCSA to the National Research Platform
From NCSA to the National Research PlatformFrom NCSA to the National Research Platform
From NCSA to the National Research Platform
Larry Smarr
 
Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!
Tobias Schneck
 
APJC Introduction to ThousandEyes Webinar
APJC Introduction to ThousandEyes WebinarAPJC Introduction to ThousandEyes Webinar
APJC Introduction to ThousandEyes Webinar
ThousandEyes
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
anilsa9823
 
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessMongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
ScyllaDB
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2
DianaGray10
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Real-Time Persisted Events at Supercell
Real-Time Persisted Events at  SupercellReal-Time Persisted Events at  Supercell
Real-Time Persisted Events at Supercell
ScyllaDB
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
leebarnesutopia
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
ScyllaDB
 
Building a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data PlatformBuilding a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data Platform
Enterprise Knowledge
 
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes
 
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDCScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB
 
Multivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back againMultivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back again
Kieran Kunhya
 
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc
 

Recently uploaded (20)

Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google CloudRadically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
Radically Outperforming DynamoDB @ Digital Turbine with SADA and Google Cloud
 
From NCSA to the National Research Platform
From NCSA to the National Research PlatformFrom NCSA to the National Research Platform
From NCSA to the National Research Platform
 
Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!Containers & AI - Beauty and the Beast!?!
Containers & AI - Beauty and the Beast!?!
 
APJC Introduction to ThousandEyes Webinar
APJC Introduction to ThousandEyes WebinarAPJC Introduction to ThousandEyes Webinar
APJC Introduction to ThousandEyes Webinar
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
Call Girls Chennai ☎️ +91-7426014248 😍 Chennai Call Girl Beauty Girls Chennai...
 
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to SuccessMongoDB to ScyllaDB: Technical Comparison and the Path to Success
MongoDB to ScyllaDB: Technical Comparison and the Path to Success
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2Communications Mining Series - Zero to Hero - Session 2
Communications Mining Series - Zero to Hero - Session 2
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Real-Time Persisted Events at Supercell
Real-Time Persisted Events at  SupercellReal-Time Persisted Events at  Supercell
Real-Time Persisted Events at Supercell
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
 
Building a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data PlatformBuilding a Semantic Layer of your Data Platform
Building a Semantic Layer of your Data Platform
 
ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024ThousandEyes New Product Features and Release Highlights: June 2024
ThousandEyes New Product Features and Release Highlights: June 2024
 
ScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDCScyllaDB Real-Time Event Processing with CDC
ScyllaDB Real-Time Event Processing with CDC
 
Multivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back againMultivendor cloud production with VSF TR-11 - there and back again
Multivendor cloud production with VSF TR-11 - there and back again
 
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...
 

Scaling Training Data for AI Applications

  • 1. Scaling Training Data for AI Applications Ron Schmelzer and Kathleen Walch Principal Analysts, Cognilytica Kristin Simonini VP of Product, Applause
  • 2. Today’s Speakers: Ron Schmelzer, Principal Analyst, Cognilytica; Kathleen Walch, Principal Analyst, Cognilytica; Kristin Simonini, VP of Product, Applause
  • 3. Today’s Agenda • Making AI a Reality • The Seven Patterns of AI, and What Requires Training Data • Leveraging a Global Community to Source Training Data • Real Example of Overcoming the Challenges of a Training Data Sourcing Project
  • 4. About Cognilytica • Cognilytica is an AI & Cognitive Technology-focused research and advisory firm. • Produce market research, advisory and guidance on AI, ML, and Cognitive Technology • Produce the popular AI Today podcast, in addition to infographic series, whitepapers, webinars, newsletters, and other popular content. • Focused on enterprise and public sector adoption of AI, ML, and Cognitive Technology • Kathleen Walch and Ron Schmelzer are Principal Analysts and Managing Partners of Cognilytica • Contributing writers to Forbes, TechTarget (SearchEnterpriseAI), Cognitive World, and CTOVision
  • 5. Identifying Data Sets for ML: Data Collection • Data is the heart, soul, and juju of AI • The specific data you need depends on the business problem you’re solving and the kinds of predictive or goal outcomes you’re looking for • Activities for data collection: • Identifying the required data on which to train • Identifying all the dimensions required for that data for predictive value of significance • Identifying the features that are required • Identifying the sources of data • Identifying the means to aggregate that data • There is no exact answer to the question “How much data is needed?”
  • 6. Making AI a Reality: The Seven Patterns of AI
  • 7. The Conversation & Human Interaction Pattern • Machines and humans interacting with each other using natural language, conversational forms of interaction across a variety of forms of communication, including voice, text, written, and image forms. • The objective of this pattern is machines interacting with humans the way humans interact with each other.
  • 8. The Recognition Pattern • Using ML to identify and understand images, sound, items, handwriting, faces, and gestures. • The objective of this pattern is to have machines identify and understand the real world and unstructured data.
  • 9. Predictive Analytics & Decision Support • Using machine learning and other cognitive approaches to understand how to take past / existing behavior and predict future outcomes or help humans make decisions about future outcomes using insight learned from past behavior / interactions / data. • The objective of this pattern is helping humans make better decisions.
  • 10. Pattern & Anomaly Detection • Machine learning (esp. Deep Learning) is good at recognizing patterns • If you can train it, you can detect it • If you can train it, you can detect patterns… or things that don’t fit patterns
  • 11. Autonomous Systems • Physical and virtual (software) systems that are able to accomplish a task, achieve a goal, interact with their surroundings, and perform their objective with minimal or no human involvement. • The objective of this pattern is minimizing human labor.
  • 12. Data Labeling: The Achilles Heel of AI • In order for Supervised Learning approaches to work, they must be fed clean, well-labeled data that the system can use to learn from example. • But how do you get Labeled Data? • Do it yourself • Find a source of already labeled data • Get your Users to Do it • Hire a Contractor Workforce • Contract with Third-Party Data Labeling Firms
  • 13. The Data Preparation & Engineering Pipeline • Data Acquisition / Ingest / Capture: ETL, cloud-based data • Merging: combining data sources • Cleaning: deduping, removing extraneous, bad data • Labeling: adding machine learning labels and annotations for training purposes • Enhancing: adding necessary additional data for models • Filtering: eliminating bias • Feature Engineering: assisting with enhancement (see future on multiplying data sets) • Retraining Pipelines: creation of pipelines to deal with model iteration
  • 14. World’s Largest Community of Vetted Digital Professionals • Custom, Vetted Testing & Feedback Teams: available in real time and selected to represent your customers • Any demographic, device, and region to achieve your specific needs
  • 15. Applause for AI: An End-to-End Solution • Training Data (input to the machine-learning algorithm): Speech, Video, Text, Handwriting, Images • Testing Output (questions): Did it understand me? Did I see or hear what I expected? Did it respond accurately? Were the recommendations relevant? Was the information captured correctly? Was it easy to use?
  • 16. The Challenge: Sourcing Data for AI • 81% of executives said training AI with data is more difficult than expected • Main challenges included biased or erroneous data, not enough data, or inability to label data. • 60% of decision makers at firms adopting AI cite data quality as either “challenging” or “very challenging.” (IDC) • “Regardless of your beginner or expert AI status, data is the Debbie Downer of any AI project.” (Forrester)
  • 17. The Challenge: Sourcing Data for AI (Quantity, Diversity, Quality) • What we see in the Enterprise: • You need LOTS of training data: Thousands to tens of thousands of artifacts: Images, Videos, Documents, Voice/Dialects • You need QUALITY data, not just volume: Poor data results in costly delays to the Product Development Lifecycle • You need a DIVERSE, global community of testers: Gender, Age, Race, Language are must-haves for today’s AI applications. You can’t have one individual provide 100s of artifacts; you need 100s of testers to provide single artifacts • You need to be able to rapidly EVOLVE: As Product teams train the algorithm, they often need to change their sourcing requirements if they are not getting the expected output.
  • 18. How Applause Solves: Sourcing Quality Data at Scale • Leveraging a vetted community of over 400,000 testers in 200+ countries, Applause delivers a seamless sourcing solution that includes: • Quality vs. Volume: We build agreements focused on usable data vs. simple data collection • Managed Service: End-to-end program that includes recruitment, quality control, delivery, tester training • Privacy and Security: Seamlessly manages the complex privacy landscape, including PII, HIPAA, GDPR and unique company confidential requirements that may be required • Elastic and Scalable: Unique business model enabling companies to rapidly support evolving product and business requirements
  • 19. HOW DOES THIS LOOK IN PRACTICE?
  • 20. How to Source Training Data: Use Case • Requirement: Source thousands of real-world handwritten documents • Blind collection with no PII data • No one individual could submit more than a single document • Minimum density required: Words per page
  • 21. Challenge: Recruit a High Number of Diverse Participants • Training Data required thousands of pages of real handwriting across a variety of documents and personal artifacts, including (but not limited to): • Prescriptions/doctors’ notes • Purchase orders • Credit applications • Personal essays and letters • Driver’s licenses and birth certificates from all 50 states • Tax Forms • Each handwriting sample had to be unique and could not be replicated across types or groups • The Applause service and platform is built to recruit and incentivize thousands of testers to deliver documents with specific requirements, such as word density and redaction of all personal information
  • 22. Challenge: Extremely specific requirements • On top of unique testers, there was a requirement for unique forms with specific requirements • Tax Forms required a diversity of different types: W-2, Pay stubs, IRS 1098-T, IRS 1099-R, IRS 1099-DIV, and others • Each document had specifications • No more than 1 single folded margin in the middle • No deformations on the page • Minimum number of words per page • Each document needed to be authentic, but with minimal redactions • Automation only gets you so far. You need a proven QA and Validation process that is staffed by an experienced team to check multiple requirements and dependencies
  • 23. Challenge: Meeting Privacy and Confidentiality requirements • AI applications that need training data are typically still in “development”, so the collection process needs to meet stringent confidentiality requirements • Privacy laws and policies need to be accounted for across different states, countries, and regulatory regimes • The Applause process and service ensures that sourcing can be blind to the testers to protect confidentiality, while also ensuring documents are redacted to account for all relevant laws and requirements, such as GDPR, HIPAA, and PII protection. This includes replacing sensitive data with “dummy” data as needed.
  • 24. Things to Consider • Diversity of testers • Privacy concerns • Recruit and train participants • Ensure quality data • Execute this at scale • Evolve as your needs change
  • 25. Q&A
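
The document-level requirements on slides 20 to 22 (one document per contributor, a minimum number of words per page, at most one fold, no page deformations) could be pre-screened automatically before the human QA step the slides call for. The following is a minimal sketch under stated assumptions, not Applause's actual process: the `Submission` fields, the `prescreen` function, and the 50-word threshold are all hypothetical, since the slides do not give the real minimum.

```python
from dataclasses import dataclass

# Hypothetical threshold: the slides require a minimum word density
# per page but do not state the number.
MIN_WORDS_PER_PAGE = 50
# From slide 22: "No more than 1 single folded margin in the middle".
MAX_FOLDS = 1


@dataclass(frozen=True)
class Submission:
    contributor_id: str     # anonymised ID, carries no PII
    doc_type: str           # e.g. "W-2", "personal letter"
    word_count: int         # handwritten words on the page
    fold_count: int         # number of visible folds
    has_deformations: bool  # tears, creases, warping, etc.


def prescreen(submissions):
    """Split submissions into accepted ones and (submission, reasons) rejects."""
    seen_contributors = set()
    accepted, rejected = [], []
    for sub in submissions:
        reasons = []
        if sub.contributor_id in seen_contributors:
            reasons.append("contributor already submitted a document")
        if sub.word_count < MIN_WORDS_PER_PAGE:
            reasons.append("too few words per page")
        if sub.fold_count > MAX_FOLDS:
            reasons.append("more than one fold")
        if sub.has_deformations:
            reasons.append("page is deformed")
        if reasons:
            rejected.append((sub, reasons))
        else:
            # Only an accepted document uses up the contributor's single slot.
            seen_contributors.add(sub.contributor_id)
            accepted.append(sub)
    return accepted, rejected
```

A contributor is recorded only once a document is accepted, so someone whose first scan is rejected can still resubmit. Checks like authenticity and redaction quality are not modelled here and would still need human review.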

Editor's Notes

  1. Sarah runs through the agenda
  2. If possible, mention that Applause can source data
  3. Click for animation. Kristin to start from the end of this slide.
  4. The size, breadth, and quality of our community is what enables us to deliver immense value to our clients. Our community has several hundred thousand testers. Each member of the community is carefully vetted (profile, NDA, assessments, courses) to make sure feedback is provided in a detailed and concise manner. The community is diverse, with QA professionals, usability experts, and people with no technical background (average Joe off the street), so you get the right type of feedback. If you need access to someone in England with a certain type of credit card, we can do that. In the past year, the community submitted a million pieces of feedback (bug reports, test cases, completed usability surveys, etc.); that’s over 2,700 a day. We are doing this at scale for the world’s largest brands.
  5. On the right-hand side of this slide, you can see the ‘Testing Output’ portion. This is something that Applause has been doing for years and years. In the last couple years, we’ve identified another area that really only Applause, and a globally managed vetted community, can help with, and that’s providing quality training data at scale. So you can see the different types of data that Applause can source from the global community, from handwriting and text to speech and video. We’ll talk a lot more in this presentation about some of the work we’ve done with sourcing handwriting to train an AI algorithm for how to read handwriting.
  6. When we’re talking about sourcing data, there are some major challenges out in the market. A lot of organizations start to go down this path, and then realize it’s actually much more challenging than they might’ve realized. Sourcing training data on your own is, to be frank, extremely challenging at best, and possibly outright impossible. You might not have access to the # of people you need. Even if you do have access to that # of people, you need to ensure you’re getting quality data. If you do get quality data, you need a team to annotate and label the data. And even if you do all that, you need to think about diversity and being able to evolve over time. It’s a massive challenge of logistics and overhead.
  7. The “Wheel” of challenges includes data quality (bias, errors), lack of quantity, and diversity. You need thousands of artifacts to properly train an AI algorithm. For example, we recently did some work with BBC to train their voice assistant, Beeb, and the algorithm required over 105,000 voice utterances, which Applause provided for BBC. But of course, having a lot of data is pretty much worthless if it’s poor quality. If your data isn’t labeled correctly, or if it’s not in the right format to begin with, it can delay your project and sometimes is completely useless, depending on the data type. Diversity is the third element. If you’re building an AI algorithm, you don’t want to rely on 1 single person to provide the artifacts. That’s not going to lead to a strong AI output. So getting data from not just a lot of sources, but a wide variety of sources, is impactful. And of course, these projects evolve, and you need to be able to evolve with them.
  8. So we talk about the challenges. How can Applause solve for them? Our system can produce usable data and follow some pretty strict requirements. Limited overhead: we’re a white-glove service and our internal teams manage the recruitment of data providers, we thoroughly evaluate the data artifacts we get from a source, and we train our testers to follow your requirements strictly. And we haven’t mentioned it yet, but Privacy and Security is a major element to consider. There are compliance laws to consider, such as GDPR. Applause works within those confines to ensure confidentiality while also providing useful data for a customer. And then elasticity and scalability: our model can shift as a customer’s requirements change, which can happen quite a bit with AI projects.
  9. Let’s look at an example of a real customer, and how Applause sourced training data for them. Here are some of the challenges that come up with this kind of project.
  10. Want to share an example. We worked with an organization to build an algorithm that can read handwritten documents. So the idea is that you could scan a handwritten document, and the AI algorithm could read and understand the document. The software scans the form and identifies the keys and values. It detects the form field name. The content is the value -- even if it was filled in with a typewriter field, it might not be in the same place on every form. The software needs to understand the difference between the key and the value. So how do you acquire the data that you need to train that algorithm? And what are some of the challenges that come up there? For one, we needed to source documents globally to acquire different handwriting styles, languages, and other critical factors. Example 1 is Amazon.
  11. This was a project where the customer was looking for handwriting samples, to teach an algorithm to read handwriting. It needed thousands of handwriting samples to work, so again, there’s the quantity aspect coming up. But this couldn’t just be 1 person submitting 500 or 1000 samples; that would’ve made this a lot easier to execute. For this project to work, each person could only submit 1 handwriting sample. In other words, we needed thousands of unique handwriting samples. Since each document had to come from a unique person, Applause had to recruit thousands of people. This is the kind of project that really only an organization like Applause can satisfy. We sourced well over 1000 folks in our community who were willing to provide handwritten documents. Why is an essay or letter valuable? It’s about handwriting recognition, so Applause asked our community for handwritten essays and letters. We had folks digging in their closets for things someone had written 10 years ago. We were looking for unique handwriting samples. We even had someone once ask for SAT and ACT essays, but obviously this wasn’t something we could provide.
  12. In addition to getting a lot of people, Applause was being asked to produce a lot of different types of documents. So you can see, with the tax forms, we needed to provide a lot of different types. We had a team that could manage this and ensure we were bringing in the diversity of documents we needed to help the algorithm. We needed at least 50 W-2 forms, 50 IRS 1098-T forms, etc. And then the requirements of the documents themselves, which gets into the “quality” of the data: no deformations on the page, the page can have no more than 1 single folded margin in the middle, and there were several specifications for that. And you’re scanning these documents, so they need to be captured in good lighting, or with the flash in dark settings. Redactions: the testers had to put in their own dummy data to protect PII. There’s a lot of overhead that comes with this project, especially at scale. So having a team that can not only manage this project, but knows what to look for, is really crucial to success.
  13. And finally, privacy is a major concern here. We’ve got GDPR and HIPAA that you need to consider. So by giving a company a completed tax document or healthcare form, you could be violating some laws or opening yourself up to a lawsuit. So if you’re trying to do this on your own, there’s a lot of overhead: you can imagine a team of 10+ people having to work around the clock for weeks if not months to remove PII and ensure confidentiality. Here at Applause, we have processes in place where we can protect confidentiality. And in this case, we instructed testers to fill out the forms with dummy data. That way, the organization is still getting the handwriting sample, but there’s no sacrifice of PII.
  14. Quantity: need hundreds, if not thousands, of individuals to make this work. Diversity: requires many different types of data (geographic, document type, etc.) to properly train the algorithm. Privacy and confidentiality: need process and dedicated resources to ensure privacy and avoid violating GDPR or exposing PII; sourcing often needs to be blind and account for the nature of the product. Process and sustainable model: sourcing training data for AI at scale is a major undertaking, so you need a team that is wholly dedicated to delivering on this project.
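
Note 12 mentions per-type minimums ("at least 50 W-2 forms, 50 IRS 1098-T forms, etc."). One way a sourcing team might track how many documents of each type are still outstanding is sketched below. This is an illustrative assumption rather than a description of Applause's tooling: the `REQUIRED` mapping holds only the two minimums named in the notes, and `remaining_quota` is a hypothetical helper.

```python
from collections import Counter

# Minimums named in the speaker notes; the notes say "etc.", so other
# form types would be added here once their targets are known.
REQUIRED = {"W-2": 50, "IRS 1098-T": 50}


def remaining_quota(accepted_doc_types, required=REQUIRED):
    """Return {doc_type: number still needed} for every unmet minimum."""
    counts = Counter(accepted_doc_types)
    return {
        doc_type: minimum - counts[doc_type]
        for doc_type, minimum in required.items()
        if counts[doc_type] < minimum
    }
```

Document types that are not in `required` are simply ignored, which lets recruitment shift toward whichever types remain short as requirements evolve.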