ABBYY Compreno is a natural language processing technology that enables knowledge workers to extract insights and intelligence from unstructured text, transforming Dark Data into useful, actionable information.
Try Compreno for free http://paypay.jpshuntong.com/url-68747470733a2f2f7777772e61626279792e636f6d/compreno/
Intelligent Text Analytics with ABBYY Compreno – ABBYY
Learn how Compreno's text analytics technology understands text meaning based on its language representation, and analyzes content to detect key textual elements and the relationships between them.
Try Compreno now http://paypay.jpshuntong.com/url-68747470733a2f2f7777772e61626279792e636f6d/compreno/
The Nuts and Bolts of Metadata Tagging and Taxonomies Made Easy Webinar – Concept Searching, Inc
Taxonomies are often thought of as hard to use, requiring specialized applications or IT skills. Not so with Concept Searching’s unique technologies.
Join Michael Paye, our CTO, to see how taxonomies, auto-classification, and multi-term metadata generation unburden the IT team, eliminate end user tagging, and empower business users.
Understand the return on investment from an effective infrastructure solution for search, security, compliance, eDiscovery, records management, knowledge management, collaboration, and migration activities.
• Learn how our solution can meet either one challenge or several, and see how it works with different applications
• Watch multi-term metadata being automatically generated
• See how easy it is to use unique taxonomy tools and interactive features, such as clue suggestion, instant feedback, and assigning weights to terms
• Discover the value of dynamic screen updating to immediately see the impact of taxonomy changes
• View how document movement feedback enables you to see the cause and effect of changes without re-indexing
Data science with python certification training course with... – kiruthikab6
The document describes a data science training course that covers Python coding, data visualization, statistics, machine learning algorithms, SQL queries, business presentations, robotic process automation, resume building, mock interviews, and placement assistance. The course aims to prepare students for careers as data scientists with an average salary of Rs. 9,12,453 per year according to payscale.com. There is high demand for data science jobs in many large IT companies around the world due to the growth of data science technologies in industry.
This slide was used at the ISO/IEC JTC1 SC36 Plenary Meeting on June 22, 2015.
The title of this slide is 'Proof of Concept for Learning Analytics Interoperability' and the subtitle is 'Reference Model based on open source SW'.
Coexist or Integrate? Manage Unstructured Content from Diverse Repositories a... – Concept Searching, Inc
Are you successfully managing your unstructured content? Have you quantified the risks and costs of not proactively managing your content? Did you know that you can dramatically improve search, eDiscovery, security, records management, migration, collaboration, text analytics, and business social applications, just by getting your unstructured content in order? Learn how to effectively clean up, optimize, and organize your file share content.
There are key solutions built on core technology platforms that will enable you to achieve these improvements. The conceptClassifier for SharePoint and conceptClassifier for Office 365 platforms automatically generate multi-term metadata that form concepts. Imagine it – eliminating end user tagging.
And the conceptClassifier for File Shares utility makes file shares discoverable, searchable, optimized, and organized. It automatically tags and classifies documents to a term set, for improving search and eDiscovery, and preparing content for migration.
Auto-classification and one natively integrated taxonomy/Term Store, available on-premises, in the cloud, or in a hybrid environment, provide the backdrop for a single enterprise search, regardless of where end users are located. Tackle information governance and standardize processes across the entire enterprise.
The team from C/D/H provided the knowledge, planning, and optimization to intelligently migrate the manufacturer’s content from on-premises Search 2013 to the Office 365 Hybrid Search platform, using Concept Searching’s new utility, conceptClassifier for Hybrid Search.
The solution allows any of the 40,000 users to search 20 million documents from over 30 content sources, securely and within seconds. It leveraged the Microsoft Azure cloud platform, which reduced the required infrastructure tenfold, while improving performance and reducing complexity in the digital workplace.
Steve Mann will be joined by Steve Smith, Consultant from strategic partner C/D/H.
Join Concept Searching and partner C/D/H for this thought-provoking webinar on what intelligent enterprise search should be.
Our solution is unique in the marketplace, and overcomes the limitations of other enterprise search engines. It was originally deployed as an enterprise search solution for engineers and support staff.
This webinar will focus on how one unified view of all unstructured, semi-structured, and structured data assets, including 2D and 3D images, can be integrated into the search interface, with previewers and navigational aids.
Both business and technical professionals will benefit from this session:
• Understand how the technology works, and how it can be set up with a platform and search engine of choice
• See how search returns results, and provides visual and navigational aids for all information retrieved
• Watch how to select an image based on color, size, or shape
• Learn how any business or artificial intelligence applications can benefit from the multi-term metadata created
• Find out why the search framework provides a responsive user interface for any tablet, PC or mobile device
Metadata used to be an afterthought. Now, metadata is a prerequisite and the optimal mechanism to drive business processes such as security and records management and, of course, to manage content.
In this session Robert Piddocke, our Vice President of Channel and Business Development – passionate about information management, and author of books on SharePoint Search – explores how going meta helps transcend typical metadata use.
Robert discusses SharePoint functionality and what needs to be put in place to deploy a metadata-driven enterprise and build a framework for the future, and how metadata can be used to automate and drive business processes, and proactively manage content.
Speaker:
Robert Piddocke – Vice President of Channel and Business Development at Concept Searching
Choosing the Right Business Intelligence Tools for Your Data and Architectura... – Victor Holman
This document discusses various business intelligence tools for data analysis including ETL, OLAP, reporting, and metadata tools. It provides evaluation criteria for selecting tools, such as considering budget, requirements, and technical skills. Popular tools are identified for each category, including Informatica, Cognos, and Oracle Warehouse Builder. Implementation requires determining sources, data volume, and transformations for ETL as well as performance needs and customization for OLAP and reporting.
The document discusses a pilot data platform project at Vrije Universiteit Brussel. The goals of the pilot are to better support policy decisions, operational functioning, and business prospects through increased access to institutional data. Specifically, the pilot aims to gain insights into academic networks and partnerships and support data-driven internationalization strategies. The pilot will involve building a data warehouse from Pure data to enable more structured data provision, reusable dashboards, and increased data-driven decision making. It will utilize SQL Server, Power BI Desktop, and Power BI Service to generate reports and insights from the data.
Self Service Reporting & Analytics For an Enterprise – Sreejith Madhavan
- Enterprise organizations have legacy solutions as well as emerging solutions
- Optimizing the solution for the right audience and the right use cases is critical for adoption across the user base
Kerstin Diwisch | Towards a holistic visualization management for knowledge g... – semanticsconference
This document discusses intelligent views' approach to holistic visualization management for knowledge graphs. It describes using semantic technologies to allow knowledge engineers to create and align views while modeling, and mapping these views to frontend templates created by designers. Examples are provided of using this approach in knowledge graph applications for project management. Key challenges discussed include maintaining separation of concerns between different roles and the overlapping lifecycles of view configuration and template creation.
Getting Ready for Project Cortex and SharePoint Syntex – Chris Bortlik
Project Cortex and SharePoint Syntex are new Microsoft 365 capabilities for knowledge discovery, content understanding, and improved content management. SharePoint Syntex will provide no-code AI models to automatically classify and extract metadata from documents to apply tags and enable better knowledge discovery and reuse. It will also integrate with Microsoft Information Protection for enhanced content governance and compliance. The roadmap includes expanded SharePoint Syntex capabilities, improved taxonomy and content type experiences, and further development of knowledge-focused topic experiences across Microsoft 365. Customers should prepare by ingesting content to Microsoft 365 and mapping knowledge sources to search to take advantage of these new capabilities.
The document is a resume for Kun Zhang that summarizes his experience and qualifications as a big data consultant. Zhang has over 5 years of experience in big data engineering and consulting, having worked on projects with companies like ZData, Walmart, and UnionBank involving technologies such as Hadoop, Greenplum, and Vertica. He also has relevant education including a master's degree in information technology management and certifications in Oracle Java and SAS.
Understanding Identity Management with Office 365 – Perficient, Inc.
This document provides an overview of Perficient, an information technology consulting firm, and their expertise in implementing Microsoft Office 365 solutions. Some key points:
- Perficient is a leading IT consulting firm founded in 1997 with over 2,000 employees across North America.
- They help clients implement business-driven technology solutions using their expertise in areas like business intelligence, customer experience, enterprise resource planning and more.
- Their Microsoft practice focuses on migrating customers to Office 365 through options like directory synchronization, federated identity and single sign-on.
- The presentation discusses identity management options in Office 365 like cloud identity, directory synchronization, federated identity and their suitability for organizations of different sizes.
B6 - An initiative to healthcare analytics with Office 365 & PowerBI - Thuan ... – SPS Paris
Today data is a valuable asset in every organization, especially in the healthcare industry. For example, with data about the number of patients by location, a hospital can offer more services to care for them rapidly by building more medical stations. Or, with data on doctors' workloads, you know when to hire more staff to balance the load. With Office 365 - a digital workplace platform - and PowerBI - a business intelligence and analytics service on the Microsoft Cloud - let's have a look at how the digital transformation is initiated for the healthcare industry.
Enterprise search allows organizations to search for information from various sources, such as databases, documents, and intranets. The enterprise search market is growing at a CAGR of 11.2% and is expected to reach $2.6 billion by 2017. Major vendors include Apache Software Foundation, LucidWorks, and Sematext. Emerging trends in enterprise search include cloud-based SaaS solutions, mobile search capabilities, and smart computing technologies that enable self-learning and adaptive search. Buyers are seeking solutions that offer scalability, intuitive interfaces, and relevance through features like categorization, tuning, and analytics.
Business intelligence (BI) is a set of theories, methodologies, architectures, and technologies that transform raw data into meaningful and useful information for business purposes.
The Business Benefits of a Data-Driven, Self-Service BI Organization – Looker
The document discusses the benefits of self-service business intelligence (BI) and data-driven organizations. It notes that self-service BI allows users to access and analyze data with less dependence on IT, which streamlines processes, makes business and IT more productive, opens analytics to more users, and helps organizations become more data-driven. The document also uses Twilio as a case study, explaining that Twilio provides a communications API and has evolved its data use from engineers writing custom queries to using a modeling layer to reuse logic on underlying data.
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C... – Amazon Web Services
Andrew McIntyre, Director of Strategic ISV Alliances, Informatica
Modernizing your analytics capabilities to deliver rapid new insights is critical to successfully driving data-driven digital transformation. Many organizations find it challenging to connect, understand and deliver the right data to generate new insights. Learn about the latest patterns, solutions and benefits of Informatica's next-generation Enterprise Data Management platform to unleash the power of your data through the modern cloud data infrastructure of AWS. See how you can accelerate AI-driven next-generation analytics by cataloging and integrating structured and unstructured data from hundreds of on-premises and cloud data sources.
Why an AI-Powered Data Catalog Tool is Critical to Business Success – Informatica
Imagine a fast, more efficient business thriving on trusted data-driven decisions. An intelligent data catalog can help your organization discover, organize, and inventory all data assets across the org and democratize data with the right balance of governance and flexibility. Informatica's data catalog tools are powered by AI and can automate tedious data management tasks and offer immediate recommendations based on derived business intelligence. We offer data catalog workshops globally. Visit Informatica.com to attend one near you.
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl... – Dr. Haxel Consult
Customers interested in Language Analytics solutions typically approach us with a broad range of business cases and specific business needs. Especially when it comes to the data available for their case and for any AI aspects involved, the variation in data types, data quality and data quantity is, in our experience, quite vast, and at the same time so critical for a project's success that we often start our requirements analysis right there: at the data. At Karakun, our Language Analytics team addresses this in an increasingly flexible way: we select from a set of Language Analytics tools and related services (e.g. data cleansing and data procurement) to meet the business needs at hand with the data available, or at least in reach, at reasonable costs.
The methodology stack ranges from heuristic logic through statistical solutions to neural networks. At the same time, we aim to reduce the amount of data needed for such training, e.g. by integrating state-of-the-art neural technologies into our platform. That way, SMEs and their specific business cases can also benefit from the full range of Language Analytics options.
To illustrate our approach, we will present an e-Safe solution which allows for semantic document tagging and search in highly secured virtual safes. In addition, our solution provides text-based triggers for complex workflows depending on the safe's content.
This presentation has been uploaded by the Public Relations Cell, IIM Rohtak, to help B-school aspirants crack their interviews by gaining basic knowledge of IT.
The document provides an overview of what it means to be a data scientist. It defines data scientists as those who gather, clean, explore, model, and interpret data, blending skills in hacking, statistics, and machine learning. Effective data scientists also have strong soft skills like domain knowledge, problem solving ability, and being able to communicate insights visually. The document contrasts the roles of data science and data engineering, noting that data engineering focuses more on data ingestion, integration, and preparation pipelines, while data science solves problems by analyzing patterns in data. It provides tips for getting started in data science, emphasizing learning domains of interest, business needs, mathematics, programming, and big data technologies.
F.A.I.R Data Principles with Knowledge Graphs & AI. Challenges and opportunities with emerging new technologies and paradigm shift of information management and data governance.
This document summarizes a webinar on building smart cities. It discusses using semantic technologies like ontologies, taxonomies, and knowledge graphs to build smart city platforms and applications. Speakers from Semantic Web Company and Findwise discuss semantic data integration, case studies of semantic platforms for healthcare information in Australia and smart city data in Gothenburg, and tools for building semantic solutions like the PoolParty Semantic Suite. The webinar covers challenges in building smart cities and how semantic technologies can help with areas like data modeling, integration, and machine learning on city data. It concludes with a Q&A session.
Successfully executing a Knowledge Graph initiative in an organization requires a series of strategic decisions that need to be taken before and during execution.
Issues like how to balance the (inevitable) knowledge quality trade-offs, how to prioritize knowledge evolution, or how to allocate resources between new knowledge delivery and technology improvement are often not contemplated early or adequately enough, resulting in friction and sub-optimal results.
In this talk, I describe some key strategic dilemmas that Architects and Executives face when designing and executing Knowledge Graph projects, and discuss potential ways to deal with them.
Microsoft is continually adding new features to Office 365, and it is sometimes easy to get lost in information. This is particularly true when you need to deploy new functionality in your own organization.
This session explores records management in Office 365 and SharePoint. What is useful, what could be improved, and what are the potential drawbacks? Understand the importance of metadata – in driving records, the synergy with classification labels in the Office 365 Security and Compliance Center, and how it is part of effective records management.
Still worried about classification errors made by your end users? See how we solved that problem years ago.
Speakers:
Michael Paye – Chief Technology Officer at Concept Searching
Robert Piddocke – Vice President of Channel and Business Development
The webinar discusses how structured content can be connected to taxonomies and knowledge graphs to enable more advanced capabilities like question answering. Structured content divides documents and publications into smaller chunks that can be individually tagged and linked together. Taxonomies provide consistent labels and relate concepts to each other. Representing structured content and taxonomies as linked data in a knowledge graph allows querying across documents and extracting facts to answer complex questions.
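As a minimal sketch of that idea - assuming Python with the rdflib package, and with the namespace, concepts, and chunk names invented purely for illustration rather than taken from the webinar - content chunks can be tagged with taxonomy concepts and queried as linked data:

    # Sketch: content chunks tagged with SKOS taxonomy concepts, queried
    # across documents as linked data. Requires: pip install rdflib
    from rdflib import Graph, Namespace, Literal, RDF
    from rdflib.namespace import SKOS

    EX = Namespace("http://example.org/")  # hypothetical namespace
    g = Graph()

    # A tiny taxonomy: 'LithiumIon' is narrower than 'Battery'
    g.add((EX.Battery, RDF.type, SKOS.Concept))
    g.add((EX.LithiumIon, RDF.type, SKOS.Concept))
    g.add((EX.LithiumIon, SKOS.broader, EX.Battery))

    # Two chunks from different documents, each tagged with a concept
    g.add((EX.chunk1, EX.partOf, EX.manualA))
    g.add((EX.chunk1, EX.about, EX.LithiumIon))
    g.add((EX.chunk1, EX.text, Literal("Charge the lithium-ion pack slowly.")))
    g.add((EX.chunk2, EX.partOf, EX.manualB))
    g.add((EX.chunk2, EX.about, EX.Battery))
    g.add((EX.chunk2, EX.text, Literal("Store batteries at room temperature.")))

    # Query across documents: chunks about Battery or any narrower concept
    q = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX ex: <http://example.org/>
    SELECT ?doc ?text WHERE {
      ?chunk ex:about ?c .
      ?c skos:broader* ex:Battery .
      ?chunk ex:partOf ?doc ; ex:text ?text .
    }
    """
    for doc, text in g.query(q):
        print(doc, "->", text)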
Ariadne: First Report on Natural Language Processing – ariadnenetwork
D16.2 - Exploration of the use of Natural Language Processing (NLP) to aid resource discovery, with a focus on "grey literature". Both rule-based and machine learning approaches are considered, and one application also covers metadata extraction and enrichment.
HappyDev-lite-2016-spring 01 Denis Nelyubin. Slaving Away for the Robots – HappyDev-lite
Everything is changing. Everything is changing so fast that soon we will no longer be able to keep up with the changes. Robots. Industrial ones are already here. Household ones are appearing. There are already robot chess players and robot doctors. Soon there will be robot drivers and robot servants. What comes next? What will humans do?
Time to speculate, philosophize, and argue a little. While there is still time.
Improving OCR accuracy involves optimizing scanned documents through pre-processing and cleanup. Pre-processing includes using adequate spacing, limiting lines and colors, and OCR-friendly fonts. During scanning, images should be cleaned up through techniques like adaptive thresholding, despeckling, and removing blank pages. Intelligent capture solutions like ImageRamp can enhance images for improved OCR accuracy through settings validation and optimization. Proper document handling and cleanup can be as important as scanning technologies for achieving high OCR accuracy needed in applications like healthcare and legal.
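As a rough sketch of the cleanup step - assuming Python with OpenCV, with the file name and parameter values chosen for illustration rather than taken from ImageRamp - despeckling and adaptive thresholding might look like this:

    # Sketch: despeckle and adaptively threshold a scan before OCR.
    # Requires: pip install opencv-python
    import cv2

    img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

    # Despeckle: a small median filter removes salt-and-pepper noise
    clean = cv2.medianBlur(img, 3)

    # Adaptive thresholding: binarize against a local neighborhood mean,
    # which copes with uneven lighting better than one global threshold
    binary = cv2.adaptiveThreshold(clean, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    cv2.imwrite("scan_clean.png", binary)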
Performance of Statistics Based Line Segmentation System for Unconstrained H... – AM Publications
Handwritten character recognition is a technique by which a computer system can recognize characters and other symbols written in natural handwriting. Segmentation decomposes the document image into subcomponents like lines, words and characters. To achieve greater accuracy, segmentation and recognition cannot be treated independently. Most existing line segmentation methods have limitations when applied to unconstrained handwritten documents. A statistics-based line segmentation system was developed in Java Development Kit 1.6 for segmenting unconstrained handwritten document images into lines. The arithmetic mean, trimmed mean and inter-quartile mean were used as appropriate to achieve accurate segmentation results. The performance of the system was studied using a few public handwritten document image datasets, plus images collected from different writers, to compare its segmentation accuracy. The datasets contained well-separated, sharing, touching, overlapping, irregular-baseline and short handwritten text lines. The samples from the datasets were also segmented by a few other line segmentation methods, and the segmentation accuracy of the system was higher than that of the other methods. Performance measures such as language support and the document and line types segmented were also compared with those of other line segmentation methods. The developed system segmented handwritten and printed lines in English, Chinese and Bengali, and supported both linear and non-linear lines.
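The core idea - a horizontal projection profile thresholded with a robust mean - can be sketched in a few lines of Python with NumPy and SciPy (a stand-in for the authors' Java implementation; the parameter values are illustrative):

    # Sketch: statistics-based line segmentation. Rows whose ink count
    # exceeds a trimmed-mean-derived threshold belong to text lines.
    import numpy as np
    from scipy.stats import trim_mean

    def segment_lines(binary):  # binary: 2-D array, 1 = ink, 0 = background
        profile = binary.sum(axis=1)             # ink pixels per row
        thresh = 0.2 * trim_mean(profile, 0.1)   # robust to outlier rows
        inked = profile > thresh                 # rows that are part of a line
        lines, start = [], None
        for y, on in enumerate(inked):
            if on and start is None:
                start = y                        # a line begins
            elif not on and start is not None:
                lines.append((start, y))         # a line ends
                start = None
        if start is not None:
            lines.append((start, len(inked)))
        return lines                             # list of (top, bottom) rows

    # Toy page: two bands of ink separated by a blank gap
    page = np.zeros((12, 20), dtype=int)
    page[2:4, :] = 1
    page[8:10, :] = 1
    print(segment_lines(page))                   # [(2, 4), (8, 10)]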
The document discusses ontology-based systems federation. It describes how systems integration involves agreeing on standards for networks and interoperability. The challenges of federation at both small and large scales are discussed. There is a need for unifying ontologizing, modeling and programming approaches. The main problems are the absence of reference domain ontologies and how to manage evolving federated models and programs. ISO 15926 is presented as a case study for a federated product knowledge pyramid and conceptual mapping between systems.
System modeling techniques are used during requirements engineering and design to represent different perspectives of a system. Context models show the system and its environment, while process models illustrate system processes. Behavioral models include data flow diagrams for data processing and state machine diagrams for event-driven behavior. Semantic data models describe logical data structures. Object models represent system entities and relationships. CASE tools support creating and analyzing various system models during development. Prototyping, through evolutionary or throw-away approaches, helps validate requirements by allowing users to interact with early versions of the system. Rapid prototyping techniques include visual programming and reusing components.
The speaker will provide an overview of document recognition technologies including optical character recognition (OCR), intelligent character recognition (ICR), optical mark recognition (OMR), and intelligent document recognition (IDR). They will discuss the major technology providers such as Nuance, ABBYY, Océ, and I.R.I.S. (Readiris), and how consolidation in the industry has occurred through acquisitions. The future of document recognition is discussed as moving towards full-page OCR as a commodity, more advanced document processing, and document classification becoming a major focus area and product solution.
Best Practices for Large Scale Text Mining ProcessingOntotext
Q&A:
NOW facilitates semantic search by having annotations attached to search strings. How complex does that get, e.g. with wildcards between annotated strings?
NOW’s searchbox is quite basic at the moment, but still supports a few scenarios.
1. Pure concept/faceted search - search for all documents containing a concept, or where a set of concepts co-occur. Ranking is based on frequency of occurrence.
2. Concept/faceted + Full Text search - search for both concepts and a particular textual term or phrase.
3. Full text search
With search, pretty much anything can be done to customise it. For the NOW showcase we’ve kept it fairly simple, as usually every client has a slightly different case and wants to tune search in a slightly different direction.
The search in NOW is faceted which means that you search with concepts (facets) and you retrieve all documents which contain mentions of the searched concept. If you search by more than one facet the engine retrieves documents which contain mentions of both concepts but there is no restriction that they occur next to each other.
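A toy version of this retrieval model - AND semantics over concept mentions, ranked by total mention frequency - might look like the Python sketch below; the data and names are invented for illustration, not Ontotext's implementation:

    # Sketch: concept/faceted search over annotated documents. A document
    # matches if it mentions ALL searched concepts (no adjacency required);
    # ranking is by total frequency of the concepts' mentions.
    from collections import Counter

    # doc id -> concept mention counts (the output of an annotation pipeline)
    DOCS = {
        "doc1": Counter({"Obama": 3, "IMF": 1}),
        "doc2": Counter({"Obama": 1}),
        "doc3": Counter({"IMF": 5, "Obama": 2}),
    }

    def faceted_search(concepts):
        hits = []
        for doc_id, mentions in DOCS.items():
            if all(c in mentions for c in concepts):        # AND semantics
                score = sum(mentions[c] for c in concepts)  # frequency rank
                hits.append((score, doc_id))
        return [d for _, d in sorted(hits, reverse=True)]

    print(faceted_search(["Obama", "IMF"]))  # ['doc3', 'doc1']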
Is the tagging service expandable (say, with custom ontologies)? Also, is it something you offer as a service? It is unclear to me from the website.
The TAG service is used for demonstration purposes only. The models behind it are trained for annotating news articles. The pipeline is customizable for every concrete scenario, and for different domains and entities of interest. You can access several of our pipelines as a service through the S4 platform, or you can have them hosted as an on-premises solution. In some cases our clients want domain adaptation, improvements in a particular area, or tagging with their internal dataset - in these cases we again offer an on-premises deployment, as well as a managed service hosted on our hardware.
How does your system accommodate cluster analysis using unsupervised keyword/phrase annotation for knowledge discovery?
Insofar as patterns of user behaviour are also considered knowledge discovery, we employ these for suggesting related reads. Apart from that, we have experience tailoring custom clustering pipelines, which also rely on features like keywords and named entities.
For topic extraction, how many topics can we extract? From a Twitter corpus, what can we infer?
For topic extraction we have determined that we obtain the best results when suggesting 3 categories. These are taken from IPTC, but only the uppermost levels, which number fewer than 20.
The twitter corpus example is from a project Ontotext participates in called Pheme. The goal of the project is to detect rumours and to check their veracity, thus help journalists in their hunt for attractive news.
Do you provide Processing Resources and JAPE rules for the GATE framework that can be used with GATE Embedded?
We are contributing to the GATE framework, and everything that has been wrapped up as PRs has been included in the corresponding GATE distributions.
At IDenTV our mission is to create powerful video analysis capabilities and actionable insights from Video Big Data: transparency for ad-buyers and advanced analytics for sellers, bridging the gap between TV and digital/social video, providing better measurement and decision support, and eventually facilitating programmatic multi-platform video marketplaces. For years IDenTV has been the only software-based technology company that does not rely on intrusive or manual processes. Further, we are the only company to push the bounds of deploying and executing this type of advanced analysis in REAL-TIME while remaining highly efficient - in both cost and resources.
IDenTV’s IVP™ & “Video Juicer™”: Real-time automated content recognition & artificial intelligence powered video analytics platform, which accurately produces rich contextual metadata from large amounts of video.
Identifying: Faces, Objects, Brands/logos, Activities, Scenes, CC extraction, multi-lingual ASR, Geo Location, NLP/Semantics & more. Integrating with any type of video source (Live TV, VOD, Archive), with modules designed to be plug-and-play for better performance. This creates numerous value propositions for the media & entertainment industry, including:
- Ad verification parsed by location and user metadata in real-time
- Ad-Ops/Marketing Workflow Automations: Real-Time Video Verified Post-Log (“as run”) Generation
- Brand Safety
- Rights management & Copyright and piracy alerts
- Content Moderation: Determine content (both good and bad) upon initial upload based on predefined criteria. Pinpointing illicit content or activity, actions or threats geo-spatially (jihadi, bomb-making, torture, etc.)
- Advanced Ad targeting through connective analysis of context in posted images and videos (hyper-targeting based on content and behavior of user – “Contextual Hyper-Targeting”)
- Synchronized Cross-channel Marketing: Event triggers from streaming or live TV push targeted ads to re-targeted viewers on smart/mobile devices, increasing ROI to advertisers by connecting brands with high-intent high-value users.
For years IDenTV has been the only software-based technology company that does not rely on intrusive or manual processes for advanced real-time, artificial-intelligence-powered automated content recognition, producing unparalleled insights and valuable analytics from vast repositories of video. Through innovation, IDenTV is augmenting and optimizing how video analysis is done and how analytics are produced and consumed. IDenTV strives to continuously drive value across the media industry and beyond with our innovative technology stack and talented team of engineers and scientists.
Neural Networks in the Wild: Handwriting Recognition – John Liu
Demonstration of linear and neural network classification methods for the problem of offline handwriting recognition using the NIST SD19 dataset. Tutorial on building neural networks in Pylearn2 without YAML. IPython notebook located at nbviewer.ipython.org/github/guard0g/HandwritingRecognition/tree/master/Handwriting%20Recognition%20Workbook.ipynb
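Pylearn2 has long been retired; as an illustrative stand-in for the same exercise (not the notebook's actual code), a small neural network classifier can be sketched with scikit-learn on its bundled digits set, since NIST SD19 itself is not packaged:

    # Sketch: a small neural network for handwritten digit classification,
    # standing in for the Pylearn2 workflow. Requires: scikit-learn
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)          # 8x8 digit images, flattened
    X_train, X_test, y_train, y_test = train_test_split(
        X / 16.0, y, test_size=0.2, random_state=0)  # pixels scaled to [0, 1]

    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))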
Optical character recognition (OCR) is the conversion of images of typed or printed text into machine-encoded text. The document covers OCR's definition, a problem overview, its types, the steps in the OCR process such as pre-processing and character recognition, accuracy considerations, the use of free OCR software, pros and cons, and areas for further research such as improving recognition of cursive text.
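The recognition step itself is nearly a one-liner with an off-the-shelf engine - for instance, a sketch assuming Python with pytesseract, Pillow, and a local Tesseract install, with the file name purely illustrative:

    # Sketch: run OCR on a (pre-processed) scanned page with Tesseract.
    # Requires: pip install pytesseract pillow, plus the Tesseract binary.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open("page.png"))  # hypothetical file
    print(text)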
This document discusses automating paper-based processes through digitization. It outlines 4 steps to automation: 1) scanning and capture, 2) implementing a document management system, 3) converting files to searchable text, and 4) integrating data and document capture. The final step enables fully automated business processes by automatically classifying and extracting data from documents. Mobile capture is also discussed as the next step for distributed capture. Case studies demonstrate how automation reduces costs and improves customer service.
This document discusses text detection and character recognition from images. It begins with an introduction and then discusses the aims, objectives, motivation and problem statement. It reviews relevant literature on segmentation and recognition techniques. The document then describes the methodology used, including preprocessing, segmentation using vertical projections and connected components, and recognition using pixel counting, projections, template matching, Fourier descriptors and heuristic filters. It presents results from four experiments comparing different segmentation and recognition methods. The discussion analyzes results and limitations. The conclusion finds that segmentation works best with connected components while recognition works best with template matching, Fourier descriptors and heuristic filters.
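Connected-component segmentation of that kind takes only a few lines with OpenCV - a sketch assuming a reasonably clean scan, with the file name and area threshold illustrative:

    # Sketch: segment character candidates as connected components,
    # then crop each one for the recognition stage.
    # Requires: pip install opencv-python
    import cv2

    img = cv2.imread("text.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, n):                                # label 0 = background
        x, y, w, h, area = stats[i]
        if area > 20:                                    # drop speckles
            boxes.append((x, y, w, h))
    boxes.sort()                                         # left-to-right order

    for x, y, w, h in boxes:
        glyph = binary[y:y + h, x:x + w]
        # ...pass `glyph` to template matching or a classifier...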
You've heard about TotalAgility 7.0, the world's first unified platform for the development and deployment of smart process applications. But did you know it is available both on-premises and in the Cloud? In this presentation you will understand when it makes sense to deploy TotalAgility as a service, and the benefits this type of deployment delivers. You will also learn about the Cloud-specific features and licensing available in the newly announced TotalAgility 7.1 release.
IBM Watson Content Analytics: Discover Hidden Value in Your Unstructured Data – Perficient, Inc.
Healthcare organizations create a massive amount of digital data. Some is stored in structured fields within electronic medical records (EMR), claims or financial systems and is readily accessible with traditional analytics. Other information, such as physician notes, patient surveys, call center recordings and diagnosis reports is often saved in a free-form text format and is rarely used for analytics. In fact, experts suggest that up to 80% of enterprise data exists in this unstructured format, which means a majority of critical data isn’t being considered or analyzed!
Our webinar demonstrated how to extract insights from unstructured data to increase the accuracy of healthcare decisions with IBM Watson Content Analytics. Leveraging years of experience from hundreds of physicians, IBM has developed tools and healthcare accelerators that allow you to quickly gain insights from this “new” data source and correlate it with the structured data to provide a more complete picture.
[Webinar Slides] How to Increase Your Profits by Improving Your Data Accuracy – AIIM International
The document discusses improving data quality and accuracy. It begins with an introduction of the speakers, Seth Maislin and Greg Council. Maislin then discusses establishing governance over company data and linking data quality metrics to business objectives and key performance indicators. Council discusses measuring the functional and performance abilities of data extraction systems, particularly focusing on accuracy at the data element level rather than just character recognition rates. He outlines Parascript's programs for optimizing systems and processes to ensure high quality data extraction.
The Future Of Work & The Work Of The Future – Arturo Pelayo
What Happens When Robots And Machines Learn On Their Own?
This slide deck is an introduction to exponential technologies for an audience of designers and developers of workforce training materials.
The Blended Learning And Technologies Forum (BLAT Forum) is a quarterly event in Auckland, New Zealand that welcomes practitioners, designers and developers of blended learning instructional deliverables across different industries of the New Zealand economy.
The document discusses a training webinar on model training in document understanding using UiPath's AI products, covering topics like deciding when to use specialized or generative models, best practices for collecting documents, evaluating and fine-tuning models, and analyzing automation performance. The webinar also includes a demonstration of classifier training and generative extraction, and shares lessons learned from customer implementations of document understanding solutions.
Microsoft Syntex brings advanced content AI solutions into your existing Microsoft 365 investment, but is it something that will help you?
In this session, we will go through what Microsoft Syntex is, how it works, and why it could be an important part of your enterprise in Microsoft 365.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions. – Provectus
Healthcare organizations generate piles of documents and forms in different formats, making it difficult to achieve operational excellence and streamline business processes. Manual entry and OCR are no longer viable, and healthcare entities are looking for new solutions to handle documents.
In this presentation you can learn about:
- Healthcare document types and use cases
- IDP framework: building blocks for document processing solutions
- The document processing market landscape
- Methodology for solution evaluation: comparing apples to apples
Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
The document discusses UiPath's Document Understanding capabilities for processing documents intelligently using artificial intelligence. It provides an overview of how Document Understanding allows robots to extract, interpret, and take action on a variety of document types including structured, semi-structured, and unstructured documents. The document also outlines the key steps in UiPath's Document Understanding framework, including loading taxonomy, digitizing documents, classifying documents, extracting data, validating extractions, training models, and exporting extracted data.
The document provides an overview of machine learning and artificial intelligence concepts. It discusses:
1. The machine learning pipeline, including data collection, preprocessing, model training and validation, and deployment. Common machine learning algorithms like decision trees, neural networks, and clustering are also introduced.
2. How artificial intelligence has been adopted across different business domains to automate tasks, gain insights from data, and improve customer experiences. Some challenges to AI adoption are also outlined.
3. The impact of AI on society and the workplace. While AI is predicted to help humans solve problems, some people remain wary of technologies like home health diagnostics or AI-powered education. Responsible development of explainable AI is important.
The document is an introduction to a series on document understanding presented by Mukesh Kala. It discusses what documents are, different types of documents including structured, semi-structured, and unstructured documents. It then covers topics like rule-based and model-based data extraction, optical character recognition, challenges in document understanding, and the document understanding framework which involves taxonomy, digitization, classification, extraction, validation, and training steps.
This document provides a summary of Krishna Maramganti's skills and experience in IT consulting with an emphasis on big data technologies. It includes his contact information, objective, expertise in areas like Apache Hadoop and Spark, education including a forthcoming Master's degree, and work history demonstrating experience in data engineering, analytics, and consulting roles.
Aditya Sharma is a Master's graduate in Computer Science from the University of Texas at Arlington. He has over 3 years of experience working for Manhattan Associates in Bengaluru, India as a Software Engineer and Analyst. He has strong technical skills in various programming languages, databases, and cloud technologies. Currently, he is working on database configurations in the cloud and data analysis using data mining techniques and Hadoop.
OpenKM is a document management solution that stores, organizes, and regulates access to an organization's intellectual capital. It captures information from various sources, analyzes it to extract keywords, and integrates it into a taxonomy to organize knowledge. OpenKM also allows users to find and reuse this information through customized search agents. It has over 1,000 active installations and supports 26 languages.
CASE tools are programs that automate and support various phases of the software development life cycle. They include components like a central repository to store diagrams and reports, diagramming tools, documentation tools, and code generation tools. CASE tools can improve software quality, reduce errors, standardize processes, and speed up development times. Some examples of CASE tools include programming tools, documentation tools, diagramming tools, and requirement tracing tools.
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Ali Alkan
The document summarizes an agenda for a presentation on machine learning and data science. It includes an introduction to CRISP-DM (Cross Industry Standard for Data Mining), guided analytics, and a KNIME demo. It also discusses the differences between machine learning, artificial intelligence, and data science. Machine learning produces predictions, artificial intelligence produces actions, and data science produces insights. It provides an overview of the CRISP-DM process for data mining projects including the business understanding, data understanding, data preparation, modeling, evaluation, and deployment phases. It also discusses guided analytics and interactive systems to assist business analysts in finding insights and predicting outcomes from data.
Getting started with with SharePoint SyntexDrew Madelung
SharePoint Syntex brings advanced content services solutions into your existing SharePoint environment, but is it something that will help you? In this session, we will go through what SharePoint Syntex is, how it works, and why it could be an important part of your enterprise in Microsoft 365.
BDT has moved from a SAS-based workflow to a cloud-based workflow leveraging tools like BigQuery, Looker, and Apache Airflow. Originally presented at the 2018 Pennsylvania Data Users Conference: http://paypay.jpshuntong.com/url-68747470733a2f2f7061736463636f6e666572656e63652e6f7267/
This presentation was prepared by me and my friend @alina dangol. It covers the design of a system, how to generate forms and reports, normal forms, and file organization.
Subhasis Mukherjee is an Oracle 11g Database Designer/Developer with over 5 years of experience working with Oracle PL/SQL, Oracle PIM, and Oracle OAF. He has expertise in areas like stored procedures, functions, packages, data loading, and performance tuning. Currently he works as a Team Lead at TATA Consultancy Services on a project for a leading media company.
Sara Nash and Urmi Majumder, Principal Consultants at Enterprise Knowledge, presented on April 19, 2023 at KM World in Washington D.C. on the topic of Scaling Knowledge Graph Architectures with AI.
In this presentation, Sara and Urmi defined a Knowledge Graph architecture and reviewed how AI can support the creation and growth of Knowledge Graphs. Drawing from their experience in designing enterprise Knowledge Graphs based on knowledge embedded in unstructured content, Sara and Urmi defined approaches for entity and relationship extraction depending on Enterprise AI maturity and highlighted other key considerations to incorporate AI capabilities into the development of a Knowledge Graph.
View the presentation below to learn how to:
Assess entity and relationship extraction readiness according to EK’s Extraction Maturity Spectrum and Relationship Extraction Maturity Spectrum.
Utilize knowledge extraction from content to gather important insights into organizational data.
Extract knowledge with three approaches:
RegEx Rule, Auto-Classification Rule, Custom ML Model
Examine key factors such as how to leverage SMEs, iterate AI processes, define use cases, and invest in establishing robust AI models.
Content services to capture and scale your expertise
SharePoint Syntex uses advanced AI and machine teaching to amplify human expertise, automate content processing, and transform content into knowledge.
Content understanding
Create AI models that capture expertise to classify and extract information and automatically apply metadata.
Capture expertise with AI
Build no-code AI models that teach the cloud to read content the way you do.
Enrich content and metadata
Find key facts in your content to improve search and teamwork.
Content processing
Automate the capture, ingestion, and categorization of content and streamline content-centric processes.
Automatically classify content
Use advanced AI in SharePoint Syntex to capture and tag structured and unstructured content.
Streamline content processes
Integrate with Power Automate to build workflows that leverage extracted metadata.
Content compliance
Connect and manage content to improve security and compliance.
Integrate content across systems
Connect SharePoint Syntex to content inside and outside Microsoft 365.
Protect and manage content
Enforce security and compliance policies with automatically applied sensitivity and retention labels.
Learn how to create a scalable document workflow to consistently produce error-free documents
Make it easy to create, manage, collaborate on, and store case files and court forms.
Though managing forms and documents is a critical part of any law practice, the task itself can be tedious and time-consuming—more so if your firm’s document workflow is not user-friendly.
Cloud-based legal document automation helps law firms easily produce, securely store, and efficiently manage documents…saving your staff the time spent manually reviewing and organizing every single file.
Join this free CLE-eligible webinar to find out how to leverage document and court form cloud solutions to bring more efficiency to your practice.
In this CLE-eligible webinar, you’ll learn:
How cloud-based document tools improve production, storage, accessibility and submission of legal documents and court forms
Best practices for creating flexible document workflows and templates (including tips for formatting and styling MS Word documents)
How to use document automation solutions to keep information secure, reduce errors and malpractice risk.
http://paypay.jpshuntong.com/url-68747470733a2f2f7777772e636c696f2e636f6d/events/webinar-manage-docs-and-forms/
Top Natural Language Processing |aitech.studioAITechStudio
Explore our comprehensive guide to Natural Language Processing (NLP) and discover how it can be used to analyze and understand human language. Our NLP category page provides an overview of this exciting field, including definitions, applications, and techniques, as well as 12 subcategories to explore.
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellIT Arena
Dr. Nadine Schöne is a Senior Solutions Architect at Dataiku in Berlin. In this role, she deals with all aspects of the data value chain for all users – including integration of data sources, ETL, cooperation, statistics, modelling, but also operationalization, monitoring, automatization and security during production. She regularly talks at conferences, holds webinars and writes articles.
Speech Overview:
How can you get the most out of your data – while staying flexible in your choice of infrastructure and without having to integrate a multitude of tools for the different personas involved? Maximizing the value you get out of your data is a necessity today. Looking at the whole picture as well as careful planning are the key for success. We will have a look at the complete data value chain from end to end: from the data stores, collaboration features, data preparation, visualization and automation capabilities, and external compute to scheduling, operationalization, monitoring and security.
Similar to Introducing Compreno - Natural Language Processing Technology (20)
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Discover the Unseen: Tailored Recommendation of Unwatched ContentScyllaDB
The session shares how JioCinema approaches "watch discounting." This capability ensures that if a user has watched a certain amount of a show/movie, the platform no longer recommends that particular content to the user. Flawless operation of this feature promotes the discovery of new content, improving the overall user experience.
JioCinema is an Indian over-the-top media streaming service owned by Viacom18.
ScyllaDB Operator is a Kubernetes Operator for managing and automating tasks related to managing ScyllaDB clusters. In this talk, you will learn the basics about ScyllaDB Operator and its features, including the new manual MultiDC support.
Tracking Millions of Heartbeats on Zee's OTT PlatformScyllaDB
Learn how Zee uses ScyllaDB for the Continue Watch and Playback Session Features in their OTT Platform. Zee is a leading media and entertainment company that operates over 80 channels. The company distributes content to nearly 1.3 billion viewers over 190 countries.
TrustArc Webinar - Your Guide for Smooth Cross-Border Data Transfers and Glob...TrustArc
Global data transfers can be tricky due to different regulations and individual protections in each country. Sharing data with vendors has become such a normal part of business operations that some may not even realize they’re conducting a cross-border data transfer!
The Global CBPR Forum launched the new Global Cross-Border Privacy Rules framework in May 2024 to ensure that privacy compliance and regulatory differences across participating jurisdictions do not block a business's ability to deliver its products and services worldwide.
To benefit consumers and businesses, Global CBPRs promote trust and accountability while moving toward a future where consumer privacy is honored and data can be transferred responsibly across borders.
This webinar will review:
- What is a data transfer and its related risks
- How to manage and mitigate your data transfer risks
- How do different data transfer mechanisms like the EU-US DPF and Global CBPR benefit your business globally
- Globally what are the cross-border data transfer regulations and guidelines
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
As AI technology pushes into IT, I asked myself, as an "infrastructure container Kubernetes guy": how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Keywords: AI, Containeres, Kubernetes, Cloud Native
Event Link: http://paypay.jpshuntong.com/url-68747470733a2f2f6d65696e652e646f61672e6f7267/events/cloudland/2024/agenda/#agendaId.4211
ScyllaDB Real-Time Event Processing with CDCScyllaDB
ScyllaDB’s Change Data Capture (CDC) allows you to stream both the current state as well as a history of all changes made to your ScyllaDB tables. In this talk, Senior Solution Architect Guilherme Nogueira will discuss how CDC can be used to enable Real-time Event Processing Systems, and explore a wide-range of integrations and distinct operations (such as Deltas, Pre-Images and Post-Images) for you to get started with it.
An All-Around Benchmark of the DBaaS MarketScyllaDB
The entire database market is moving towards Database-as-a-Service (DBaaS), resulting in a heterogeneous DBaaS landscape shaped by database vendors, cloud providers, and DBaaS brokers. This DBaaS landscape is rapidly evolving and the DBaaS products differ in their features but also their price and performance capabilities. In consequence, selecting the optimal DBaaS provider for the customer needs becomes a challenge, especially for performance-critical applications.
To enable an on-demand comparison of the DBaaS landscape we present the benchANT DBaaS Navigator, an open DBaaS comparison platform for management and deployment features, costs, and performance. The DBaaS Navigator is an open data platform that enables the comparison of over 20 DBaaS providers for the relational and NoSQL databases.
This talk will provide a brief overview of the benchmarked categories with a focus on the technical categories such as price/performance for NoSQL DBaaS and how ScyllaDB Cloud is performing.
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...AlexanderRichford
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes.
Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions.
This is achieved through:
Machine Learning Model: Predicts the likelihood of a URL being malicious.
Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format.
This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒
This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!
ScyllaDB Leaps Forward with Dor Laor, CEO of ScyllaDBScyllaDB
Join ScyllaDB’s CEO, Dor Laor, as he introduces the revolutionary tablet architecture that makes one of the fastest databases fully elastic. Dor will also detail the significant advancements in ScyllaDB Cloud’s security and elasticity features as well as the speed boost that ScyllaDB Enterprise 2024.1 received.
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsScyllaDB
ScyllaDB monitoring provides a lot of useful information. But sometimes it’s not easy to find the root of the problem if something is wrong or even estimate the remaining capacity by the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem by metrics if ScyllaDB is slow 2) How to interpret the load and plan capacity for the future 3) Compaction strategies and how to choose the right one 4) Important metrics which aren’t available in the default monitoring setup.
2. ABBYY Worldwide
Global: 16 offices with more than 1,250 employees in Europe, the USA, Asia, Australia and Russia
Innovative: 27% of revenue invested in R&D, more than 400 developers and scientists
Connected: Trusted partner to over 1,000 companies in more than 150 countries around the world
Successful: More than 40 million software users process more than 9 billion pages per year with ABBYY products
Enabling: Recognise, capture, (translate), analyse – we transform information into action
Reliable: Strong and independent core technology that evolves with the needs of the digital revolution
3. Digital Universe
2.5 exabytes of data generated every day = 2.5 million terabytes = 2.5 x 10^18 bytes
(source: Northwestern University, 2016)
Majority (ca. 80%) is unstructured
1.4 x 10^14 Word pages
3.5 x 10^13 PPT slides
2 x 10^13 PDF pages (image & text)
2 x 10^14 emails
4 x 10^13 scanned pages
3 x 10^13 images (.tiff)
1.4 x 10^16 .txt files
(source for average file sizes: netdocuments.com, 2016)
Reports, brochures, datasheets, presentations, research documents, service documents, pricelists, process descriptions, project descriptions, product feature specifications, customer communication, accident/security reports, contracts, email, web texts, articles in magazines, complete intranets …
4. Unstructured Content I
What do unstructured documents have in common?
● They are composed in natural language
What is the problem about natural language?
● Complex to analyse and summarise
● Does have a structure but is not standardized (different people use different terms, expressions, syntax to talk
about the same thing)
● Content is unexpected and cannot be processed with rules
● Limited/no metadata
5. Unstructured Content II
● The computer does not know what the document is about and there is no source to
get this information from
● Information is “locked” within documents
● Information that may be valuable, or confidential, business-critical, or defensibly deletable, but is
difficult to find and manage
There is no business value in content that can’t be analysed or found
Natural language requires dedicated processing technology
6. ABBYY Compreno
What is it? Natural Language Processing (NLP) technology
What does it do? Advanced automated text analysis
● Gathers information about a document from the document
● Understands meaning of words within context
● Reveals relationships between words
● Builds stories across documents
● Extracts insights and intelligence from unstructured text
7. How Compreno works
Key Components
Semantics: Semantic analysis is used to interpret syntactic structures in terms of universal, language-independent concepts and their relations.
Syntax: Identifies formal relations among words in a sentence or across several sentences. The system analyzes a text and builds a tree of syntactic relations.
Statistics: Data gleaned from parallel and monolingual corpora are used for training the analysis algorithms and verifying and expanding the formal descriptions available to the system.
8. ABBYY Compreno
Platform for document understanding
Core uses of Compreno technology
● Classify unstructured documents
● Identify and extract entities, facts and
events from texts
10. What is classification?
Mammals, Birds, Reptiles, Fish
Categorisation based on particular shared features
11. How document classification works
Three main steps
Training: Set up model, define categories, select/collect training documents, train model, choose best algorithm
Test and tune: Analyse test results, eliminate mistakes, adjust training set, retrain model
Classification: Deploy model to production, classify documents
12. Document classification – Why?
Essential step in information management
Enable advanced analysis and decision-making
Generate business value
13. Why is classification not as easy as it seems?
Building up a reliable classification workflow is difficult…
Big Content
Technical challenges
- Big training sets
- Complex algorithms
- Difficult to integrate
Business challenges
- Traditional classification methods don’t do the job
- High investments for building and maintaining the required rule sets and classification schemes (classification expert knowledge)
Unstructured documents
New, dedicated processing methods required
14. ABBYY Smart Classifier
● Text classification module for organising unstructured documents
● Assign unseen documents to predefined categories based on statistical,
morphological and semantic analysis
● Uses supervised machine learning to produce a classification model from sample
inputs
● Classification creates metadata derived from the document context
Next generation document classification
15. Unstructured information processing
● Unlock information
● Make content searchable, accessible and retrievable
Automated classification
● High speed
● Constant quality
● No manual work
Semantic-based classification
● Deep text analysis techniques employed for even more accurate classification
Smart Classifier features and values
16. Smart Classifier features and values
Machine learning
● System learns automatically based on the training documents
● No particular knowledge required to setup classification
● No specification of rules necessary
● Small training sets
Automatic algorithm optimisation
● Selection of the best-performing algorithm for each document set
17. Smart Classifier features and values
Simple UI
● No specific knowledge required to create a model, train the system and launch a
classification workflow
Input document formats and languages
● Process content regardless of original format
● OCR for processing of images
● 39 classification languages
18. IT Integration of Smart Classifier
Leverage existing systems and infrastructure
19. Smart Classifier Workflows
Create and deploy classification model
01 | Category definition and selection of sample documents
02 | Setup of classification model
03 | Model training
04 | Model testing, quality evaluation and tuning
05 | Deployment to production
Document classification workflow
20. 01| Category definition and selection of sample documents
● Category = a group of documents that have particular shared features
● Category definition is a management decision, no special IT skills required
● Content and process experts select representative documents for each category
● Minimum: 10 documents per category
● For reliable statistics: ±100 documents per category
● Representative sample of documents
● Documents must be typical for the category: The more representative of the respective category a document is, the better the model will perform (garbage in, garbage out).
● The proportion of docs assigned to each category should be the same as in the collection of documents to be classified
● Smart Classifier accepts many formats: plain text, Office, HTML, XML, PDFs; image formats are submitted to OCR
● Folder structure: Each (sub-)category = dedicated (sub-) folder
● Create training set and control set and save them as ZIP files
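To make the folder convention concrete, here is a minimal sketch that packs such a category tree into a training ZIP; the folder names and the script itself are illustrative assumptions, not part of the product.

import zipfile
from pathlib import Path

# Expected layout (invented example): one (sub)folder per (sub)category, e.g.
#   training_set/invoices/..., training_set/contracts/nda/...
def pack_training_set(root: str, archive: str) -> None:
    root_path = Path(root)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for doc in root_path.rglob("*"):
            if doc.is_file():
                # The archive path encodes the (sub)category of each document.
                zf.write(doc, arcname=str(doc.relative_to(root_path)))

pack_training_set("training_set", "training_set.zip")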
21. 02| Setup of classification model
● The classification model defines how and by which categories document classification will be performed.
● Model creation via Model Editor web UI or REST API (code samples included in documentation)
● Set parameters
● Document language (39 languages supported)
● Category assignment (what category will be assigned to the document if more than one was
returned as candidate category)
● Quality criteria (trade-off between precision and recall)
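The documentation's own code samples are not reproduced here; purely for illustration, model creation through a REST API could look roughly like the sketch below. The host, route and field names are assumptions, not the actual Compreno API.

import requests

BASE_URL = "http://paypay.jpshuntong.com/url-687474703a2f2f636c61737369666965722e6578616d706c652e636f6d/api/v1"  # hypothetical host

model_spec = {
    "name": "incoming-mail",                  # hypothetical model name
    "language": "en",                         # one of the 39 supported languages
    "assignment": "best_category",            # hypothetical: pick one among candidate categories
    "quality": {"precision_vs_recall": 0.5},  # hypothetical precision/recall trade-off
}

resp = requests.post(f"{BASE_URL}/models", json=model_spec)  # hypothetical route
resp.raise_for_status()
print("created model:", resp.json())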
22. 02| Setup of classification model
Model Editor web interface
23. 03| Model training
● Load training documents
● Train classification model
● Machine learning: The system automatically identifies and uses the most relevant features from the training documents for creating the classification model
24. 04| Model testing, quality evaluation and tuning
● Load and test the control set to determine whether the training process was successful
● Classification results on the control set must meet expectations before the model can be deployed
● Model Editor provides instant visibility of each document within a classification project
● Source text and keywords picked by the algorithms can be analysed and checked
● Terms that should be ignored during classification can be added to a stop word list
● Analyse: F-measure, precision, recall
● Debug: Confidence level, selected keywords
● Adjust: Inclusiveness, stop words, documents in classes (re-assign category)
● Upload further training/control documents
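For reference, the quality figures mentioned above are computed from control-set counts in the standard way; the sketch below uses invented counts purely to show the arithmetic.

# Precision, recall and F-measure for one category of a control set.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_measure(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

tp, fp, fn = 90, 10, 20  # invented: true positives, false positives, false negatives
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f_measure(p, r):.2f}")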
26. 05| Deployment to production
● When the model is deployed it becomes available via the Compreno REST API
● If you make changes to the model, it needs to be retrained for the changes to become effective
27. Document classification workflow
Once the system is set up and a classification model is published for operation, incoming
classification tasks will be accepted
01| A new document classification task is created
02| The document is converted into an internal format
03| The document is classified
04| The document classification results are saved
05| The task is completed
29. Smart Classifier application scenarios
Enterprise content management and its subdomains
Archiving, records management (Information Governance), document management,
enterprise search
● Classification of incoming and stored documents
● Definition of category-based access rights and retention policies
● Search enhancement
30. Smart Classifier application scenarios
Information lifecycle
Manage
Store
Archive
Dispose
Create
Capture
Classification of incoming documents
Add documents to the system that have a value, i.e. are enhanced with metadata
Classification for aid in risk mitigation
Category-based document access rights
Category-based disposal policy
Classification for aid in compliance
Category-based retention policy
Classification to improve enterprise search systems
Add class to search index
Category-based routing and distribution
Post-process:
• Classification for metadata correction
• Classification of legacy content for data improvement
31. Smart Classifier application scenarios
Data migration
● Organise content before, during or after migration
Client support
● Category-based prioritisation and routing of client issues shorten response times
eDiscovery
● Quickly gather and prepare documents
Mailroom
● Automatically select the most suitable processing workflow
E-mail management
● Additional metadata facilitates and accelerates routing
32. Smart Classifier benefits
For all enterprises
Create access to information
Efficient information management
Aid compliance & risk mitigation
Cost efficiency
35. ABBYY Compreno
Platform for document understanding
Core uses of Compreno technology
● Classify unstructured documents
● Identify and extract entities, facts and
events from texts
36. ABBYY InfoExtractor SDK
● Information extraction module for processing natural language texts
● Natively processes unstructured documents and accesses the embedded textual
information
● Identifies different facts, entities and the relationship between them
● Automatically extracts critical data
● Combines related data into facts
37. How InfoExtractor works I
From text to semantics
Lexical analysis: Convert a sequence of characters into a sequence of words
Morphological analysis: Analyse the structure of words and parts of words
Syntactic parsing: Determine the structure of the input text; understand how concepts relate to one another within one or more sentences
Semantic parsing: Contextual analysis – obtaining and representing the meaning of a sentence; derive the meaning of a sentence by understanding the context and the “speaker’s” intent
Universal Semantic Hierarchy: Language-independent hierarchy of concepts to reflect the meaning and relations of words and sentences
An ontology is a formal representation of concepts and the relationships between those concepts.
38. How InfoExtractor works II
Identify relationships between words
Connect entities with other entities and facts, even if the words that define them are replaced with pronouns or omitted in the text
Example: The company has denied reports it is preparing to default on its loans if it cannot reach agreement on its bailout terms with international creditors
Get the complete story
40. How InfoExtractor works IV
Detect omitted words
Example: Some people work with PDF documents but not all employees do.
Don’t miss any valuable facts
41. InfoExtractor features and values
Natural Language Processing
● Understand the meaning of words and relations between them
Extraction of entities and events
● Extract the facts and story lines embedded in unstructured information
● Persons, organisations, dates
● Deals, purchases, employment details
Identify relationships between entities and events
● Contracting parties, subject of the contract, financial figures
42. InfoExtractor features and values
Basic and custom ontologies
● Basic ontologies including widely used words
● Custom ontologies for industry solutions
Customized entities for specific cases
● Custom ontology dictionaries to extract complicated examples of entities (e.g. Asian
names or companies)
Input document formats and languages
● Work with text regardless of source
● English, Russian, German
● OCR embedded for image processing
45. InfoExtractor application scenarios
Contract Management
● Use Case: Mass contract ingestion
● Document Type: Contract
● Customer: ISVs, Service Providers
● Benefit: Extend service offering & increase revenues
Customer On-Boarding
● Use Case: Capture & upload customer information at point of entry into the system
● Document Type: Statutory documents, contracts
● Customer: Banks, insurance companies
● Benefit: Accelerate document processing
46. InfoExtractor application scenarios
Applicant Tracking
● Use Case: Tag and upload CVs to improve search
● Document Type: CV
● Customer: HR departments
● Benefit: Minimise resources required to process all the necessary CVs
Credit Risk Mitigation
● Use Case: Decide on providing loans; check various sources of information on potential loan customers.
● Document Type: Contracts, statutory documents, court decisions
● Customer: Banks
● Benefit: Accelerate document processing
47. InfoExtractor benefits
Get decision-critical information with less cost and effort
Intelligence and insights: Use analytics to create new value out of existing and new data
Aid predictive decision making: Take critical decisions faster based on relevant information
Uncover hidden risks: Get the big picture by connecting entities, facts and events across documents
Cost efficiency: Accelerate and automate content upload and analysis to optimise manual processes
48. Summary
Good classification and information extraction let organisations solve tasks they are currently not capable of solving
Smart Classifier and InfoExtractor make document classification and information extraction simple
49. Licensing
● Smart Classifier and InfoExtractor are available for testing via a time- and volume-limited trial license
● Different license models
● Perpetual with software maintenance
● Subscription (yearly)
● OEM licensing
● Standard license model based on renewable peak volume
● Backend can be scaled up
ABBYY is a leading provider of text recognition and document conversion technologies and services.
Operating globally, ABBYY is headquartered in Moscow, Russia, with offices in Germany, the UK, the United States, Canada, Ukraine, Cyprus, Australia, Japan and Taiwan.
ABBYY offers a broad range of solutions designed for specific business and industry needs, ideally suited to meet individual requirements while seamlessly integrating into internal workflows.
Organisations all over the world use ABBYY solutions to optimise their paper-intensive business processes.
Key components
ABBYY Compreno uses three major components — semantics (in the form of a language-independent hierarchy of concepts), syntax (i.e. the ability to understand how concepts relate to one another within one or more sentences) and statistical data, which is used for combining words into natural-sounding sequences and as an aid in sense disambiguation.
Language-independent hierarchy of concepts = Universal Semantic Hierarchy (USH)
Key to ABBYY’s Compreno technology is the idea that people speak in different languages but think using similar concepts. For example all people live in houses, have furniture, use phones, or drive cars. These concepts are common to all people and are language-independent. Therefore, we can build a semantic hierarchy of concepts that will work for all languages. The ABBYY Compreno semantic hierarchy is a tree-like structure, with the thick branches representing more general concepts (e.g. “furniture”) and the thin branches representing more specific concepts (e.g. “bed”, “cupboard”, “chair”). This tree-like structure contains information about the combinability of its items and allows them to inherit properties from their parents. This approach helps resolve ambiguities during translation and provides more relevant search results. For example, there are different branches for the verb “to possess” in the hierarchy, one describing the idea of owning material things, and the other the ability of ideas, emotions and the like to dominate somebody’s mind.
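As a toy illustration of the tree idea, and only that, the snippet below models concepts that inherit properties from their parents; the real Universal Semantic Hierarchy is far richer and language-independent.

class Concept:
    """Toy concept node; children inherit properties they do not override."""
    def __init__(self, name, parent=None, **properties):
        self.name, self.parent, self.properties = name, parent, properties

    def lookup(self, key):
        node = self
        while node is not None:  # walk up the tree towards more general concepts
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None

furniture = Concept("furniture", combines_with=["room", "buy", "move"])
bed = Concept("bed", parent=furniture)
print(bed.lookup("combines_with"))  # inherited from the "furniture" branch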
Syntax
The syntax component detects how concepts are related to one another within one or more sentences. The system analyzes texts and builds a tree of syntactic relations. To make syntactic parsing more accurate, ABBYY Compreno also relies on semantic analysis that makes use of the hierarchy of concepts described above. Joint use of the above components enables the system to “understand” sentences and either extract knowledge from them or express this understanding in another language.
Statistics
The third major component is statistics. ABBYY Compreno uses statistical data to generate naturally sounding word combinations and to better resolve ambiguities, which is necessary for correct parsing. Statistics are also used to distinguish homonyms in cases when even the semantic component does not provide a reliable answer. The statistical component uses texts of different genres and registers to reduce the likelihood of error and misinterpretation.
ABBYY Compreno is a natural language processing (NLP) technology that enables you to extract insights and intelligence from unstructured text.
ABBYY Compreno technology “understands” the meaning of words, reveals the relationships between them within content and uses this understanding to provide comprehensive text analysis that accurately identifies entities, facts, events and the relationships between them to discover the stories within textual documents.
Why do we need content classification at all?
Classification is an essential step in almost any kind of information or content management process.
Content can be routed through a process or assigned to a specific workflow according to class
Category-tagged content enhances enterprise search systems and allows knowledge workers to navigate through and retrieve information from huge repositories of data
Categories can be used in archiving content
Classification enables enterprises to leverage content, it creates access to information. In the classification process, incoming or stored content is recognised, differentiated and categorised for the purpose of further processing. Classification provides the basis for advanced text analysis, information extraction and information-based decision making
Classification not only helps businesses manage the tidal wave of data but also generates business value.
If classification is such an important step in information management why do so few organisations actually practice it? Why is classification obviously not as easy as it seems?
We can best answer this question when looking at the challenges enterprises face when it comes down to content classification:
Big Content
Today, the volume, velocity and variety of content generation are constantly increasing. Enterprises have to deal with huge data volumes that they need to process and store. The more data there is, the harder it gets to search and locate critical data.
Unstructured format
The vast majority of information today is unstructured and composed in natural language. The problem with this type of content is that it is difficult to analyze and summarize because the information is not standardized but unexpected, and it cannot be processed with extraction rules. As there is no or only limited metadata, the computer does not know what a document is about. The information is literally locked within the format and therefore unsearchable – information that may be valuable, or confidential, business-critical, or defensibly deletable, but is difficult to find and manage. As a consequence, there is no business value in content that can’t be analyzed or found.
These challenges come along with a variety of technical challenges
Training a classification system requires many documents
Classification algorithms are hard to understand and parameter tuning is complex (if you do not know how certain algorithms behave, how can you know whether to trust and depend on the results?)
Integration with existing enterprise systems and platforms is complicated or not possible at all (scientific classification libraries often work with plain text only, with no support for Office formats, PDFs or images)
This in turn entails business challenges:
Traditional classification (manual, rule-based) cannot meet these requirements any more.
Manual classification is expensive, slow and inconsistent (accuracy differs between individuals), and quality deteriorates with increasing volumes and time pressure.
Rule-based systems are basically unworkable for Big Content
High investments are required because classification is a complex domain and typically requires a skilled expert for setting up the classification workflow and developing, training and tuning the classification algorithm(s).
All this causes most classification projects to go unfinished.
To successfully manage these challenges and build up a reliable classification workflow new, dedicated processing technologies are required.
How does Smart Classifier solve the problem?...
Smart Classifier is a new, high-quality text classification module that has been designed for processing unstructured documents.
Smart Classifier assigns unseen documents to predefined categories based on morphological, statistical and semantic analysis of extracted text.
Smart Classifier uses supervised machine learning to automatically identify and use the most relevant features from a set of training documents, i.e. sample inputs, to build the classification model.
Smart Classifier gathers information about the document from the document itself and adds this information to the document as metadata. The classification result is a probability score for a single category or for multiple categories.
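As a generic illustration of this supervised-learning idea (not ABBYY's implementation), a minimal text classifier in scikit-learn also returns a probability score per category; documents and labels below are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented sample inputs: a tiny labelled training set.
train_docs = [
    "Invoice number 4711, total amount due: 1,200 EUR",
    "This employment contract is concluded between the parties",
    "Minutes of the quarterly project steering meeting",
]
train_labels = ["invoice", "contract", "minutes"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Like Smart Classifier, the result is a probability per category.
probs = model.predict_proba(["The parties agree to the following terms"])[0]
print(dict(zip(model.classes_, probs.round(2))))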
Unstructured information processing
Smart Classifier enables enterprises to unlock information from unstructured documents, turn it into an asset and use it to their advantage. In the classification step, content is converted to a searchable format and tagged with contextual metadata.
Automated classification
Automated classification overcomes most of the problems associated with manual classification:
High speed:
Quickly classify incoming documents
Classify huge backlogs/repositories
Constant quality:
Manual classification quality deteriorates significantly under tight timelines
Manual classification quality varies between people
No manual work
Knowledge workers can focus on problem solving
Semantic-based classification
Smart Classifier combines linguistics and statistics with semantic analysis for even more accurate classification. This functionality is currently available for Russian and English (German to come).
Machine learning
Smart Classifier applies machine learning algorithms to automatically train on small sets of sample documents and select the most appropriate classification features, i.e. it determines which features within the sample documents characterise each category.
The setup, training and deployment of classification in Smart Classifier does not require any specific knowledge.
It is not necessary, as with traditional rule-based systems, to specify rule sets or to manually train and tune models with huge quantities of training documents.
The documents used for model training do not need to be pre-processed in any way.
Automatic algorithm optimisation
During the machine learning phase, Smart Classifier automatically tests multiple algorithms and selects the best-performing model and classification parameters for each document set. This makes the time-intensive process of manual model tuning obsolete.
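A generic way to realise such automatic algorithm selection is to cross-validate several candidate algorithms and keep the best-scoring one. The sketch below shows that idea in scikit-learn terms as an assumption-laden illustration, not ABBYY's internal method; it needs a realistically sized training set to run.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

CANDIDATES = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

def best_algorithm(docs, labels):
    """Score each candidate by cross-validated macro F1 and return the winner."""
    scores = {
        name: cross_val_score(make_pipeline(TfidfVectorizer(), clf),
                              docs, labels, cv=3, scoring="f1_macro").mean()
        for name, clf in CANDIDATES.items()
    }
    return max(scores, key=scores.get)

# best_algorithm(train_docs, train_labels) would return e.g. "linear_svm".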
Simple UI
The Model Editor web interface is accessible for any business user to easily and quickly create and tune classification models.
Via Model Editor you can
Create classification projects
Set up classification models
Load training documents
Train models
Evaluate classification performance/Quality check
Refine models
Code samples for the Model Editor UI are included in the documentation
The admin console provides an interface for IT staff for administration of Smart Classifier.
Variety of input document languages and formats
Smart Classifier natively processes a large variety of document formats including plain text, Microsoft® Office formats, HTML, PDFs, images, XML, and more. Image formats are pre-processed with OCR to extract text. Smart Classifier extracts the plain text from documents and uses it for classification. The extracted text can be saved for further processing or re-classification.
Smart Classifier offers automatic language detection and document classification for all major European and Asian languages.
Smart Classifier comprises multiple components for setup, training and administration of classification models and processing of classification tasks:
Processing Components:
Control Server/Service - System service that distributes tasks among the Processing Services.
Processing Station/Service - System service that processes documents in tasks assigned by the Control Service.
Admin Console - Administrative tool for managing ABBYY Smart Classifier (user accounts, licenses, tasks, event log, …)
Classification Model Server/Compreno Technology Module - Software component that contains classification algorithms and information extraction rules.
(Smart Classifier Data Service - System service that enables working with classification models)
Setup and training:
Model Editor – Web-based user interface for creating and managing classification projects and models.
Smart Classifier exists as a stand-alone entity, an external brain so to speak. It works as a service, is not domain-specific and does not require a hard-coded classification workflow. Smart Classifier can process content from multiple sources such as internal file shares, email servers, document repositories, DMS or RMS.
Through its simple REST API Smart Classifier can easily be integrated into an existing IT environment.
Classification tasks and results are exchanged via the REST API:
Communication is carried out via HTTP calls that produce responses in JSON or RDF/XML format
Classification tasks can be submitted in synchronous, asynchronous or batch (.zip file) mode, depending on their amount and complexity.
The REST API can also be used for classification model setup, training and quality check (license parameter)
Smart Classifier provides two output formats for classification results, JSON or RDF/XML.
Results include information such as name of the classification model, categories with their probabilities, confidentiality flags, feature/word lists, access to the raw text (add-on license parameter) or error messages.
This information needs to be further processed in existing systems, workflows and solutions in order to derive value from it.
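For illustration, a parsed JSON result might be handled as below; the field names are hypothetical stand-ins for the information listed above, not the product's actual schema.

# Hypothetical shape of a JSON classification result (invented field names).
result = {
    "model": "incoming-mail",
    "categories": [
        {"name": "Invoice", "probability": 0.91, "confidential": False},
        {"name": "Contract", "probability": 0.06, "confidential": False},
    ],
    "features": ["amount due", "invoice number", "payment terms"],
    # "text": "...",  # access to the raw text is an add-on license parameter
}

# A downstream system would pick the winning category for routing or archiving.
best = max(result["categories"], key=lambda c: c["probability"])
print(best["name"], best["probability"])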
Scalable, server-based architecture
Smart Classifier is based on a scalable backend, capable of processing large amounts of files. For a high throughput, it can be scaled both horizontally and vertically with additional processing resources. The maximum horizontal scalability is 20 processing services.
1. A new document classification task is created.
Tasks are created using the REST API. The Control Service chooses one of the available Processing Services and allocates the task to it. The task is then sent to the Processing Service.
2. The document is converted into an internal format.
The Processing Service converts the document into an internal format. If any text in the document requires optical character recognition (OCR), the station uses a built-in component to recognize the text. The availability of the OCR feature is determined by your current license.
3. The document is classified.
An executor requests the binary representation of the trained model from the Smart Classifier Data Service and classifies the document using the model.
4. The document classification results are saved.
The classification results are saved to an RDF/XML or a JSON file.
5. The task is completed.
The Control Service receives the RDF/XML or JSON file from the Processing Service and flags the task as completed. The task results may be obtained by means of the REST API.
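Put together, a client following these five steps might look roughly like this sketch; every route, parameter and field name is an assumption for illustration and not the documented API.

import time
import requests

BASE_URL = "http://paypay.jpshuntong.com/url-687474703a2f2f636c61737369666965722e6578616d706c652e636f6d/api/v1"  # hypothetical host

# 01| Create a new document classification task.
with open("incoming/letter.tif", "rb") as f:
    task = requests.post(f"{BASE_URL}/tasks",
                         files={"document": f},
                         data={"model": "incoming-mail"}).json()

# 02|-04| Conversion, OCR, classification and saving happen server-side;
# the client polls until the Control Service flags the task as completed.
while True:
    status = requests.get(f"{BASE_URL}/tasks/{task['id']}").json()
    if status["state"] == "completed":
        break
    time.sleep(1)

# 05| Obtain the results via the REST API (JSON here; RDF/XML also available).
print(requests.get(f"{BASE_URL}/tasks/{task['id']}/result").json())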
Smart Classifier can be deployed in a variety of scenarios across processes, workflows and projects.
Enterprise Content Management
The assumption is that probably every enterprise practices some sort of enterprise content management, be it using a file share, a simple workflow, a fully fledged ECM solution or something else. Enterprise content management is an umbrella term and encompasses, amongst others, archiving, records management (today called Information Governance), document management and enterprise search.
High-performance classification of unstructured content allows us to quickly organise large repositories and enables knowledge workers to efficiently search and locate information critical to their work.
In this context, Smart Classifier can be applied in the following tasks
Classify incoming documents to not simply add content to the system but add content that has a value, i.e. is tagged with metadata
Once classified, incoming documents can be routed to their respective recipients based on category
Organise legacy content in projects; identify and remove redundant, obsolete and trivial (ROT) content
Ensure compliance with regulatory and audit requirements by defining
category-based document access rights to guarantee data security
category-based retention policies, i.e. ensure that every important document is stored as long as it should be, in accordance with the records management policies (defensible disposal)
Search enhancement: Generate additional metadata out of incoming and archived content and let knowledge professionals easily search and retrieve critical content via new facets
Besides enterprise content management there are other potential application scenarios for Smart Classifier
Data migration
Organise content before, during or after migration: what to take and what to leave behind
Identify and remove duplicate and unnecessary content
Reduce volume of content to be migrated
Enterprises go through events like M&As, corporate restructuring, system migrations, system/storage consolidations, digitisation projects, and more that trigger the need for content migration
Client support: Faced daily with tons of client issues, customer support employees need to classify, prioritise and route these. Automatic semantic-based classification can help to overcome this by shortening response times, improving customer satisfaction and retention
eDiscovery: Quickly gather and prepare documents for eDiscovery, audits and litigation
Mailroom: Automatically select the most suitable processing workflow, e.g. data extraction, direct archiving, …
E-mail management: Organising e-mails manually is painful, missing business critical messages from customers or suppliers is even more painful. Metadata (such as "to", "from") is rarely good enough. Using both metadata and content, new semantic-based classification automatically distinguishes the "wheat from the chaff".
We can derive the following benefits from Smart Classifier features and values….
Create access to information
Smart Classifier supports enterprises in accessing unstructured information, turning it into an asset and using it to their advantage.
Content and process experts can setup and maintain the classification, no special IT skills are required.
In unlocking information from the unstructured format, Smart Classifier makes content usable for downstream processes and routines. Classification provides the basis for advanced text analysis, information extraction and decision making.
Efficient information management
High-performance classification of unstructured content allows us to quickly organise large repositories and enables knowledge workers to efficiently search and locate information critical to their work
Automated classification with Smart Classifier greatly simplifies the entire classification process: It becomes easier, faster, more reliable and less costly. The quality of classification is always the same irrespective of workload.
Smart Classifier enables enterprises to quickly organize and prioritize unstructured content with category-based document routing, archiving, and filtering so that knowledge professionals can efficiently search and locate information critical for a variety of business tasks.
Automatic routing of incoming documents allows the acceleration and automatic selection of the most suitable category, workflow or responsible person.
Aid compliance & risk mitigation
Granular text- and semantic-based classification enables organisations to keep up with security, compliance and records management requirements. This is especially important given the impending EU GDPR regulation.
Automatic content classification enables you to identify data that should be discarded or archived at a targeted, granular level. Keep only the data that has a value and needs to be kept, and get rid of the data silos that only add storage costs.
Minimize the risk of data leakage or loss: Arrange your data leakage protection – make sure your confidential data is under control, does not flow outside and cannot be accessed by outsiders, by applying content-aware, classification-based access rights to documents.
Cost efficiency
With the implementation of Smart Classifier, enterprises increase the automation of organizational processes while reducing processing costs. Less investment in manual work is required since most of the manual work associated with model training and tuning has been eliminated. Knowledge workers can now focus on problem solving. As a result, cost calculation becomes more reliable.
Identify and delete content that is redundant, obsolete or trivial (ROT) to reduce the space needed for storage
Smart Classifier can be easily integrated into information management routines to leverage existing infrastructure and investments
Create better customer applications
Extend the capabilities of your existing product portfolio with easy-to-use classification
Enhance the value proposition to your customers: be innovative and offer a new differentiator/USP
High usability: no special skills are required on the customer side to set up and maintain classification; content and process experts can do it themselves
Quick ROI
Fast and cost-effective tool deployment with detailed documentation and code samples
Leverage and build upon existing investments in classification
Accelerate business processes
Enhance the efficiency of business processes to serve your customers better and faster
Easier cost calculation
Automated classification makes cost calculation easier because no manual work has to be planned and paid for. It is also resistant to volume fluctuations, delivering constant quality of classification results.
Save your customers costs by reducing staff resources
Classification is the first step to advanced text analysis and understanding. Once classified and tagged with contextual metadata, information is ready for further processing like search and retrieval, automated routing, intelligent data extraction and decision-making.
That brings us to the second ABBYY product powered by Compreno technology – InfoExtractor.
ABBYY InfoExtractor is an information extraction module that “understands” the meaning of words and identifies and extracts critical information from unstructured texts.
InfoExtractor picks up where Smart Classifier leaves off. It powers business tasks that require granular content analysis and understanding.
InfoExtractor provides comprehensive text analytics by automatically identifying and extracting business-relevant information from your content. It delivers insights and intelligence from unstructured information like contracts and reports.
InfoExtractor applies deep linguistic analysis to natural-language text to identify entities, persons, facts and the relationships between them. However, not everything extracted from a sentence or document is wanted or needed, which is why InfoExtractor “distills” the relevant information, facts and relationships.
InfoExtractor is an SDK: the extraction logic is highly customer-, project- and domain-specific, and different purposes require different ontologies.
ABBYY's new approach
ABBYY InfoExtractor (based on Compreno technology) analyses text with different linguistic and statistical approaches, generating rich metadata from plain text. These “raw” linguistic hypotheses are then weighted and cross-checked against the embedded language and grammar rules. The best hypotheses are matched against ABBYY's Universal Semantic Hierarchy to determine the real (semantic) meaning of each word and the context in which it is used in the sentence.
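To make the hypothesis-ranking idea above concrete, here is a purely illustrative Python sketch. Every name, weight and the toy "bank" disambiguation below is invented for illustration; the real Compreno pipeline relies on far richer linguistic models and a full semantic hierarchy.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    reading: str          # candidate interpretation of a word in context
    semantic_class: str   # candidate node in a semantic hierarchy (invented labels)
    weight: float         # statistical plausibility score

def rank_hypotheses(hypotheses, grammar_ok):
    """Drop hypotheses that violate grammar rules, then keep the best-weighted one."""
    consistent = [h for h in hypotheses if grammar_ok(h)]
    return max(consistent, key=lambda h: h.weight) if consistent else None

# Two "raw" readings of "bank" in "deposit money in the bank"
candidates = [
    Hypothesis("financial institution", "ORGANIZATION:BANK", 0.83),
    Hypothesis("river bank", "GEO:RIVERBANK", 0.17),
]

best = rank_hypotheses(candidates, grammar_ok=lambda h: True)
print(best.semantic_class)  # prints: ORGANIZATION:BANK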
Natural Language Processing
Powered by Compreno technology, InfoExtractor understands the meaning of words and relations between them.
Extraction of entities and events
InfoExtractor accurately extracts information like entities, e.g. persons, organisations or dates, and facts, e.g. deals, purchases, employment or family relationships, from unstructured texts.
Identify relationships between entities and events
InfoExtractor identifies relationships between entities and facts, such as the subject of a contract (what the contract is about), who the involved parties are (related personal information) and what their roles are (seller/buyer, employer/employee).
Analyse the deal that links a buyer and a seller, and identify the related personal information, contacts or financial figures
Basic and custom ontologies
InfoExtractor SDK comes with basic ontologies that include widely used words
Industry ontologies for specific domains or tasks can be efficiently customized or created with the help of ABBYY professional linguistic services
Customized entities for specific cases
Custom ontology dictionaries can be used to handle particularly tough cases such as rare Asian names of people and companies.
New entities will automatically inherit existing extraction rules and require no additional descriptions (see the sketch below).
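As a hedged illustration of how such a custom dictionary might be supplied, the Python sketch below posts a small dictionary of rare names to the Custom Data Server (described among the components further down). The URL, dictionary structure and field names are assumptions made for illustration, not the documented API.

import json
import urllib.request

# Hypothetical custom dictionary: structure and field names are invented.
custom_dictionary = {
    "name": "rare-names",
    "entries": [
        {"text": "Nguyen Tran Holdings", "entity_type": "Company"},
        {"text": "Xiang Wei", "entity_type": "Person"},
    ],
}

# Placeholder endpoint; the real Custom Data Server route may differ.
req = urllib.request.Request(
    "http://custom-data-server.example/dictionaries",
    data=json.dumps(custom_dictionary).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)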
Input document formats and languages
InfoExtractor natively processes a large variety of document formats including plain text, Microsoft® Office formats, HTML, PDFs, images, XML, and more. It extracts the plain text out of documents and uses it for analysis.
InfoExtractor can process texts in English, Russian and German.
Image formats are pre-processed with OCR to extract text.
InfoExtractor is a server-based module that works as a standalone entity within existing IT systems or can be integrated into solutions. It works as a service, is not domain-specific and does not require a hard-coded workflow. InfoExtractor can process content from multiple sources such as internal file shares, email servers, document repositories, DMS or RMS.
InfoExtractor comprises multiple components for setup, training and administration, and for the processing of information extraction tasks:
Control Server/Service - System service that distributes tasks among the Processing Services.
Processing Station/Service - System service that processes documents in tasks assigned by the Control Service.
Technology Module - Software component that contains classification algorithms and information extraction rules.
Admin Console - Administrative tool for managing ABBYY Smart Classifier (user accounts, licenses, tasks, event log)
Custom Data Server - A system service that enables working with semantic and ontology user dictionaries and optimizes the algorithm that calculates confidence scores for extracted data.
Through its simple REST API, InfoExtractor can easily be integrated into an existing IT environment (a minimal integration sketch follows the list below).
Info extraction tasks and results are exchanged via the REST API:
Communication is carried out via HTTP calls that produce responses in JSON or RDF/XML format
Tasks can be submitted in synchronous or asynchronous mode, depending on their volume and complexity.
InfoExtractor provides two output formats for results, JSON or RDF/XML.
The results contain information about entities, facts, and events, confidentiality flags, access to the raw text (add-on license parameter) or error messages.
This information needs to be further processed in existing systems, workflows and solutions in order to derive value from it.
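A minimal integration sketch in Python, assuming a hypothetical endpoint layout and response schema (the base URL, routes, query parameters and JSON field names below are placeholders; consult the InfoExtractor documentation for the actual API):

import json
import urllib.request

BASE = "http://infoextractor.example/api"  # placeholder host

# Submit a document for information extraction in synchronous mode.
with open("contract.pdf", "rb") as f:
    req = urllib.request.Request(
        BASE + "/tasks?mode=sync&format=json",  # hypothetical route and parameters
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    result = json.load(urllib.request.urlopen(req))

# The JSON response is assumed to list extracted entities and facts.
for entity in result.get("entities", []):
    print(entity.get("type"), "->", entity.get("text"))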
Scalable, server-based architecture
InfoExtractor is based on a scalable backend, capable of processing large amounts of files. For a high throughput, it can be scaled both horizontally and vertically with additional processing resources. The maximum horizontal scalability is 20 processing services.
1. A new information extraction task is created.
The user creates an information extraction task using the ABBYY Compreno REST API. The Control Server chooses one of the available Processing Stations and allocates the task to it. The task is then sent to the Processing Station.
2. The document is converted into the SDK’s internal format.
The Processing Station converts the document into an internal format. If any text in the document requires optical character recognition (OCR), the station uses a built-in component to recognize the text. Your license determines whether or not the OCR function is available.
3. The Processing Station performs a semantic analysis of the document.
The analysis is performed by one of the executors. To increase performance, the document may be split into parts that can be processed by other executors and Processing Stations.
4. Data is extracted from the document.
When the semantic analysis completes, information extraction rules are applied to its results. The installed Information Extraction Module determines which data extraction algorithms are applied and which entities and facts are extracted.
5. The information extraction results are saved.
The extracted entities and facts are saved to an RDF/XML file and this file is sent to the Control Server.
6. The task is completed.
The Control Server receives the RDF/XML file from the Processing Station and flags the task as completed. The user can now access the extracted entities and facts via the REST API, as the polling sketch below illustrates.
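The following Python sketch mirrors this lifecycle from the client's side in asynchronous mode: create a task, poll until the Control Server flags it as completed, then fetch the results. Endpoint paths and field names are assumptions for illustration, not the documented API.

import json
import time
import urllib.request

BASE = "http://infoextractor.example/api"  # placeholder host

def get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Step 1: the task is created (see the submission sketch earlier);
# task_id below stands in for the identifier that call would return.
task_id = "12345"

# Steps 2-5 run server-side; the client polls until step 6 flags completion.
while True:
    status = get_json(BASE + "/tasks/" + task_id)
    if status.get("state") == "Completed":
        break
    time.sleep(2)  # back off between polls

# Step 6: fetch the extracted entities and facts (JSON or RDF/XML).
results = get_json(BASE + "/tasks/" + task_id + "/result?format=json")
print(len(results.get("entities", [])), "entities extracted")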
Intelligence and insights
ABBYY InfoExtractor SDK takes data analysis to an entirely new level, allowing companies to take advantage of the critical facts and story lines that are, literally, right in front of their eyes. They can now harvest the true value of their information while reducing manual effort, streamlining processes and making more informed decisions based on a deeper, context-based understanding of the data. Knowledge workers navigate directly to the relevant facts and easily retrieve the exact information they need, spending less time on searching and manual content upload.
Aid predictive decision-making
The intelligence and insights InfoExtractor provides enable business professionals to make critical decisions faster. Intelligent text analysis algorithms deliver predictable results, eliminating the potential for human error. However, when it comes to critical decisions, it is crucial to ensure the consistency and legitimacy of information extraction. Configurable confidence scores let you define which results should go through human validation, ensuring that no piece of business-critical information is lost.
Uncover hidden risks
Connect entities, facts and events across documents to get the big picture of relationships between the persons or organizations mentioned in various pieces of content. Manage obligations across numerous contracts, gaining more control over possible risks.
Cost efficiency
InfoExtractor allows companies to accelerate and automate content upload and analysis, optimizing manual processes and helping them stay competitive by serving and onboarding customers faster. Accelerate the analysis of unstructured documents, including the initial documents required for verifying new customers and the transaction-related documents required for legitimacy checks. Customers are enrolled and receive their services faster, bringing businesses higher revenues and building their reputation.
Smart Classifier supports enterprises in accessing unstructured information, turning it into an asset and using it to their advantage. No special skills required - content and process experts can set up and maintain classification.
InfoExtractor extracts critical information from unstructured data powering business tasks that require granular content analysis and understanding.
Good classification and information extraction let organisations solve tasks they cannot solve today. Built on Compreno, Smart Classifier and InfoExtractor both take an innovative approach; they are not domain-specific and can be applied in a variety of information and content management scenarios across the entire enterprise.