dr. sc. Marko Velić
Head of Styria Data Science
Enes Deumić
Senior Data Scientist
● Styria
○
○
● Styria Data Science Team
○
○
○
● Computer Vision
○
○
● NLP
○
● Users’ behavioral interests
○
○
● Combining NLP and CV research
● Tag Cleaning
● Tag Classification
● Contextual recommender
● Text generation
● Personalization - Overview
● Personalization - Deep Learning
Google Digital News Initiative (DNI)
“The DNI Fund has committed €150m over three years to the European news
ecosystem. It encourages new thinking in digital journalism, and gives news
organisations of all sizes the space to innovate and experiment.”
● “a personalised platform for regional news publishing”
● implemented in our custom CMS - Publitzer
● 750k EUR
● 3 main areas
○
■ merging, cleaning
■ recommendations
○
■ automatically find similar/relevant articles
○
■ personalized recommenders
■ personalized homepage
■ better experience
● in production on 24sata
○
○
○
● since early 2016
● CTR uplift 36.2%
● over 3,000,000 daily API calls
● easy-to-embed widget
● Motivation
○
○
○
● Solution - Tag recommender
○
○
○
● Motivation
○
■ Over 100k tags with only one appearance
■ Misspellings
■ Different names for the same entity (FC Dinamo, Dinamo, dinamo, CFC dinamo…)
■ Absurd tags: “two skeletons are riding a motorbike”
○
● Solution - tool for assisted tag merging - Sanitag
○
○
○
○
○
First results: 47658 tags merged into 7737 more relevant tags
● Problems
○
■ 130k tags with frequency 1
○
○
● Solution
○
■ Tag vector = mean(document vectors containing a given tag)
■ Tag “United States of America” and “USA” get similar representations
○
■ Tag vector = mean(fastText tag words)
■ Better than simple edit distance
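The "tag vector = mean of document vectors" idea above can be sketched as follows. This is a minimal illustration, not the team's implementation: the 3-dimensional document vectors are made up, and `tag_vector` and `cosine` are hypothetical helper names.

```python
import numpy as np

def tag_vector(doc_vectors):
    """Mean of the vectors of all documents that carry the tag."""
    return np.mean(np.asarray(doc_vectors, dtype=float), axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document vectors, invented for illustration only.
docs_usa = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
docs_united_states = [[0.85, 0.15, 0.05]]
docs_football = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

v_usa = tag_vector(docs_usa)
v_us = tag_vector(docs_united_states)
v_football = tag_vector(docs_football)

# Tags used on similar documents end up close together.
print(cosine(v_usa, v_us), cosine(v_usa, v_football))
```

Because the tag inherits its position from the documents it labels, two spellings of the same entity can land near each other even when their edit distance is large.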
● Different models
○
○
○
○
● Subword information improves results
Misspelled input tag:
● fastText supervised classifier baseline
○
● Trained on ~460K articles with ~17K associated tags
○
○
● Arbitrary number of predicted labels k; experimental k=10
● Future work
○
○
● Quantitatively evaluated using recall
○
○
○
○
● Example of model prediction
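The recall-based evaluation mentioned above could look roughly like this. The slides do not say which variant of recall was used, so recall over the top-k predicted tags is an assumption here, and `recall_at_k` and the example tags are hypothetical.

```python
def recall_at_k(predicted, relevant, k=10):
    """Fraction of the true tags that appear among the top-k predictions."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    top_k = set(predicted[:k])
    return len(top_k & relevant) / len(relevant)

# Predictions are assumed to be sorted by model score, best first.
predicted = ["dinamo", "football", "zagreb", "hnl", "sport"]
true_tags = ["dinamo", "hnl", "transfer"]

score = recall_at_k(predicted, true_tags, k=3)
```

With k=3 only "dinamo" is recovered out of three true tags; raising k to 5 also recovers "hnl".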
● Motivation
○
● Similarity
○
○
■ C4, C5, Picasso …
■ Peugeot, Renault, Opel…
● Triplets marked by annotators
○
○
○
● Automated triplets
○
■ Title
■ Lead
■ Text
○ (article1, article2)
○ (title + lead of article1, text of article1, text of article2)
Source: Andrew Ng CS294A lecture notes
● BOW input
● Can be deep
● Constraints on embedding layer
● Hyperparameter optimization maximizing some external metric
Dai, A. M., Olah, C., & Le, Q. V. (2015). Document Embedding with Paragraph Vectors
Figure panels: PV-Distributed Memory, PV-Distributed Bag of Words
● Triplet architecture
● Shared weights
● Temporal CNN for text with self-attention
○
○
● Triplet margin loss
Deep metric learning using Triplet network
arXiv:1412.6622v3 [cs.LG], fig 1.
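A triplet margin loss is commonly written as a hinge on the gap between the anchor-positive and anchor-negative distances. The sketch below uses that common form; the Euclidean distance and the margin value of 1.0 are assumptions, since the slides do not specify them.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean over the batch of max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

# One toy triplet of 2-dimensional embeddings (illustrative values).
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
far_negative = np.array([[3.0, 0.0]])
near_negative = np.array([[1.0, 0.0]])

easy = triplet_margin_loss(anchor, positive, far_negative)
hard = triplet_margin_loss(anchor, positive, near_negative)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only "hard" triplets contribute gradient.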
● Generate triplet dataset
○
■ Randomly sample an article
■ Calculate n = 15 most similar articles
■ Randomly select two numbers (i, j) in range (0, n - 1)
■ Positive example min(i, j), negative example max(i, j)
○
■ Positive example from the same child node
■ Negative example from sister node or different parent
○
■ Anchor title + lead from article A
■ Positive example, text (excluding title and lead) from A
■ Negative example, text from an article similar to A
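The first sampling scheme above (rank the n most similar articles, draw two ranks, use the closer one as the positive) can be sketched like this. The function name and the article ids are hypothetical, and drawing two distinct ranks is an assumption so that the positive and negative never coincide.

```python
import random

def sample_similarity_triplet(anchor_id, ranked_similar, rng=random):
    """Build (anchor, positive, negative) from a similarity ranking.

    ranked_similar holds the ids of the n most similar articles,
    most similar first.
    """
    i, j = rng.sample(range(len(ranked_similar)), 2)  # two distinct ranks
    positive = ranked_similar[min(i, j)]  # the closer article
    negative = ranked_similar[max(i, j)]  # the farther article
    return anchor_id, positive, negative

rng = random.Random(0)            # seeded for reproducibility
ranked = list(range(100, 115))    # n = 15 illustrative article ids
anchor, positive, negative = sample_similarity_triplet(7, ranked, rng)
```

Both examples come from the anchor's neighbourhood, so the network has to learn a fine-grained ordering rather than just separating related from unrelated articles.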
● Create multiple batch generators
● Each batch
○
■ Category triplets
■ Doc2vec triplets
■ Anchor lead + title, anchor text, similar text
■ Anchor text, some close article, random article text
■ …
○
■ Depending on a triplet type, use different margins
■ Classification loss
● Predict category from embedding (cross-entropy)
● Predict tags from embeddings (multilabel sigmoid loss)
● Better generalization
Model          Params         Lead/text triplets   Category triplets
Autoencoder*   ndim=300       0.790*               0.700*
fastText mean  ndim=256       0.911                0.687
Doc2Vec        DM, ndim=300   0.969                0.650
Doc2Vec        DM, ndim=32    0.883                0.687
Triplet CNN    ndim=256       0.953                0.731
Triplet CNN    ndim=32        0.927                0.721
* evaluation dataset not identical, but representative
● Possible improvements
○
■ Predict one or more words given paragraph from an article
■ Tags classification
○
■ Use language model to create word embeddings
● ELMo - Deep contextualized word representations
Text generation
● Hierarchical Neural Story Generation, by Angela Fan, Mike Lewis, Yann Dauphin, Facebook AI Research (arXiv:1805.04833v1 [cs.CL])
● Story prompt:
○
● Story:
○
● Prompt:
○
● Generated story
○
Personalization
Explicit feedback
- each user interaction with content has a numerical value representing how much the user enjoyed the content, for example movie ratings
- low rating: the user didn’t like the content
- high rating: the user liked the content
- try to predict ratings for each item and recommend based on these predictions
Implicit feedback
- user either interacted with the content, or did not.
- interaction probably means user is interested in the content
- lack of interaction is hard to interpret
user - item matrix (items i1-i5; each row lists the observed ratings only)
u1: 4 4 5
u2: 1 3 3
u3: 4 4 2 2
u4: 2 4 3
u5: 3 5 4
u6: 3 5
Motivation
- the user - item matrix can be very large, for example 1 000 000 users and 100 000 items,
- we can compare only the users who interacted with the same content,
- we would like a dense representation in a low-dimensional latent space.
Optimization objective
- (R)MSE
- terms: observed rating, dot product of user and item, regularization terms
Optimization objective
- terms: confidence in the observed value, binary indicator r (0 or 1 for news articles)
Y. Hu, Y. Koren, C. Volinsky (2008). Collaborative Filtering for Implicit Feedback Datasets
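For reference, the two objectives described by the labels above are usually written as follows. The notation is that of Hu, Koren and Volinsky (2008) and is an assumption about what the slides displayed: $x_u$ and $y_i$ are the latent user and item vectors, $\mathcal{K}$ is the set of observed ratings, $\lambda$ is the regularization strength, $p_{ui}$ is the binary indicator and $c_{ui}$ the confidence.

```latex
% Explicit feedback: squared error on observed ratings plus regularization
\min_{x_*,\, y_*} \sum_{(u,i) \in \mathcal{K}}
    \left( r_{ui} - x_u^{\top} y_i \right)^2
    + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

% Implicit feedback: every (u, i) pair, weighted by confidence c_{ui}
\min_{x_*,\, y_*} \sum_{u,i}
    c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
    + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
```

The implicit objective sums over all user-item pairs rather than only the observed ones, which is why the confidence weights matter: unobserved pairs count as weak negatives.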
● Requirements
○
○
○
○
● Model features
○
○
○
○
○
○
● Spotify personalization
○
○
○
○
● 24sata equivalent
○
○
○
○
● One model to rule them all
○
● Input data
○ input table (only the last column header, "query", is recoverable)
● Output
○
● Model
○
■ Can take sequence of read articles (Temporal CNN or LSTM)
■ Can take huge amounts of data in batches
■ Learns very complicated interactions
■ We can model categorical data as embeddings
● Problems
○
○
○
● Solutions
○
○
■ If a user saw the article more than once and decided not to click, we are more confident that the user didn’t click because she was not interested, so we increase the weight
○
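One way to express "more impressions without a click means a more confident negative" is a per-example weight on the loss. The sketch below is an assumption about how such weighting might look, not the deployed model: the linear weight formula, the `alpha` value and both function names are hypothetical.

```python
import math

def negative_weight(impressions, alpha=0.5):
    """Weight for a non-clicked article; grows with repeated impressions."""
    return 1.0 + alpha * max(impressions - 1, 0)

def weighted_log_loss(p_click, clicked, impressions, alpha=0.5):
    """Binary cross-entropy where confident negatives count for more."""
    eps = 1e-12  # guards against log(0)
    if clicked:
        return -math.log(max(p_click, eps))
    weight = negative_weight(impressions, alpha)
    return -weight * math.log(max(1.0 - p_click, eps))

# Same predicted click probability, different number of impressions.
seen_once = weighted_log_loss(0.5, clicked=False, impressions=1)
seen_three_times = weighted_log_loss(0.5, clicked=False, impressions=3)
```

An article shown three times and never clicked is penalized twice as heavily here as one shown once, so the model is pushed harder to score it low.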
Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic

Health care analysis using sentimental analysis
 
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
 
MySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdfMySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdf
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdfsaps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
 
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
🔥Mature Women / Aunty Call Girl Chennai 💯Call Us 🔝 8094342248 🔝💃Top Class Cal...
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your DoorHyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
Hyderabad Call Girls 7339748667 With Free Home Delivery At Your Door
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
 

Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic

  • 1. dr. sc. Marko Velić Head of Styria Data Science Enes Deumić Senior Data Scientist
  • 2. ● Styria ● Styria Data Science Team
  • 4. ● Computer Vision ● NLP ● Users’ behavioral interests ● Combining NLP and CV research
  • 5. ● Tag Cleaning ● Tag Classification ● Contextual recommender ● Text generation ● Personalization - Overview ● Personalization - Deep Learning
  • 6. Google Digital News Initiative (DNI)
  • 7. “The DNI Fund has committed €150m over three years to the European news ecosystem. It encourages new thinking in digital journalism, and gives news organisations of all sizes the space to innovate and experiment.”
  • 8. ● “a personalised platform for regional news publishing”
       ● implemented in our custom CMS - Publitzer
       ● 750k EUR
       ● 3 main areas
         ○
           ■ merging, cleaning
           ■ recommendations
         ○
           ■ automatically find similar/relevant articles
         ○
           ■ personalized recommenders
           ■ personalized homepage
           ■ better experience
  • 9. ● in production on 24sata ● since early 2016 ● CTR uplift 36.2% ● over 3,000,000 daily API calls ● easy-to-embed widget
  • 10. ● Motivation ● Solution - Tag recommender
  • 11. ● Motivation
          ○
            ■ Over 100k tags with only one appearance
            ■ Misspellings
            ■ Different names for the same entity (FC Dinamo, Dinamo, dinamo, CFC dinamo…)
            ■ Absurd tags: “two skeletons are riding a motorbike”
        ● Solution - tool for assisted tag merging - Sanitag
  • 12. First results: 47658 tags merged into 7737 more relevant tags
  • 13. ● Problems
          ○
            ■ 130k tags with frequency 1
        ● Solution
          ○
            ■ Tag vector = mean(document vectors containing a given tag)
            ■ Tag “United States of America” and “USA” get similar representations
          ○
            ■ Tag vector = mean(fastText tag words)
            ■ Better than simple edit distance
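The "tag vector = mean of word vectors" idea on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the 3-dimensional vectors and the `WORD_VECTORS` dictionary are made up, and a plain dictionary lookup stands in for a trained fastText model (which would also use subword information for unseen or misspelled words).

```python
import numpy as np

# Hypothetical toy word vectors; real ones would come from a trained model.
WORD_VECTORS = {
    "fc": np.array([0.1, 0.9, 0.0]),
    "dinamo": np.array([0.8, 0.1, 0.1]),
    "hajduk": np.array([0.1, 0.1, 0.9]),
}

def tag_vector(tag, word_vectors):
    """Mean of the vectors of the words in a tag; unknown words are skipped."""
    vecs = [word_vectors[w] for w in tag.lower().split() if w in word_vectors]
    if not vecs:
        return None  # no known word, so no representation in this sketch
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two spellings of the same entity should end up closer than unrelated tags.
sim_same = cosine(tag_vector("FC Dinamo", WORD_VECTORS),
                  tag_vector("dinamo", WORD_VECTORS))
sim_other = cosine(tag_vector("FC Dinamo", WORD_VECTORS),
                   tag_vector("hajduk", WORD_VECTORS))
```

Candidate merges for a tool like Sanitag could then be ranked by this similarity, although how the tool actually ranks candidates is not stated in the slides.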
  • 14. ● Different models ● Subword information improves results. Misspelled input tag:
  • 15. ● fastText supervised classifier baseline ● Trained on ~460K articles with ~17K associated tags ● Arbitrary number of predicted labels k; experimental k=10 ● Future work
  • 16. ● Quantitatively evaluated using recall ● Example of model prediction
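The slides say only that tag classification was evaluated using recall, with k predicted labels per article. One common way to compute that, assumed here rather than taken from the talk, is recall at k: the share of an article's true tags that appear among the top k predictions. The function name and example tags below are hypothetical.

```python
def recall_at_k(true_tags, ranked_predictions, k=10):
    """Fraction of the true tags found among the top-k ranked predictions."""
    true_set = set(true_tags)
    if not true_set:
        return 0.0  # convention chosen for this sketch when there are no true tags
    hits = true_set & set(ranked_predictions[:k])
    return len(hits) / len(true_set)

# One of the two true tags is in the top 2 predictions.
example = recall_at_k(["sport", "dinamo"], ["dinamo", "politics", "sport"], k=2)
```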
  • 17. ● Motivation
        ● Similarity
          ○
            ■ C4, C5, Picasso…
            ■ Peugeot, Renault, Opel…
  • 18. ● Triplets marked by annotators
        ● Automated triplets
          ○
            ■ Title
            ■ Lead
            ■ Text
          ○ (article1, article2)
          ○ (title + lead of article1, text of article1, text of article2)
  • 20. Source: Andrew Ng CS294A lecture notes ● BOW input ● Can be deep ● Constraints on embedding layer ● Hyperparameter optimization maximizing some external metric
  • 21. Dai, A. M., Olah, C., & Le, Q. V. (2015). Document Embedding with Paragraph Vectors. PV-Distributed Memory, PV-Distributed Bag of Words
  • 22. ● Triplet architecture ● Shared weights ● Temporal CNN for text with self-attention ● Triplet margin loss. Figure: Deep metric learning using Triplet network, arXiv:1412.6622v3 [cs.LG], fig. 1.
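A triplet margin loss in its usual hinge form can be sketched as below. The slides do not give the distance function or the margin, so Euclidean distance and `margin=1.0` are assumptions made for illustration; the embeddings would come from the shared-weight network described on the slide.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean over the batch of max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-negative distance
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

a = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])
far = np.array([[3.0, 4.0]])

loss_good = triplet_margin_loss(a, near, far)  # positive is closer: no penalty
loss_bad = triplet_margin_loss(a, far, near)   # positive is farther: penalised
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what pushes similar articles together in the embedding space.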
  • 23. ● Generate triplet dataset
          ○
            ■ Randomly sample an article
            ■ Calculate n = 15 most similar articles
            ■ Randomly select two numbers (i, j) in range (0, n - 1)
            ■ Positive example min(i, j), negative example max(i, j)
          ○
            ■ Positive example from the same child node
            ■ Negative example from sister node or different parent
          ○
            ■ Anchor title + lead from article A
            ■ Positive example, text (excluding title and lead) from A
            ■ Negative example, text from an article similar to A
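The first sampling scheme on this slide (sample an article, take its n = 15 most similar articles, pick two ranks, use the better-ranked one as the positive) might look roughly like this. `most_similar` and `toy_most_similar` are hypothetical stand-ins for whatever similarity model supplies the ranked neighbours, and drawing two distinct indices is an assumption so that the positive and negative never coincide.

```python
import random

def sample_similarity_triplet(article_ids, most_similar, n=15, rng=random):
    """Return (anchor, positive, negative) ids from a ranked neighbour list."""
    anchor = rng.choice(article_ids)
    neighbours = most_similar(anchor, n)          # ranked, most similar first
    i, j = rng.sample(range(len(neighbours)), 2)  # two distinct ranks
    positive = neighbours[min(i, j)]              # better-ranked neighbour
    negative = neighbours[max(i, j)]              # worse-ranked neighbour
    return anchor, positive, negative

def toy_most_similar(article_id, n):
    """Hypothetical neighbour lookup: the next n ids, wrapping at 100."""
    return [(article_id + 1 + k) % 100 for k in range(n)]

triplet = sample_similarity_triplet(list(range(100)), toy_most_similar,
                                    n=15, rng=random.Random(0))
```

Because both examples come from the same neighbour list, the negative is still a fairly similar article, which makes these triplets harder than ones with a randomly chosen negative.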
  • 24. ● Create multiple batch generators
        ● Each batch
          ○
            ■ Category triplets
            ■ Doc2vec triplets
            ■ Anchor lead + title, anchor text, similar text
            ■ Anchor text, some close article, random article text
            ■ …
          ○
            ■ Depending on a triplet type, use different margins
            ■ Classification loss
              ● Predict category from embedding (cross-entropy)
              ● Predict tags from embeddings (multilabel sigmoid loss)
              ● Better generalization
  • 25. Model          Params         Lead/text triplets   Category triplets
        Autoencoder*   ndim=300       0.790*               0.700*
        fastText mean  ndim=256       0.911                0.687
        Doc2Vec        DM, ndim=300   0.969                0.650
        Doc2Vec        DM, ndim=32    0.883                0.687
        Triplet CNN    ndim=256       0.953                0.731
        Triplet CNN    ndim=32        0.927                0.721
        * evaluation dataset not identical, but representative
  • 26. ● Possible improvements
          ○
            ■ Predict one or more words given a paragraph from an article
            ■ Tags classification
          ○
            ■ Use language model to create word embeddings
              ● ELMo - Deep contextualized word representations
  • 29. ● Hierarchical Neural Story Generation, by Facebook AI Research, Angela Fan, Mike Lewis, Yann Dauphin (arXiv:1805.04833v1 [cs.CL]) ● Story prompt: ● Story:
  • 32. Explicit feedback - user interaction with content has a numerical value representing how much the user enjoyed the content, for example movie ratings
        - low rating: user didn’t like the content
        - high rating: user liked the content
        - try to predict ratings for each item, recommend based on these predictions
        Implicit feedback - user either interacted with the content, or did not
        - interaction probably means user is interested in the content
        - lack of interaction is hard to interpret
  • 33. user - item matrix i1 i2 i3 i4 i5 u1 4 4 5 u2 1 3 3 u3 4 4 2 2 u4 2 4 3 u5 3 5 4 u6 3 5
  • 34. Motivation
        - user-item matrix can be very large, for example 1 000 000 users and 100 000 items
        - we can compare only the users who interacted with the same content
        - we would like a dense representation in a low-dimensional latent space
  • 35. Optimization objective - (R)MSE: observed rating, dot product of user and item, regularization terms
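The formula itself is not in the transcript, but the three labelled parts (observed rating, dot product of user and item, regularization terms) match the standard regularized matrix factorization objective: the sum over observed ratings of (r_ui - x_u · y_i)^2, plus λ times the squared norms of the user and item factors. A minimal sketch under that assumption, with `lam` and the toy matrices chosen only for illustration:

```python
import numpy as np

def mf_loss(R, mask, U, V, lam=0.1):
    """Squared error on observed ratings plus L2 regularization.

    R:    ratings matrix (users x items), values only meaningful where mask == 1
    mask: 1 where a rating was observed, 0 elsewhere
    U, V: user and item factor matrices (rows are latent vectors)
    """
    pred = U @ V.T                    # dot product of every user and item
    err = (R - pred) * mask           # ignore unobserved cells
    reg = lam * ((U ** 2).sum() + (V ** 2).sum())
    return float((err ** 2).sum() + reg)

R = np.array([[4.0, 0.0], [0.0, 2.0]])
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
U = np.array([[2.0], [1.0]])          # one latent dimension for brevity
V = np.array([[2.0], [2.0]])

loss = mf_loss(R, mask, U, V, lam=0.1)
```

Here both observed ratings are reproduced exactly, so the whole loss comes from the regularization term.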
  • 36. Optimization objective: confidence in observed value; binary indicator r (0 or 1) for news articles. Y. Hu, Y. Koren, C. Volinsky (2008). Collaborative Filtering for Implicit Feedback Datasets
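In the cited paper by Hu, Koren and Volinsky, each cell gets a binary preference p_ui (1 if the user interacted, else 0) and a confidence c_ui = 1 + α·r_ui, and the loss is the confidence-weighted squared error over all cells plus regularization. The sketch below follows that form; the values of `alpha` and `lam` are tunable and picked here only for illustration, and with binary r (as on the slide) the preference equals r.

```python
import numpy as np

def implicit_mf_loss(R, U, V, alpha=40.0, lam=0.1):
    """Confidence-weighted squared error over all user-item cells."""
    P = (R > 0).astype(float)         # binary preference
    C = 1.0 + alpha * R               # confidence grows with observed interaction
    err = P - U @ V.T
    reg = lam * ((U ** 2).sum() + (V ** 2).sum())
    return float((C * err ** 2).sum() + reg)

R = np.array([[1.0, 0.0]])            # one user: read item 0, did not read item 1
U = np.array([[1.0]])

fit = implicit_mf_loss(R, U, np.array([[1.0], [0.0]]))    # predicts [1, 0]
miss = implicit_mf_loss(R, U, np.array([[0.0], [0.0]]))   # predicts [0, 0]
```

Missing an observed interaction is penalised with weight 1 + α, while unobserved cells keep weight 1, which reflects the point that lack of interaction is hard to interpret.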
  • 37. ● Requirements ● Model features
  • 38. ● Spotify personalization ● 24sata equivalent
  • 39. ● One model to rule them all
        ● Input data
          ○ | | | | | | | |query
        ● Output
        ● Model
          ○
            ■ Can take sequence of read articles (Temporal CNN or LSTM)
            ■ Can take huge amounts of data in batches
            ■ Learns very complicated interactions
            ■ We can model categorical data as embeddings
  • 40. ● Problems
        ● Solutions
          ○
            ■ If a user saw the article more than once and decided not to click, we are more confident that the user didn’t click because she was not interested, so we increase the weight
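The weighting idea on the last slide (more impressions without a click means more confidence in the negative) could be expressed as per-example weights in a binary cross-entropy loss. The linear rule `1 + beta * (impressions - 1)` and the value of `beta` are purely hypothetical; the slides do not say how the weight is increased.

```python
import numpy as np

def impression_weights(clicked, n_impressions, beta=0.5):
    """Weight 1 for clicks; non-clicks get heavier with repeated impressions."""
    clicked = np.asarray(clicked, dtype=float)
    n = np.asarray(n_impressions, dtype=float)
    neg_weight = 1.0 + beta * (n - 1.0)   # assumed linear growth
    return np.where(clicked == 1.0, 1.0, neg_weight)

def weighted_bce(p, clicked, weights, eps=1e-7):
    """Weighted mean binary cross-entropy of click probabilities p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(clicked, dtype=float)
    losses = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float((weights * losses).sum() / weights.sum())

# One click, one non-click seen once, one non-click seen three times.
w = impression_weights([1, 0, 0], [1, 1, 3])
loss = weighted_bce([0.9, 0.2, 0.2], [1, 0, 0], w)
```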