Machine Learning in News Media: Case study 24sata.hr - Marko Velic, Enes Deumic

dr. sc. Marko Velić
Head of Styria Data Science
Enes Deumić
Senior Data Scientist

● Styria
○
○
● Styria Data Science Team
○
○
○

● Computer Vision
○
○
● NLP
○
● Users’ behavioral interests
○
○
● Combining NLP and CV research

● Tag Cleaning
● Tag Classification
● Contextual recommender
● Text generation
● Personalization - Overview
● Personalization - Deep Learning

Google Digital News Initiative (DNI)

“The DNI Fund has committed €150m over three years to the European news
ecosystem. It encourages new thinking in digital journalism, and gives news
organisations of all sizes the space to innovate and experiment.”

● “a personalised platform for regional news publishing”
● implemented in our custom CMS - Publitzer
● 750k EUR
● 3 main areas
○
■ merging, cleaning
■ recommendations
○
■ automatically find similar/relevant articles
○
■ personalized recommenders
■ personalized homepage
■ better experience

● in production on 24sata
○
○
○
● since early 2016
● CTR uplift 36.2%
● over 3,000,000 daily API calls
● easy-to-embed widget

● Motivation
○
○
○
● Solution - Tag recommender
○
○
○

● Motivation
○
■ Over 100k tags with only one appearance
■ Misspellings
■ Different names for the same entity (FC Dinamo, Dinamo, dinamo, CFC dinamo…)
■ Absurd tags: “two skeletons are riding a motorbike”
○
● Solution - tool for assisted tag merging - Sanitag
○
○
○
○
○

First results:
47658 tags merged to 7737 more relevant tags

● Problems
○
■ 130k tags with frequency 1
○
○
● Solution
○
■ Tag vector = mean(document vectors containing a given tag)
■ Tag “United States of America” and “USA” get similar representations
○
■ Tag vector = mean(fastText tag words)
■ Better than simple edit distance
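The first tag-vector definition above can be sketched as follows. This is a minimal illustration with made-up 3-dimensional document vectors; the function names `tag_vectors` and `cosine` are hypothetical and not taken from the talk.

```python
import numpy as np

def tag_vectors(doc_vectors, doc_tags):
    """Average the vectors of all documents that carry each tag."""
    sums, counts = {}, {}
    for vec, tags in zip(doc_vectors, doc_tags):
        for tag in tags:
            sums[tag] = sums.get(tag, 0.0) + vec
            counts[tag] = counts.get(tag, 0) + 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy document vectors (assumed values, for illustration only).
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
doc_tags = [["USA"], ["United States of America", "USA"], ["Dinamo"]]

vectors = tag_vectors(doc_vectors, doc_tags)
sim_same = cosine(vectors["USA"], vectors["United States of America"])
sim_diff = cosine(vectors["USA"], vectors["Dinamo"])
```

Tags that appear on similar documents end up close together (`sim_same` is near 1), while unrelated tags do not (`sim_diff` is 0 here), which is what makes the representation usable for suggesting merge candidates.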

● Different models
○
○
○
○
● Subword information improves results
Misspelled input tag:

● fastText supervised classifier baseline
○
● Trained on ~460K articles with ~17K associated tags
○
○
● Arbitrary number of predicted labels k; experimental k=10
● Future work
○
○

● Quantitatively evaluated using recall
○
○
○
○
● Example of model prediction
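A recall measure for a tag classifier that returns k ranked labels could look like the sketch below. The exact variant used in the evaluation is not given on the slide, so the per-article recall at k shown here, and the name `recall_at_k`, are assumptions.

```python
def recall_at_k(true_tags, ranked_predictions, k=10):
    """Fraction of an article's true tags found among the top-k predictions."""
    true_set = set(true_tags)
    if not true_set:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return len(top_k & true_set) / len(true_set)

# Made-up example: three editor-assigned tags, four ranked predictions.
true_tags = ["dinamo", "football", "zagreb"]
predicted = ["football", "hajduk", "dinamo", "sport"]
score = recall_at_k(true_tags, predicted, k=3)
```

With k=3 the top predictions cover two of the three true tags, so `score` is 2/3.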

● Motivation
○
● Similarity
○
○
■ C4, C5, Picasso …
■ Peugeot, Renault, Opel…

● Triplets marked by annotators
○
○
○
● Automated triplets
○
■ Title
■ Lead
■ Text
○ (article1, article2)
○ (title + lead of article1, text of article1, text of article2)

Source: Andrew Ng CS294A lecture notes
● BOW input
● Can be deep
● Constraints on embedding layer
● Hyperparameter optimization maximizing some external metric

A. M. Dai, C. Olah, Q. V. Le (2015). Document Embedding with Paragraph Vectors
PV-Distributed Memory (left), PV-Distributed Bag of Words (right)

● Triplet architecture
● Shared weights
● Temporal CNN for text with self-attention
○
○
● Triplet margin loss
Source: Deep metric learning using Triplet network, arXiv:1412.6622v3 [cs.LG], fig. 1
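A triplet margin loss is commonly written as a hinge on the difference between the anchor-positive and anchor-negative distances. The sketch below uses that common form with Euclidean distance on plain NumPy arrays; the distance function and margin value are assumptions, not details confirmed by the slide.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(0, d(a, p) - d(a, n) + margin) over a batch of embeddings."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Two toy triplets: the first is already well separated, the second is not.
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[1.0, 0.0], [3.0, 0.0]])
negative = np.array([[3.0, 0.0], [1.0, 0.0]])
loss = triplet_margin_loss(anchor, positive, negative, margin=1.0)
```

The first triplet contributes 0 because the negative is more than one margin farther away than the positive; the second contributes 3, so the batch mean is 1.5.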

● Generate triplet dataset
○
■ Randomly sample an article
■ Calculate n = 15 most similar articles
■ Randomly select two ranks (i, j) in range (0, n - 1)
■ Positive example is the article at rank min(i, j), negative example the one at rank max(i, j)
○
■ Positive example from the same child node
■ Negative example from sister node or different parent
○
■ Anchor title + lead from article A
■ Positive example, text (excluding title and lead) from A
■ Negative example, text from an article similar to A
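The first sampling scheme in the list (pick an article, take its n most similar articles, choose two ranks) can be sketched as below. Cosine similarity on precomputed embeddings and sampling the two ranks without replacement are assumptions; `sample_similarity_triplet` is a hypothetical name.

```python
import numpy as np

def sample_similarity_triplet(embeddings, rng, n=15):
    """Return (anchor, positive, negative) row indices into `embeddings`."""
    num_docs = embeddings.shape[0]
    anchor = int(rng.integers(num_docs))
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[anchor]
    sims[anchor] = -np.inf                 # never return the anchor as its own neighbour
    neighbours = np.argsort(-sims)[:n]     # most similar first
    # Assumption: the two ranks are distinct, so positive and negative differ.
    i, j = rng.choice(len(neighbours), size=2, replace=False)
    return anchor, int(neighbours[min(i, j)]), int(neighbours[max(i, j)])

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))      # random stand-in for article vectors
anchor, positive, negative = sample_similarity_triplet(embeddings, rng)
```

Because the positive always has the lower rank, it is at least as similar to the anchor as the negative, yet both come from the anchor's neighbourhood, which makes these triplets harder than ones built with random negatives.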

● Create multiple batch generators
● Each batch
○
■ Category triplets
■ Doc2vec triplets
■ Anchor lead + title, anchor text, similar text
■ Anchor text, some close article, random article text
■ …
○
■ Depending on the triplet type, use different margins
■ Classification loss
● Predict category from embedding (cross-entropy)
● Predict tags from embeddings (multilabel sigmoid loss)
● Better generalization
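One way to combine several batch generators, assuming each batch draws the same number of triplets from every generator and attaches a margin per triplet type, is sketched below. The margin values, the `triplet_stream` placeholder generator and the dictionary layout are all hypothetical.

```python
import itertools

# Hypothetical margins per triplet type; the real values are not given.
MARGINS = {"category": 1.0, "doc2vec": 0.5, "lead_text": 0.2}

def triplet_stream(kind):
    """Stand-in for a real generator; yields placeholder triplets forever."""
    for step in itertools.count():
        yield {"kind": kind,
               "triplet": (f"{kind}-a{step}", f"{kind}-p{step}", f"{kind}-n{step}")}

def mixed_batches(streams, per_type):
    """Yield batches holding `per_type` triplets from every stream."""
    while True:
        batch = []
        for stream in streams:
            batch.extend(itertools.islice(stream, per_type))
        for item in batch:
            item["margin"] = MARGINS[item["kind"]]   # margin depends on triplet type
        yield batch

streams = [triplet_stream(kind) for kind in MARGINS]
first_batch = next(mixed_batches(streams, per_type=2))
```

Each item then carries the margin to use in the triplet loss, and a classification loss (category or tags) could be added on top of the same embeddings.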

Model          Params         Lead/text triplets   Category triplets
Autoencoder*   ndim=300       0.790*               0.700*
fastText mean  ndim=256       0.911                0.687
Doc2Vec        DM, ndim=300   0.969                0.650
Doc2Vec        DM, ndim=32    0.883                0.687
Triplet CNN    ndim=256       0.953                0.731
Triplet CNN    ndim=32        0.927                0.721
* evaluation dataset not identical, but representative

● Possible improvements
○
■ Predict one or more words given paragraph from an article
■ Tags classification
○
■ Use language model to create word embeddings
● ELMo - Deep contextualized word representations

● Hierarchical Neural Story Generation, by Facebook AI Research, Angela Fan,
Mike Lewis, Yann Dauphin (arXiv:1805.04833v1 [cs.CL])
● Story prompt:
○
● Story:
○

● Prompt:
○
● Generated story
○

Explicit feedback
- user interaction with content has a numerical value representing how much the user enjoyed the content, for example movie ratings
- low rating: user didn’t like the content
- high rating: user liked the content
- try to predict ratings for each item and recommend based on these predictions
Implicit feedback
- user either interacted with the content or did not
- interaction probably means the user is interested in the content
- lack of interaction is hard to interpret

user - item matrix
items: i1 i2 i3 i4 i5
ratings per user (remaining cells empty):
u1: 4 4 5
u2: 1 3 3
u3: 4 4 2 2
u4: 2 4 3
u5: 3 5 4
u6: 3 5

Motivation
- the user - item matrix can be very large, for example 1 000 000 users and 100 000 items
- we can compare only users who interacted with the same content
- we would like a dense representation in a low-dimensional latent space

Optimization objective
- (R)MSE between the observed rating and the dot product of user and item vectors
- plus regularization terms
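A standard way to write this objective for explicit ratings is given below. The notation is an assumption: $r_{ui}$ is the observed rating, $x_u$ and $y_i$ are the latent user and item vectors, $\mathcal{K}$ is the set of observed (user, item) pairs and $\lambda$ is the regularization strength.

```latex
\min_{x_*, y_*} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

The first sum is the squared error over observed ratings only, and the second term keeps the latent vectors small.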

Optimization objective (implicit feedback)
- each squared error is weighted by the confidence in the observed value
- binary indicator r: 0 or 1 for news articles
Y. Hu, Y. Koren, C. Volinsky (2008). Collaborative Filtering for Implicit Feedback Datasets
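The confidence-weighted objective from Hu, Koren and Volinsky can be evaluated as in the sketch below, using the paper's confidence c = 1 + alpha * r and binary preference p = 1 if r > 0. The values of `alpha` and `lam` and the toy matrices are illustrative assumptions; for news articles r is already 0 or 1, so p equals r.

```python
import numpy as np

def implicit_objective(R, X, Y, alpha=40.0, lam=0.1):
    """Confidence-weighted squared error plus L2 regularization.

    R: interaction matrix (users x items), X: user factors, Y: item factors.
    """
    P = (R > 0).astype(float)          # binary preference
    C = 1.0 + alpha * R                # confidence in each observation
    errors = P - X @ Y.T               # preference minus dot product
    reg = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return float(np.sum(C * errors ** 2) + reg)

# Toy case: two users, two articles, factors that reproduce R exactly.
R = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = implicit_objective(R, X, Y, alpha=40.0, lam=0.1)
```

Unlike the explicit case, the sum runs over all (user, item) pairs, including the zeros; the confidence weights are what keep unobserved pairs from dominating. Here the reconstruction is exact, so only the regularization term (0.4) remains.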

● Requirements
○
○
○
○
● Model features
○
○
○
○
○
○

● Spotify personalization
○
○
○
○
● 24sata equivalent
○
○
○
○

● One model to rule them all
○
● Input data
○ | | | | | | | |query
● Output
○
● Model
○
■ Can take sequence of read articles (Temporal CNN or LSTM)
■ Can take huge amounts of data in batches
■ Learns very complicated interactions
■ We can model categorical data as embeddings

● Problems
○
○
○
● Solutions
○
○
■ If a user saw the article more than once and decided not to click, we are more confident that she didn’t click because she was not interested, so we increase the weight
○
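The weighting idea for non-clicks can be illustrated with a small function. The linear growth, the cap and every parameter value here are hypothetical choices for illustration; the talk only states that repeated impressions without a click should increase the weight.

```python
def negative_weight(impressions, base=1.0, step=0.5, cap=3.0):
    """Weight for a (user, article) pair that was shown but never clicked.

    More impressions without a click mean more confidence in the negative.
    """
    if impressions < 1:
        raise ValueError("a non-click needs at least one impression")
    return min(cap, base + step * (impressions - 1))

# Weights after 1, 2, 3 and 10 impressions without a click.
weights = [negative_weight(n) for n in (1, 2, 3, 10)]
```

Capping the weight keeps a single heavily exposed article from dominating the loss.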
