尊敬的 微信汇率:1円 ≈ 0.046166 元 支付宝汇率:1円 ≈ 0.046257元 [退出登录]
SlideShare a Scribd company logo
Paper Discussion #15
Object Recognition as Next Token Prediction (CVPR 2024)
© NABLAS Inc.
2
Idea
Use a pair of an image encoder and a language decoder as an (open-ended) image recognizer
which returns a list of all objects in a given image
In this case, we’ll get a sequence of tokens as output
[“so”, “fa”, “[SEP]”, “cat”, “[SEP]”, “blank”, “et”, “[SEP]”]
→ “sofa”, “cat”, “blanket” after post processing
© NABLAS Inc.
3
Problems that current open-ended image recognizer (e.g. CLIP) have
● Need to predefine a set of class descriptions
● As the set becomes larger, accuracy decreases
← Is it possible to eliminate this step?
© NABLAS Inc.
4
Straightforward way: using LLM
● With a few-shot learning, it requires good samples (& it doesn’t scale?)
● With a zero-shot learning, No explicit way to specify target classes → low accuracy
© NABLAS Inc.
5
CLIP image encoder + FC
※ Except the last 6 blocks, it is
frozen
First 6 blocks and the last block only
Pipeline in more details
Image Embeddings [IMG] “the objects in the image are”
Learnable
© NABLAS Inc.
6
Data preprocess
© NABLAS Inc.
7
Formulation: current image recognizer (e.g. ResNet, CLIP)
Average pooling (ResNet)
[cls] token or token pooling
Fully-connected layer (ResNet)
Set of embedding vectors of predefined class descriptions
Feature map (ResNet)
Set of token (image patch) vectors
Softmax
© NABLAS Inc.
8
Formulation: proposed image recognizer (in the case of each class is represented as single token)
Projection layer + LLM
Fully-connected layer (+ layer normalization)
Set of token (image patch) vectors
Softmax
© NABLAS Inc.
9
Formulation: proposed image recognizer (in the case of each class is represented as possibly multiple tokens)
© NABLAS Inc.
10
Final objective function (multiple labels with multiple tokens each)
© NABLAS Inc.
11
Customized non-causal attention mask
Causal attention mask
Proposed non-causal
attention mask
Query  Key
© NABLAS Inc.
12
One-shot sampling (or parallel sampling)
This is the first token for the first label
This is also the first token for the second label
The key to its parallelism lies in the non-causal masking
mechanism, which also avoids the repetition issue (?)
© NABLAS Inc.
13
Experiment settings
Train dataset
(1) G3M - CC3M / COCO Captions / SBU
(2) G70M - 67M from LAION-Synthetic-115M / G3M
Eval dataset
Eval splits of CC3M / COCO Captions / OpenImages V7
Input image preprocessing
● Same to CLIP image encoder
● 224 x 224 resolution
Others
● No [cls] token in CLIP image encoder
● (32K-1) tokens (text) for output
● No [eos] token (instead of it [sep] token is used)
● We shuffle labels for each image in training (?)
● The global batch size is 512
© NABLAS Inc.
14
Metric
BERTScore is used
The number of objects in a given image
The number of predicted objects in a given image
© NABLAS Inc.
15
Recall@10 is higher while Precision@10 is lower
→ What does it mean? → It generates various classes that cover gt but some doesn’t match
© NABLAS Inc.
16
First 11 blocks are more important to image recognition
© NABLAS Inc.
17
Truncating larger LM is better than using smaller LM as it is
© NABLAS Inc.
18
Proposed sampling works comparable (& beam search works worse for some reason)
© NABLAS Inc.
19
Proposed attention masking slightly contributes to the results
© NABLAS Inc.
20
For larger train dataset, LLaMA 1 works better 🤔
© NABLAS Inc.
21
vs GPT-4V Preview (gray)
© NABLAS Inc.
22
They say training on CC13M (more noisy) underperforms training on CC3M
© NABLAS Inc.
23
Removing intermediate blocks of LLM doesn’t affect score much
It works even with single block 😯

More Related Content

Similar to 社内勉強会資料_Object Recognition as Next Token Prediction

2021 04-01-dalle
2021 04-01-dalle2021 04-01-dalle
2021 04-01-dalle
JAEMINJEONG5
 
Restricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for AttributionRestricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for Attribution
taeseon ryu
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
Shien-Chun Luo
 
Machine Vision on Embedded Hardware
Machine Vision on Embedded HardwareMachine Vision on Embedded Hardware
Machine Vision on Embedded Hardware
Jash Shah
 
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
Sangwoo Mo
 
contrastive-learning2.pdf
contrastive-learning2.pdfcontrastive-learning2.pdf
contrastive-learning2.pdf
omogire
 
jefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretrainingjefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretraining
cevesom156
 
AIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdfAIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdf
ssuserb4d806
 
Close encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet CodeClose encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet Code
lbergmans
 
Close Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codeClose Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet code
lbergmans
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
Edge AI and Vision Alliance
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
Junho Cho
 
20190927 generative models_aia
20190927 generative models_aia20190927 generative models_aia
20190927 generative models_aia
Yi-Fan Liou
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
Edge AI and Vision Alliance
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
rahul_net
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiers
VasileiosMezaris
 
#6 PyData Warsaw: Deep learning for image segmentation
#6 PyData Warsaw: Deep learning for image segmentation#6 PyData Warsaw: Deep learning for image segmentation
#6 PyData Warsaw: Deep learning for image segmentation
Matthew Opala
 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNet
Amazon Web Services
 
Computer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and PythonComputer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and Python
Akash Satamkar
 
Intelligent Thumbnail Selection
Intelligent Thumbnail SelectionIntelligent Thumbnail Selection
Intelligent Thumbnail Selection
Kamil Sindi
 

Similar to 社内勉強会資料_Object Recognition as Next Token Prediction (20)

2021 04-01-dalle
2021 04-01-dalle2021 04-01-dalle
2021 04-01-dalle
 
Restricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for AttributionRestricting the Flow: Information Bottlenecks for Attribution
Restricting the Flow: Information Bottlenecks for Attribution
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
Machine Vision on Embedded Hardware
Machine Vision on Embedded HardwareMachine Vision on Embedded Hardware
Machine Vision on Embedded Hardware
 
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
 
contrastive-learning2.pdf
contrastive-learning2.pdfcontrastive-learning2.pdf
contrastive-learning2.pdf
 
jefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretrainingjefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretraining
 
AIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdfAIML4 CNN lab256 1hr (111-1).pdf
AIML4 CNN lab256 1hr (111-1).pdf
 
Close encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet CodeClose encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet Code
 
Close Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codeClose Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet code
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
 
20190927 generative models_aia
20190927 generative models_aia20190927 generative models_aia
20190927 generative models_aia
 
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
“Understanding DNN-Based Object Detectors,” a Presentation from Au-Zone Techn...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiers
 
#6 PyData Warsaw: Deep learning for image segmentation
#6 PyData Warsaw: Deep learning for image segmentation#6 PyData Warsaw: Deep learning for image segmentation
#6 PyData Warsaw: Deep learning for image segmentation
 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNet
 
Computer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and PythonComputer Vision - Real Time Face Recognition using Open CV and Python
Computer Vision - Real Time Face Recognition using Open CV and Python
 
Intelligent Thumbnail Selection
Intelligent Thumbnail SelectionIntelligent Thumbnail Selection
Intelligent Thumbnail Selection
 

More from NABLAS株式会社

社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf
社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf
社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf
NABLAS株式会社
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
社内勉強会資料_Two Papers Contribute to Faster Python.pdf
社内勉強会資料_Two Papers Contribute to Faster Python.pdf社内勉強会資料_Two Papers Contribute to Faster Python.pdf
社内勉強会資料_Two Papers Contribute to Faster Python.pdf
NABLAS株式会社
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
【NABLAS Inc.】Recruitment materials - Ver. 2024
【NABLAS Inc.】Recruitment materials - Ver. 2024【NABLAS Inc.】Recruitment materials - Ver. 2024
【NABLAS Inc.】Recruitment materials - Ver. 2024
NABLAS株式会社
 
【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .
【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .
【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .
NABLAS株式会社
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
NABLAS株式会社
 

More from NABLAS株式会社 (8)

社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf
社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf
社内勉強会資料_XTTS: a Massively Multilingual ZeroShot Text-to-Speech Model.pdf
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
社内勉強会資料_Two Papers Contribute to Faster Python.pdf
社内勉強会資料_Two Papers Contribute to Faster Python.pdf社内勉強会資料_Two Papers Contribute to Faster Python.pdf
社内勉強会資料_Two Papers Contribute to Faster Python.pdf
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
【NABLAS Inc.】Recruitment materials - Ver. 2024
【NABLAS Inc.】Recruitment materials - Ver. 2024【NABLAS Inc.】Recruitment materials - Ver. 2024
【NABLAS Inc.】Recruitment materials - Ver. 2024
 
【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .
【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .
【NABLAS株式会社】採用ピッチ資料 Ver. 2024           .
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 

Recently uploaded

MySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdfMySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdf
Ananta Patil
 
_Lufthansa Airlines MIA Terminal (1).pdf
_Lufthansa Airlines MIA Terminal (1).pdf_Lufthansa Airlines MIA Terminal (1).pdf
_Lufthansa Airlines MIA Terminal (1).pdf
rc76967005
 
PCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdfPCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdf
incitbe
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
sapna sharmap11
 
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
ThinkInnovation
 
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
PsychoTech Services
 
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance PaymentCall Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
prijesh mathew
 
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
uthkarshkumar987000
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
sapna sharmap11
 
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your DoorAhmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Russian Escorts in Delhi 9711199171 with low rate Book online
 
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering RoadshowFabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Gabi Münster
 
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
zoykygu
 
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOWAI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
arash10gamer
 
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in LucknowCall Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
hiju9823
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
nitachopra
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
jasodak99
 
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
nainasharmans346
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
Vijayabaskar Uthirapathy
 

Recently uploaded (20)

MySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdfMySQL Notes For Professionals sttudy.pdf
MySQL Notes For Professionals sttudy.pdf
 
_Lufthansa Airlines MIA Terminal (1).pdf
_Lufthansa Airlines MIA Terminal (1).pdf_Lufthansa Airlines MIA Terminal (1).pdf
_Lufthansa Airlines MIA Terminal (1).pdf
 
PCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdfPCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdf
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
 
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
 
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
Essential Skills for Family Assessment - Marital and Family Therapy and Couns...
 
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance PaymentCall Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
Call Girls Hyderabad ❤️ 7339748667 ❤️ With No Advance Payment
 
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your DoorAhmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
Ahmedabad Call Girls 7339748667 With Free Home Delivery At Your Door
 
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering RoadshowFabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
Fabric Engineering Deep Dive Keynote from Fabric Engineering Roadshow
 
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
 
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOWAI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
AI WITH THE HELP OF NAGALAND CAN WIN. DOWNLOAD NOW
 
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in LucknowCall Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
❣VIP Call Girls Chennai 💯Call Us 🔝 7737669865 🔝💃Independent Chennai Escorts S...
 
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
Hot Call Girls In Bangalore 🔥 9352988975 🔥 Real Fun With Sexual Girl Availabl...
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
machine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Mamachine learning notes by Andrew Ng and Tengyu Ma
machine learning notes by Andrew Ng and Tengyu Ma
 

社内勉強会資料_Object Recognition as Next Token Prediction

  • 1. Paper Discussion #15 Object Recognition as Next Token Prediction (CVPR 2024)
  • 2. © NABLAS Inc. 2 Idea Use a pair of an image encoder and a language decoder as an (open-ended) image recognizer which returns a list of all objects in a given image In this case, we’ll get a sequence of tokens as output [“so”, “fa”, “[SEP]”, “cat”, “[SEP]”, “blank”, “et”, “[SEP]”] → “sofa”, “cat”, “blanket” after post processing
  • 3. © NABLAS Inc. 3 Problems that current open-ended image recognizer (e.g. CLIP) have ● Need to predefine a set of class descriptions ● As the set becomes larger, accuracy decreases ← Is it possible to eliminate this step?
  • 4. © NABLAS Inc. 4 Straightforward way: using LLM ● With a few-shot learning, it requires good samples (& it doesn’t scale?) ● With a zero-shot learning, No explicit way to specify target classes → low accuracy
  • 5. © NABLAS Inc. 5 CLIP image encoder + FC ※ Except the last 6 blocks, it is frozen First 6 blocks and the last block only Pipeline in more details Image Embeddings [IMG] “the objects in the image are” Learnable
  • 7. © NABLAS Inc. 7 Formulation: current image recognizer (e.g. ResNet, CLIP) Average pooling (ResNet) [cls] token or token pooling Fully-connected layer (ResNet) Set of embedding vectors of predefined class descriptions Feature map (ResNet) Set of token (image patch) vectors Softmax
  • 8. © NABLAS Inc. 8 Formulation: proposed image recognizer (in the case of each class is represented as single token) Projection layer + LLM Fully-connected layer (+ layer normalization) Set of token (image patch) vectors Softmax
  • 9. © NABLAS Inc. 9 Formulation: proposed image recognizer (in the case of each class is represented as possibly multiple tokens)
  • 10. © NABLAS Inc. 10 Final objective function (multiple labels with multiple tokens each)
  • 11. © NABLAS Inc. 11 Customized non-causal attention mask Causal attention mask Proposed non-causal attention mask Query Key
  • 12. © NABLAS Inc. 12 One-shot sampling (or parallel sampling) This is the first token for the first label This is also the first token for the second label The key to its parallelism lies in the non-causal masking mechanism, which also avoids the repetition issue (?)
  • 13. © NABLAS Inc. 13 Experiment settings Train dataset (1) G3M - CC3M / COCO Captions / SBU (2) G70M - 67M from LAION-Synthetic-115M / G3M Eval dataset Eval splits of CC3M / COCO Captions / OpenImages V7 Input image preprocessing ● Same to CLIP image encoder ● 224 x 224 resolution Others ● No [cls] token in CLIP image encoder ● (32K-1) tokens (text) for output ● No [eos] token (instead of it [sep] token is used) ● We shuffle labels for each image in training (?) ● The global batch size is 512
  • 14. © NABLAS Inc. 14 Metric BERTScore is used The number of objects in a given image The number of predicted objects in a given image
  • 15. © NABLAS Inc. 15 Recall@10 is higher while Precision@10 is lower → What does it mean? → It generates various classes that cover gt but some doesn’t match
  • 16. © NABLAS Inc. 16 First 11 blocks are more important to image recognition
  • 17. © NABLAS Inc. 17 Truncating larger LM is better than using smaller LM as it is
  • 18. © NABLAS Inc. 18 Proposed sampling works comparable (& beam search works worse for some reason)
  • 19. © NABLAS Inc. 19 Proposed attention masking slightly contributes to the results
  • 20. © NABLAS Inc. 20 For larger train dataset, LLaMA 1 works better 🤔
  • 21. © NABLAS Inc. 21 vs GPT-4V Preview (gray)
  • 22. © NABLAS Inc. 22 They say training on CC13M (more noisy) underperforms training on CC3M
  • 23. © NABLAS Inc. 23 Removing intermediate blocks of LLM doesn’t affect score much It works even with single block 😯
  翻译: