
How is BERT Layer sequence output used?

I am reading this Kaggle notebook.

In the class DisasterDetector, in build_model(), clf_output = sequence_output[:, 0, :]. A sigmoid activation is then applied to generate the model output.

The TF Hub page the BertLayer was obtained from describes the shape of sequence_output as [batch_size, max_seq_length, 768]. Why do we take only the first index along the max_seq_length dimension (index 0)? If this corresponds only to the first token of the output sequence and not the other tokens, why is it used for the binary classification task?
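For context, here is a minimal sketch of the classifier head being described, assuming a TF Hub BERT module and a max_len of 128 (both the module URL and the sequence length are illustrative choices, not necessarily what the notebook uses):

```python
import tensorflow as tf
import tensorflow_hub as hub

max_len = 128  # assumed maximum sequence length for illustration

# Assumed TF Hub BERT module; the notebook may use a different one
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
    trainable=True,
)

input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

# sequence_output has shape [batch_size, max_len, 768]
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# Keep only the hidden state at position 0 (the [CLS] token): shape [batch_size, 768]
clf_output = sequence_output[:, 0, :]

# Single sigmoid unit -> probability for the binary classification task
out = tf.keras.layers.Dense(1, activation="sigmoid")(clf_output)

model = tf.keras.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```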

The first token of the output sequence corresponds to the first token of the input, i.e. [CLS]. The output at the [CLS] position is treated as a representation of the whole input sequence, which is why it is used for classification. You can read the original BERT paper to understand this better.
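To make the indexing concrete, here is a small toy illustration (the shapes and tokens are made up, not taken from the notebook): every BERT input starts with [CLS], so position 0 along the sequence axis is the contextualized vector for that token.

```python
import numpy as np

# Toy stand-in for BERT's sequence_output: batch of 2 examples,
# max_seq_length of 5, hidden size 768 (values chosen for illustration only).
sequence_output = np.random.rand(2, 5, 768)

# Each input sequence begins with [CLS], e.g.:
# ["[CLS]", "forest", "fire", "near", "[SEP]"]
# so index 0 along the sequence axis is the [CLS] position.
cls_output = sequence_output[:, 0, :]
print(cls_output.shape)  # (2, 768) -- one 768-dim [CLS] vector per example
```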
