
Using huggingface's pytorch-transformers GPT-2 for classification tasks

I want to use GPT-2 to build a text classifier. I am not really sure what head I should add after extracting features with GPT-2. For example, I have a sequence:

import pytorch_transformers as pt 
import torch
text=test.iloc[1,1]
text
'If a fire wanted fanning, it could readily be fanned with a newspaper, and as the government grew weaker, I have no doubt that leather and iron acquired durability in proportion, for, in a very short time, there was not a pair of bellows in all Rotterdam that ever stood in need of a stitch or required the assistance of a hammer.'
len(text)
74
tokenizer = pt.GPT2Tokenizer.from_pretrained('gpt2')
model = pt.GPT2Model.from_pretrained('gpt2')
zz = tokenizer.tokenize(text)
z1=torch.tensor([tokenizer.convert_tokens_to_ids(zz)])
z1
tensor([[ 1532,   257,  2046,  2227,  4336,   768,    11,   340,   714, 14704,
           307,   277,  3577,   351,   257,  7533,    11,   290,   355,   262,
          1230,  6348, 17642,    11,   314,   423,   645,  4719,   326, 11620,
           290,  6953,  9477, 26578,   287,  9823,    11,   329,    11,   287,
           257,   845,  1790,   640,    11,   612,   373,   407,   257,  5166,
           286,  8966,  1666,   287,   477, 18481,   353, 11043,   326,  1683,
          6204,   287,   761,   286,   257, 24695,   393,  2672,   262,  6829,
           286,   257, 15554,    13]])
output, hidden = model(z1)
output.shape
torch.Size([1, 74, 768])

The output of GPT-2 is n x m x 768 for me, where n is the batch size and m is the number of tokens in the sequence (for example, I can pad/truncate to 128), so I cannot just add a fully connected layer at the tail as the paper describes for a classification task. When I searched on Google, very few GPT-2 classification tasks are mentioned, so I am not sure what is correct. Should I do flatten/max pooling/average pooling before the fully connected layer, or something else?
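For concreteness, here is a rough sketch of the options I have in mind, continuing from the snippet above (the linear layer and the number of classes are just placeholders):

import torch.nn as nn

# output from GPT2Model has shape [batch, seq_len, 768]; a linear
# classifier expects [batch, 768], so the token dimension has to be
# reduced first, e.g. by pooling:
mean_pooled = output.mean(dim=1)                 # average pooling -> [1, 768]
max_pooled = output.max(dim=1)[0]                # max pooling     -> [1, 768]
flattened = output.reshape(output.size(0), -1)   # flatten -> [1, 74*768], only sensible with a fixed sequence length

num_classes = 2                                  # placeholder label count
classifier = nn.Linear(768, num_classes)
logits = classifier(mean_pooled)                 # [1, num_classes]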

" so I can not do what as the paper said for a classification task just add a fully connected layer in the tail." - This is the answer to your question .

Usually, transformers like BERT and RoBERTa have bidirectional self-attention and a [CLS] token whose representation is fed into the classifier. Since GPT-2 is left-to-right, you need to feed the classifier the hidden state of the final token of the sequence instead.
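Here is a minimal sketch of that approach, assuming pytorch_transformers as in your snippet (the class name, the number of classes, and the single unpadded sequence are just for illustration):

import torch
import torch.nn as nn
import pytorch_transformers as pt

class GPT2Classifier(nn.Module):
    """Classify a sequence from the hidden state of its final token."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.gpt2 = pt.GPT2Model.from_pretrained('gpt2')
        self.fc = nn.Linear(self.gpt2.config.n_embd, num_classes)  # 768 -> num_classes

    def forward(self, input_ids):
        outputs = self.gpt2(input_ids)
        hidden_states = outputs[0]             # [batch, seq_len, 768]
        last_token = hidden_states[:, -1, :]   # final token's representation -> [batch, 768]
        return self.fc(last_token)             # [batch, num_classes]

clf = GPT2Classifier(num_classes=2)
logits = clf(z1)   # z1 from your snippet, shape [1, 74] -> logits of shape [1, 2]

If you pad a batch on the right, take the hidden state of the last non-padding token of each sequence instead of position -1.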

PS - Can you post a link to the paper?

If you have built a text-classification model with GPT-2, please share it.
