
BERT Text Classification Tasks for Beginners

Can anyone list, in simple terms, the tasks involved in building a BERT text classifier, for someone new to CS working on their first project? Mine involves taking a list of paragraph-length humanitarian aid activity descriptions (with corresponding titles and sector codes in a CSV file) and building a classifier that can assign sector codes to the descriptions, using a separate list of sector codes and their sentence-long descriptions. For training, testing and evaluation, I'll compare the codes my classifier generates with those in the CSV file.

Any thoughts on the high-level tasks/steps involved, to help me make my project task checklist? I started a Google Colab notebook, made two CSV files, and put them in a Google Cloud Storage bucket, so I guess I have to pull the files, tokenize the data, and... then what? Ideally I'd like to stick with Google tools too.

As the comments say, I suggest you start with a blog post or tutorial. The common way to use a BERT model in TensorFlow is through tensorflow_hub. There you get two modules: a BERT preprocessor and a BERT encoder. The preprocessor prepares your data (tokenization), and the encoder transforms that data into a numerical representation. If you are trying to compute cosine similarities between two utterances, I have to say BERT was not made for that kind of use on its own. It is normal to use BERT as a step toward an objective, not as the objective itself. That is, build a model that uses BERT; but to begin with, use just BERT to understand how it works.
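If it helps, here is a minimal sketch of loading the two modules with hub.KerasLayer. The two handles below are the standard English uncased BERT preprocessor/encoder pair on TF Hub; treat them as one example of a matching pair, not the only choice:

import tensorflow_hub as hub

# The preprocessor and encoder must be a matching pair from TF Hub.
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)  # trainable=True lets you fine-tune BERT's weights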

BERT preprocessor

It has multiple keys (its output is a dict):

dict_keys(['input_mask', 'input_type_ids', 'input_word_ids'])

Respectively: input_mask marks which positions hold real tokens (1) versus padding (0); input_type_ids are segment ids that distinguish the first sentence from the second in sentence-pair tasks; and input_word_ids are the vocabulary ids of the tokens themselves.
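To see those keys yourself, run a sentence through the preprocessor loaded above (the example sentence is made up; by default each output is padded or truncated to 128 positions):

import tensorflow as tf

# One made-up aid-activity description, batched as a 1-element tensor.
sentences = tf.constant(["provide emergency food assistance to displaced families"])
inputs = preprocessor(sentences)

print(inputs.keys())                    # the three keys listed above
print(inputs["input_word_ids"][0, :8])  # token ids, starting with [CLS]
print(inputs["input_mask"][0, :8])      # 1 where a real token sits, 0 on padding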

BERT encoder

It has multiple keys (its output is a dict):

dict_keys(['default', 'encoder_outputs', 'pooled_output', 'sequence_output'])

In order: default is the same as pooled_output; encoder_outputs holds the intermediate outputs of each Transformer block; pooled_output is a single vector summarizing the whole utterance; and sequence_output has one contextual vector per token inside the utterance.
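Since your goal is assigning sector codes, pooled_output is the piece you would normally feed into a small classification head. A sketch, assuming the preprocessor and encoder layers loaded earlier, with a made-up num_classes standing in for however many sector codes you have:

import tensorflow as tf

num_classes = 20  # hypothetical: the number of distinct sector codes in your CSV

# Raw strings go in; the hub layers handle tokenization and encoding.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
outputs = encoder(encoder_inputs)

# pooled_output: one vector per description, the "context of each utterance".
x = tf.keras.layers.Dropout(0.1)(outputs["pooled_output"])
logits = tf.keras.layers.Dense(num_classes)(x)

model = tf.keras.Model(text_input, logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

After calling model.fit on your labeled descriptions, comparing the predicted codes against the sector codes in your CSV gives you exactly the train/test evaluation you described.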

Take a look at TensorFlow Hub (tfhub.dev) and search for "bert".

Also see this other question I asked.
