
Input format for BERT fine-tuning on corpus

Original post: 2020-11-26 15:19:24. Tags: python, nlp, bert-language-model, transformer

I want to fine-tune BERT on a specific language domain using the following git repo:

https://github.com/cedrickchee/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/README.md

Regarding the input format, it says:

The scripts in this folder expect a single file as input, consisting of untokenized text, with one sentence per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a next sentence objective in which the model must predict whether two sequences of text are contiguous text from the same document or not, and to avoid making the task too easy, the split point between the sequences is always at the end of a sentence. The linebreaks in the file are therefore necessary to mark the points where the text can be split.

What do they mean by "documents" here? From my understanding, the .txt file used for fine-tuning just contains a lot of domain-specific text with one sentence per line. Just to be sure, is this repository the right approach if I want to fine-tune BERT on a specific language domain?

Thank you for your help!

1 answer

The script you are referring to is the right one for continuing pre-training on your domain. The original BERT uses next-sentence prediction as an auxiliary objective: when the model is given a pair of sentences separated by the [SEP] token, the embedding of the [CLS] token (the very first token of the input) is fed to a classifier that predicts whether the two sentences are adjacent in a coherent text.
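To make the pair format concrete, here is a minimal sketch of how one sentence pair could be laid out as model input. It is not the repository's code: the function name `build_nsp_input` is hypothetical, and whitespace splitting stands in for BERT's WordPiece tokenizer.

```python
def build_nsp_input(sentence_a, sentence_b):
    """Assemble the token sequence and segment ids for one sentence pair.

    Assumption: whitespace splitting replaces BERT's WordPiece tokenizer.
    """
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    # Layout: [CLS] A ... [SEP] B ... [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_nsp_input("The cat sat .", "It purred .")
print(tokens)
print(segment_ids)
```

The classifier for the next-sentence objective would then read the output vector at position 0, where [CLS] sits.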

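This is what the empty lines are for: sentences on opposite sides of a document boundary cannot be adjacent. A "document" is simply a coherent piece of running text (for example, one article), so consecutive lines within it are genuine neighbours, while lines from different documents are not.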
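As an illustration of how the blank lines might be used, the sketch below parses a corpus in the described format and samples one training pair. It is an assumption-laden sketch, not the repository's implementation: `read_documents` and `sample_pair` are hypothetical names, the 50/50 split between real and random next sentences follows the original BERT setup, and at least two documents are required.

```python
import random


def read_documents(text):
    """Split corpus text into documents (lists of sentences) at blank lines."""
    documents, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the current document.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


def sample_pair(documents, doc_idx, sent_idx, rng):
    """Return (sentence_a, sentence_b, is_random_next) for one pair.

    Assumes the corpus holds at least two documents.
    """
    sentence_a = documents[doc_idx][sent_idx]
    has_next = sent_idx + 1 < len(documents[doc_idx])
    if has_next and rng.random() < 0.5:
        # Positive example: the true next sentence from the same document.
        return sentence_a, documents[doc_idx][sent_idx + 1], False
    # Negative example: a random sentence from a different document.
    other = rng.choice([i for i in range(len(documents)) if i != doc_idx])
    return sentence_a, rng.choice(documents[other]), True


corpus = """First sentence of document one.
Second sentence of document one.

First sentence of document two.
Second sentence of document two.
"""

documents = read_documents(corpus)
print(len(documents))
print(sample_pair(documents, 0, 0, random.Random(0)))
```

Without the blank lines, the whole file would be read as a single document, and there would be no other document to draw negative examples from.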

However, the contribution of the next-sentence objective is debatable. For instance, RoBERTa drops it as superfluous, uses only the masked-language-modeling objective, and still achieves better representation quality than the original BERT.
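For reference, the masked-language-modeling objective corrupts the input roughly as sketched below. This follows the scheme described for the original BERT (select about 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged). The function name `mask_tokens` and the toy vocabulary are hypothetical, and the code is not taken from the repository above.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # Otherwise the token is left unchanged but still predicted.
    return corrupted, labels


toy_vocab = ["cat", "dog", "sat", "ran", "the", "a"]
sentence = ["[CLS]", "the", "cat", "sat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, random.Random(0), mask_prob=0.5)
print(corrupted)
print(labels)
```

Because this objective only needs running text, it works on the same one-sentence-per-line file whether or not the next-sentence objective is kept.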




solution1 0 2020-11-30 08:12:11
