
Input format for BERT fine-tuning on corpus

I want to fine-tune BERT on a specific language domain using the following git repo:

https://github.com/cedrickchee/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/README.md

Regarding the input format, it says:

The scripts in this folder expect a single file as input, consisting of untokenized text, with one sentence per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a next sentence objective in which the model must predict whether two sequences of text are contiguous text from the same document or not, and to avoid making the task too easy, the split point between the sequences is always at the end of a sentence. The linebreaks in the file are therefore necessary to mark the points where the text can be split.

What do they mean by "documents" here? From my understanding, the .txt file used for fine-tuning just contains a lot of domain-specific text with one sentence per line. Just to be sure, is this repository the correct approach if I want to fine-tune BERT on a specific language domain?

Thank you for your help!

The script you are talking about is the right one for continuing pre-training. The original BERT uses next-sentence prediction as an auxiliary objective. When it is given a pair of sentences (separated by the [SEP] token), the embedding of the [CLS] token (the very first token) is used as input to a classifier that predicts whether the sentences are adjacent in a coherent text or not.

This is what the empty lines are for: a document is any coherent piece of text (an article, a report, a chapter), and sentences on either side of a document boundary cannot be adjacent.
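To make the role of the blank lines concrete, here is a minimal sketch of how such a file could be split into documents and turned into next-sentence pairs. It is a simplified illustration, not the repository's actual code: the sample text and the helper names `read_documents` and `make_nsp_pairs` are made up for this example, and it pairs single sentences rather than building longer segments.

```python
import random

# Hypothetical sample corpus: one sentence per line, blank line between documents.
SAMPLE_CORPUS = """\
The patient was admitted with chest pain.
An ECG was performed on arrival.
No abnormalities were found.

The contract begins on the first of March.
Either party may terminate it with 30 days notice.
"""


def read_documents(text):
    """Split corpus text into documents (lists of sentences) at blank lines."""
    documents, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:  # a blank line closes the current document
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) examples.

    Positive pairs are consecutive sentences from the same document.
    Negative pairs take the second sentence from a different document,
    so this sketch assumes the corpus has at least two documents.
    """
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], True))
            else:
                other_docs = [d for j, d in enumerate(documents) if j != doc_idx]
                other = rng.choice(other_docs)
                pairs.append((doc[i], rng.choice(other), False))
    return pairs


documents = read_documents(SAMPLE_CORPUS)
pairs = make_nsp_pairs(documents, random.Random(0))
for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "|", sentence_b)
```

Because positive pairs are only ever drawn from within one document, the last sentence of one document and the first sentence of the next are never labelled as adjacent. If your domain text has no natural document structure, treating each article, report, or similar unit as one document is a reasonable choice.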

However, the contribution of the next-sentence objective is debatable. For instance, RoBERTa considers it superfluous, uses only the masked-language-modeling objective, and still gets better representation quality than the original BERT.
