I want to make my texts readable for BERT embeddings by inserting the [CLS] and [SEP] tokens. I have tokenized my text, so I have a list with every word and punctuation mark as an element. However, I don't know how to add an element after every occurrence of '.' in that list.
Does anyone know how to do this? Or is there an existing tool that prepares text in a BERT-readable form?
I think this answers your question:
https://github.com/google-research/bert#tokenization
As mentioned there, you can see how they have done it in run_classifier.py and extract_features.py.
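For reference, here is a minimal sketch of the token layout those scripts build, as far as I can tell from reading them: [CLS] first, then the first sequence, a [SEP], and optionally a second sequence followed by another [SEP], with segment ids of 0 for the first part and 1 for the second. The function name build_bert_input is my own, not something from the BERT repo, and in the real scripts the tokens are WordPiece tokens produced by BERT's tokenizer rather than plain words.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Hypothetical helper: wrap one or two token lists in [CLS]/[SEP]."""
    # [CLS] opens the input, [SEP] closes the first sequence; both get segment 0.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        # The second sequence and its closing [SEP] get segment 1.
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_bert_input(["hello", "."], ["goodbye", "."])
print(tokens)
print(segment_ids)
```

Note that this puts a [SEP] at the end of each sequence, not after every period.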
However, you can also accomplish what you want by using regular expressions (regex). In Python, applied to the text as a single string, this would look something like:
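import re

# Match a literal period; inside a character class the dot needs no escaping.
regex = r"[.]"
test_str = "Hello . BERT . Goodbye ."
# Replace each period with the period followed by a [SEP] token.
subst = ". [SEP]"
result = re.sub(regex, subst, test_str)
if result:
    print(result)  # Hello . [SEP] BERT . [SEP] Goodbye . [SEP]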
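Since you already have a list of tokens rather than a string, you could also skip the regex and build a new list directly. This is just a sketch: the function name add_special_tokens and its parameters are made up for illustration, and it assumes that every '.' token marks a sentence boundary, which will not hold for abbreviations or decimals that were split into separate tokens.

```python
def add_special_tokens(tokens, sentence_end=".", cls="[CLS]", sep="[SEP]"):
    """Hypothetical helper: prepend [CLS] and add [SEP] after each period."""
    result = [cls]
    for token in tokens:
        result.append(token)
        # Insert the separator right after every sentence-ending token.
        if token == sentence_end:
            result.append(sep)
    return result


tokens = ["Hello", ".", "BERT", ".", "Goodbye", "."]
print(add_special_tokens(tokens))
# ['[CLS]', 'Hello', '.', '[SEP]', 'BERT', '.', '[SEP]', 'Goodbye', '.', '[SEP]']
```

Building a new list avoids modifying the list while iterating over it, which is where in-place inserts usually go wrong.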