I have read a lot of articles that deal with different NLP classification tasks, and I saw that most of them mention in the pre-processing section that they use replacement tokens, e.g. "We removed and replaced the URLs, emojis and punctuation with replacement tokens: <URL>, <EMOJI>, <PUNCT>."
I am quite new to this domain and I was wondering whether there is a special way to deal with this kind of token/tag. Is it necessary to use < >, or is this just a way to signal the replacement and help the classifier find a pattern?
Any help would be greatly appreciated.
From what I have seen, in the pre-processing step people replace all tokens (characters, morphemes, words) with numbers. These replacement tokens end up as numbers as well; <URL> is just a way to present them to humans.
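To make this concrete, here is a minimal sketch in Python, not taken from any particular paper. The regular expressions, the helper names (replace_special, build_vocab, encode) and the <PAD>/<UNK> entries are all assumptions made for illustration. It replaces URLs and emojis with placeholder strings and then maps every token, placeholders included, to an integer id.

```python
import re

# Assumed, deliberately simple patterns; real pipelines often use stricter ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji code point ranges (an approximation, not a complete list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def replace_special(text: str) -> str:
    """Swap every URL and emoji for a placeholder string."""
    text = URL_RE.sub(" <URL> ", text)
    text = EMOJI_RE.sub(" <EMOJI> ", text)
    return text


def build_vocab(texts):
    """Assign an integer id to each whitespace-separated token."""
    vocab = {"<PAD>": 0, "<UNK>": 1}  # hypothetical reserved entries
    for text in texts:
        for tok in replace_special(text).split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn a text into a list of ids; unseen tokens map to <UNK>."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in replace_special(text).split()]


corpus = ["check https://example.com now", "visit www.test.org now 😀"]
vocab = build_vocab(corpus)

# Two different URLs collapse to the same id sequence.
ids_a = encode("check https://example.com now", vocab)
ids_b = encode("check http://other.site/page now", vocab)
print(ids_a, ids_b)
```

In this sketch the model only ever sees the integer assigned to <URL>, so the angle brackets carry no special meaning by themselves. They are a convention that makes the placeholder unlikely to collide with a real word in the corpus; any other unique string would serve the same purpose, provided the tokenizer keeps it as a single token.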