
NLP using replacement tokens

I have read a lot of articles that deal with different NLP classification tasks, and most of them mention in the pre-processing section that they use replacement tokens:

e.g. "We removed the URLs, emojis and punctuation and replaced them with replacement tokens: <URL>, <EMOJI>, <PUNCT>."

I am quite new to this domain and I was wondering whether there is some special way to deal with these kinds of tokens/tags. Is it necessary to use < >, or is this just a way to signal the replacement and help the classifier find a pattern?

Any help would be greatly appreciated.

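In my experience, the pre-processing step maps every token (characters, morphemes, words) to a number. The replacement tokens are just numbers as well; <URL> is simply a way to present one to humans. The angle brackets are not required: they are a convention that keeps the placeholder from colliding with an ordinary word in the text.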
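To make this concrete, here is a minimal sketch of the idea, not the method of any particular paper. The regular expressions, the helper names (replace_tokens, build_vocab, encode) and the <UNK> entry are all assumptions made for illustration; the emoji ranges in particular are rough and not exhaustive.

```python
import re

# Illustrative patterns only (assumptions, not a standard):
URL_RE = re.compile(r"https?://\S+")
# Rough emoji/symbol ranges; real emoji detection needs a fuller list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Excludes '<' and '>' so placeholders inserted earlier are left alone.
PUNCT_RE = re.compile(r"[!?.,;:]+")


def replace_tokens(text):
    """Swap URLs, emojis and punctuation for placeholders, then split on whitespace."""
    text = URL_RE.sub(" <URL> ", text)      # URLs first, so their dots survive
    text = EMOJI_RE.sub(" <EMOJI> ", text)
    text = PUNCT_RE.sub(" <PUNCT> ", text)
    return text.split()


def build_vocab(token_lists):
    """Assign each distinct token an integer ID; 0 is reserved for unknown tokens."""
    vocab = {"<UNK>": 0}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Turn a token list into the integer IDs a classifier would consume."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]


docs = ["Check https://example.com now!", "Nice \U0001F600"]
tokenized = [replace_tokens(d) for d in docs]
vocab = build_vocab(tokenized)
for tokens in tokenized:
    print(tokens, encode(tokens, vocab))
```

In this sketch <URL> ends up as just another key in the vocabulary with its own integer ID, exactly like an ordinary word. Writing it as URL_TOKEN or __url__ would work equally well, as long as the string cannot be confused with a real word in the corpus.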


 