NLP using replacement tokens

I have read many articles on different NLP classification tasks, and most of them state in the pre-processing section that they use replacement tokens:

e.g. "We removed the URLs, emojis and punctuation and replaced them with replacement tokens: <URL>, <EMOJI>, <PUNCT>."
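For illustration, a replacement step like the one quoted above could be sketched as follows. This is a minimal sketch, not the method of any particular paper: the function name `replace_with_tokens` and the three regular expressions are assumptions, and the patterns are deliberately simplified (real pipelines usually use broader URL and emoji patterns).

```python
import re

# Hypothetical, simplified patterns chosen for this sketch.
URL_RE = r"https?://\S+|www\.\S+"
EMOJI_RE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
PUNCT_RE = r"[^\w\s]"

# One combined pattern so that replacement happens in a single pass.
# Running the punctuation rule as a separate second pass would also
# match the "<" and ">" of tokens that were already inserted.
COMBINED = re.compile(
    f"(?P<URL>{URL_RE})|(?P<EMOJI>{EMOJI_RE})|(?P<PUNCT>{PUNCT_RE})"
)


def replace_with_tokens(text: str) -> str:
    """Replace URLs, emojis and punctuation with placeholder tokens."""

    def _sub(match: re.Match) -> str:
        # lastgroup is the name of the alternative that matched.
        return f" <{match.lastgroup}> "

    # Collapse the extra whitespace introduced by the padding.
    return " ".join(COMBINED.sub(_sub, text).split())


print(replace_with_tokens("Check https://example.com now! 😀"))
# Check <URL> now <PUNCT> <EMOJI>
```

The alternatives are tried in order, so a URL is consumed as a whole before its dots and slashes can be matched as punctuation.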

I am quite new to this domain and I was wondering whether there is some special way to deal with this kind of token/tag. Is it necessary to use < >, or is this just a way to signal the replacement and to help the classifier find a pattern?

Any help would be greatly appreciated.

From what I have seen, in the pre-processing step people replace all tokens (characters, morphemes, words) with numbers. These replacement tokens are nothing but numbers as well; <URL> is just a way to present the token to humans.
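To make this concrete, here is a minimal sketch of mapping tokens to integer ids, assuming simple whitespace tokenization. The helper names `build_vocab` and `encode`, the toy corpus, and the reserved `<UNK>` entry are assumptions made for illustration only.

```python
def build_vocab(sentences):
    """Assign an integer id to every distinct whitespace-separated token."""
    vocab = {"<UNK>": 0}  # id 0 reserved for unseen tokens (a common convention)
    for sentence in sentences:
        for token in sentence.split():
            # Only assigns a new id if the token is not in the vocabulary yet.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into the list of ids its tokens map to."""
    return [vocab.get(token, vocab["<UNK>"]) for token in sentence.split()]


corpus = ["see <URL> now", "visit <URL> today"]
vocab = build_vocab(corpus)

print(vocab["<URL>"])                     # 2
print(encode("see <URL> today", vocab))   # [1, 2, 5]
```

In this sketch `<URL>` receives an id in exactly the same way as `see` or `today`; the model only ever sees the number. The angle brackets carry no special meaning here. They mainly make it unlikely that the placeholder collides with a word that really occurs in the text.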
