R tm substitute words in Corpus using gsub

I have a large document corpus with more than 200 documents. As you can expect from such a large corpus, some of the words are misspelled, used in different formats, and so on. I have done the standard text processing: converting to lower case, removing punctuation, and word stemming. I am trying to substitute some words to correct spelling and standardize them before moving on to analysis. I have done more than 100 substitutions using the same syntax as below, and most of them work as expected. However, some (about 5%) do not. For example, the following substitutions seem to have only limited effect:

docs <- tm_map(docs, content_transformer(gsub), pattern = "medecin|medicil|medicin|medicinee", replacement = "medicine")
docs <- tm_map(docs, content_transformer(gsub), pattern = "eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant", replacement = "elephant")
docs <- tm_map(docs, content_transformer(gsub), pattern = "firehood|firewod|firewoo|firewoodloc|firewoog|firewoodd|firewoodd", replacement = "firewood") 

By limited effect I mean that even though some substitutions are working, some are not. For example, despite trying to replace "elephantant", "medicinee", and "firewoodd", they still exist when I create the DTM (document-term matrix).

I have no idea why this mixed effect is happening.

Also, the following line is replacing every word in the corpus with some combination of collect:

docs <- tm_map(docs, content_transformer(gsub), pattern = "colect|colleci|collectin|collectiong|collectng|colllect|", replacement = "collect")
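A likely contributor, not diagnosed in the original post, is the trailing `|` at the end of this pattern: it adds an empty alternative, and an empty pattern can match at every position of the text. The sketch below uses base `gsub` on a plain character vector rather than a `tm` corpus; the sample string and the shortened pattern are made up for illustration.

```r
# Hypothetical sample text; not taken from the original corpus.
x <- "we colect firewood"

# The pattern ends in "|", so its last alternative is empty and can match
# at positions where none of the misspellings occur.
bad <- gsub("colect|colleci|", "collect", x)

# Dropping the trailing "|" and anchoring on word boundaries restricts the
# replacement to the misspelled words themselves.
good <- gsub("\\b(colect|colleci)\\b", "collect", x)

print(bad)
print(good)
```

The same pattern string can be passed as `pattern =` in the `tm_map(docs, content_transformer(gsub), ...)` call shown above.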

Just for reference, when I substitute just a single word, I use this syntax (notice the fixed=TRUE):

docs <- tm_map(docs, content_transformer(gsub), pattern = "charcola", replacement = "charcoal", fixed=TRUE)

The one single-word substitution that fails is:

docs <- tm_map(docs, content_transformer(gsub), pattern = "dogmonkeycat", replacement = "dog monkey cat", fixed=TRUE)

The issue you have is that the alternations in your patterns are not anchored, and thus only the first one matched "wins", i.e. is used, and the rest are not considered. An unanchored alternative such as medicin also matches inside the correctly spelled medicine, which then becomes medicinee.

You should either use some "anchors" (say, word boundaries) around the alternations:

pattern = "\\b(medecin|medicil|medicin|medicinee)\\b"
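To see what the word boundaries change, here is a minimal sketch with base `gsub` on a plain character vector instead of a `tm` corpus. The sample words are assumptions chosen for illustration; the two patterns are the ones from the question and from the line above.

```r
# Hypothetical sample words: one correct spelling and three misspellings.
words <- c("medicine", "medicinee", "medecin", "medicil")

# Unanchored: "medicin" matches inside the correct word "medicine",
# so the replacement leaves an extra trailing "e".
unanchored <- gsub("medecin|medicil|medicin|medicinee", "medicine", words)

# Anchored: an alternative only matches when it is a whole word.
anchored <- gsub("\\b(medecin|medicil|medicin|medicinee)\\b", "medicine", words)

print(unanchored[1])
print(anchored)
```

In this sketch the unanchored call turns the correct "medicine" into "medicinee", which is consistent with the leftover terms reported in the DTM, while the anchored call leaves it alone and normalizes the misspellings.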

or just put the longer alternatives before the shorter ones:

pattern = "medicinee|medecin|medicil|medicin"
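The effect of ordering can be sketched as follows. Note that `perl = TRUE` is an addition here, not part of the original calls: it selects PCRE, which tries alternatives strictly left to right, so the order visibly matters. How R's default engine picks among alternatives is not assumed in this sketch.

```r
x <- "medicinee"

# Shorter alternative first: PCRE stops at "medicin" and leaves "ee" behind.
short_first <- gsub("medicin|medicinee", "medicine", x, perl = TRUE)

# Longer alternative first: the whole misspelling is consumed.
long_first <- gsub("medicinee|medicin", "medicine", x, perl = TRUE)

# Caveat: ordering alone does not protect a correctly spelled word,
# because "medicin" still matches inside "medicine".
still_changed <- gsub("medicinee|medecin|medicil|medicin", "medicine",
                      "medicine", perl = TRUE)

print(short_first)
print(long_first)
print(still_changed)
```

Because of the caveat in the last call, reordering is probably best combined with the word-boundary anchors rather than used instead of them.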

Note that you can make the pattern faster by using character classes for commonly mistyped vowels (see [ie]) and groups:

pattern = "med[ie]ci(?:n(?:ee)?|l)"
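A quick check that the compact pattern covers the same misspellings, as a sketch under two assumptions: `perl = TRUE` is passed so that the `(?:...)` non-capturing groups are interpreted by PCRE, and the pattern is wrapped in `\\b` anchors (combining it with the earlier suggestion) so that the correct spelling is left untouched.

```r
# Compact pattern from the answer, with word boundaries added as an assumption.
compact <- "\\bmed[ie]ci(?:n(?:ee)?|l)\\b"

# The four misspellings listed in the original pattern.
misspelled <- c("medecin", "medicil", "medicin", "medicinee")

fixed_words <- gsub(compact, "medicine", misspelled, perl = TRUE)
untouched   <- gsub(compact, "medicine", "medicine", perl = TRUE)

print(fixed_words)
print(untouched)
```

The character class makes the pattern slightly broader than the original list (it would also accept, for example, "medecil"), which may or may not be desirable for a given corpus.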
