R spell checker / tokenizer
I'm not sure whether R is the right tool for this, but here's my situation. I have a character vector full of strings.
id Words
1 'The'
2 'victory'
3 'wasgreat'
... ...
The original data had some encoding problems, and some of the strings are concatenations of several words
(i.e. 'My name is' -> 'Mynameis').
I need to leave the correct words alone and split the misspelled concatenations into their correct substrings.
I'm curious whether there's anything in R that handles this type of problem. I think there are several programs in Python that would handle this much better, but my Python skills are substantially weaker (bordering on non-existent). However, I'd be willing to consider it as an alternative.
Any suggestions?
The latest issue of the R Journal has an article by Hornik and Murdoch on spell checking in R. Recursion to the rescue: they apply it to the R sources themselves.
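Since the question is open to Python, here is a minimal sketch of the splitting step itself. It is not taken from the article mentioned above. It assumes you have a word list to check against: `WORDS` below is a toy stand-in, and `segment` and `fix` are hypothetical helper names. Known words are left alone, and anything else is split into the fewest dictionary words using a memoized recursion.

```python
from functools import lru_cache

# Toy dictionary: an assumption for illustration; substitute a real word list.
WORDS = {"the", "victory", "was", "great", "my", "name", "is"}


def segment(text, words=WORDS):
    """Split `text` into the fewest dictionary words; return None if impossible."""
    lower = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best split of lower[start:], as a tuple of (start, end) index pairs.
        if start == len(lower):
            return ()
        candidates = []
        for end in range(start + 1, len(lower) + 1):
            if lower[start:end] in words:
                rest = best(end)
                if rest is not None:
                    candidates.append(((start, end),) + rest)
        return min(candidates, key=len) if candidates else None

    spans = best(0)
    if spans is None:
        return None
    # Slice the original string so the input's capitalization is preserved.
    return [text[a:b] for a, b in spans]


def fix(token, words=WORDS):
    """Leave known words alone, split concatenations, keep unknown tokens as-is."""
    if token.lower() in words:
        return [token]
    return segment(token, words) or [token]


for token in ["The", "victory", "wasgreat", "Mynameis"]:
    print(token, "->", fix(token))
```

Preferring the fewest pieces is only a heuristic. With a large dictionary many strings have several valid splits, so in practice you would probably want to rank candidates by word frequency instead.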