
R spell checker / tokenizer

I'm not sure whether R is the right tool for this, but here's my situation. I have a character vector full of strings.

id    Words
 1    'The'
 2    'victory'
 3    'wasgreat'
...   ...

The original data had some encoding problems, and some of the strings are concatenations of several words:

 (i.e. 'My name is' -> 'Mynameis').

I need to leave the correct words alone and split the misspelled concatenations into their correct substrings.

I'm curious whether there's anything in R that handles this type of problem. I think there are several programs in Python that would handle this much better, but my Python skills are substantially weaker (bordering on non-existent). However, I'd be willing to consider it as an alternative.

Any suggestions?

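The latest issue of the R Journal has an article by Hornik and Murdoch on spell checking in R; recursion to the rescue, they apply it to the R sources themselves.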
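For the splitting step itself, here is a minimal base R sketch of dictionary-based segmentation. It is only an illustration, not the method from the article: `dictionary` is a hypothetical toy word list (a real word list would take its place), and `segment_word` is a made-up helper name. A string that is already a dictionary word, or that cannot be split completely into dictionary words, is returned unchanged.

```r
# Hypothetical toy dictionary; in practice this would be a real word list.
dictionary <- c("the", "victory", "was", "great", "my", "name", "is")

# Split `s` into dictionary words, or return `s` unchanged if no split exists.
segment_word <- function(s, dict) {
  low <- tolower(s)
  n <- nchar(low)
  if (n == 0 || low %in% dict) return(s)   # correct words are left alone
  # back[j + 1] holds the start index of the last word in a valid split of
  # the first j characters (NA when no valid split is known).
  back <- rep(NA_integer_, n + 1)
  back[1] <- 0L                            # the empty prefix is trivially valid
  for (j in seq_len(n)) {
    for (i in seq_len(j)) {                # candidate last word: characters i..j
      if (!is.na(back[i]) && substr(low, i, j) %in% dict) {
        back[j + 1] <- i
        break                              # smallest i, i.e. longest last word
      }
    }
  }
  if (is.na(back[n + 1])) return(s)        # no complete segmentation found
  # Walk the back-pointers from the end to recover the pieces in order.
  pieces <- character(0)
  j <- n
  while (j > 0) {
    i <- back[j + 1]
    pieces <- c(substr(s, i, j), pieces)   # keep the original capitalisation
    j <- i - 1
  }
  pieces
}

words <- c("The", "victory", "wasgreat", "Mynameis")
unlist(lapply(words, segment_word, dict = dictionary))
# "The" "victory" "was" "great" "My" "name" "is"
```

With a large dictionary many strings have more than one valid split, so some scoring (for example by word frequency) would probably be needed to choose between them; the sketch simply takes the first split it finds.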
