String preprocessing

Question

The list of strings I'm working with may contain extra repetitions of letters from the original spelling, for example:

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

I want to preprocess these strings so that they are spelled correctly, giving a new list:

cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday']

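The length of the run of duplicated letters can vary, but obviously cool should keep its spelling.

I'm not aware of any Python library that does this, and I'd prefer to avoid hard-coding it.

I've tried http://norvig.com/spell-correct.html, but the more words you put in the text file, the more likely it seems to suggest an incorrect spelling, so it never actually gets it right, even without removing the extra letters. For example, eel becomes teel...

Thanks in advance.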

Answer 1

If it's only the repeated letters that you want to strip, the regular expression module re might help:

>>> import re
>>> re.sub(r'(.)\1+$', r'\1', 'cool')
'cool'
>>> re.sub(r'(.)\1+$', r'\1', 'coolllll')
'cool'

(It leaves 'cool' unchanged.)

For leading repeated characters, the correct substitution is:

>>> re.sub(r'^(.)\1+', r'\1', 'mmmmonday')
'monday'

Of course, this fails for words that legitimately start or end with a repeated letter...
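
The two substitutions above can be folded into a single helper. This is only a minimal sketch: the name `strip_edge_repeats` is made up for illustration, and using `\w` instead of `.` is an assumption that only word characters matter. The last line shows the failure case mentioned above.

```python
import re

# A run of one repeated character at the very start, or at the very end, of the string.
_EDGE_RUN = re.compile(r'^(\w)\1+|(\w)\2+$')

def strip_edge_repeats(word):
    """Collapse a leading or trailing run of a repeated character to a single character."""
    # Only one of the two groups is set, depending on which alternative matched.
    return _EDGE_RUN.sub(lambda m: m.group(1) or m.group(2), word)

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
print([strip_edge_repeats(w) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday']

# A word that legitimately ends in a doubled letter gets damaged:
print(strip_edge_repeats('tall'))
# tal
```

Without a word list to check against, a regex has no way to tell 'tall' from a stretched 'tal', so this is probably only suitable when such words are rare in the input.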

Answer 2

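If you download a text file of all English words to check against, here is another approach that could work.

I haven't tested it, but you get the idea. It iterates through the letters, and if the current letter matches the previous one, it removes that letter from the word. If the run has been narrowed down to one letter and the result is still not a valid word, it resets the word to the original and carries on until it finds the next repeated character.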

import urllib.request

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

# Download the word list once and keep it in a set for fast lookups
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
raw = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
word_list = set(i.lower() for i in raw.splitlines())

found_words = []
for word in (i.lower() for i in words):

    # Check word doesn't exist already
    if word in word_list:
        found_words.append(word)
        continue

    last_char = None
    i = 0
    current_word = word
    while i < len(current_word):

        # Check if it's a duplicate character
        if current_word[i] == last_char:
            current_word = current_word[:i] + current_word[i + 1:]

        # Reset word if no more duplicate characters
        else:
            current_word = word
            # Record the character before advancing, so the index stays in range
            last_char = current_word[i]
            i += 1

        # Word has been found
        if current_word in word_list:
            found_words.append(current_word)
            break

print(found_words)
# ['why', 'hey', 'alright', 'cool', 'monday']
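
A possible variation on the same dictionary idea, for words with more than one stretched run: shrink every run of a repeated letter to either one or two copies, and keep the first combination that appears in the word list. This is a sketch rather than part of the answer above; `candidates`, `clean`, and the tiny inline `word_list` are hypothetical stand-ins, with the set playing the role of the downloaded list.

```python
from itertools import groupby, product

def candidates(word):
    """Yield every spelling obtained by shrinking each run of a repeated letter to one or two copies."""
    options = []
    for char, run in groupby(word):
        length = len(list(run))
        # A single letter stays as is; a longer run may stand for one or two letters.
        options.append([char] if length == 1 else [char, char * 2])
    for combo in product(*options):
        yield ''.join(combo)

def clean(word, word_list):
    """Return the first candidate found in word_list, or the lowercased word if none matches."""
    word = word.lower()
    if word in word_list:
        return word
    for candidate in candidates(word):
        if candidate in word_list:
            return candidate
    return word

# Hypothetical stand-in for the full English word list.
word_list = {'why', 'hey', 'alright', 'cool', 'monday'}
words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday', 'cooooolll']
print([clean(w, word_list) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday', 'cool']
```

The number of candidates doubles with each repeated run, which should be fine for ordinary words. The first match wins, so an ambiguous input such as 'gooood' would resolve to 'god' rather than 'good' if both are in the list.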

Answer 3

Well, a crude way:

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

res = []
for word in words:
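    # Trim trailing, then leading repeats; the length guard avoids an IndexError on short words
    while len(word) > 1 and word[-2] == word[-1]:
        word = word[:-1]
    while len(word) > 1 and word[0] == word[1]:
        word = word[1:]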
    res.append(word)
print(res)

Result: ['why', 'hey', 'alright', 'cool', 'monday']
