String preprocessing

Question

I'm dealing with a list of strings that may contain extra letters beyond their original spelling, for example:

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

I want to pre-process these strings so that they are spelt correctly, to retrieve a new list:

cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday']

The length of the run of duplicated letters can vary; however, cool should obviously keep its spelling.

I'm unaware of any Python libraries that do this, and I'd prefer to avoid hard-coding it.

I've tried this: http://norvig.com/spell-correct.html, but the more words you put in the text file, the more likely it seems to suggest an incorrect spelling, so it never actually gets it right, even for words without any additional letters. For example, eel becomes teel ...

Thanks in advance.

Answer 1

If it's only repeated letters you want to strip then using the regular expression module re might help:

>>> import re
>>> re.sub(r'(.)\1+$', r'\1', 'cool')
'cool'
>>> re.sub(r'(.)\1+$', r'\1', 'coolllll')
'cool'

(It leaves 'cool' untouched.)

For leading repeated characters the correct substitution would be:

>>> re.sub(r'^(.)\1+', r'\1', 'mmmmonday')
'monday'

Of course, this fails for words that legitimately start or end with repeated letters: 'too' would become 'to' and 'eel' would become 'el'.
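As a rough sketch, the two substitutions above could be combined into one helper and applied to the whole list. The function name strip_edge_repeats is only an illustrative choice, not something from the answer:

```python
import re

def strip_edge_repeats(word):
    # Collapse a run of repeated characters at the start of the word
    word = re.sub(r'^(.)\1+', r'\1', word)
    # Collapse a run of repeated characters at the end of the word
    return re.sub(r'(.)\1+$', r'\1', word)

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
cleaned_words = [strip_edge_repeats(w) for w in words]
print(cleaned_words)
# ['why', 'hey', 'alright', 'cool', 'monday']
```

The same limitation applies: strip_edge_repeats('too') returns 'to', because the pattern cannot tell a legitimate double letter from an elongated one.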

Answer 2

If you were to download a text file of all English words to check against, this is another way that could work.

I've not tested it but you get the idea. It iterates through the letters, and if the current letter matches the last one, it'll remove the letter from the word. If it narrows down those letters to 1, and there is still no valid word, it'll reset the word back to normal and continue until the next duplicate characters are found.

import urllib.request

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

# Download a word list to check candidates against
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
with urllib.request.urlopen(url) as response:
    text = response.read().decode('utf-8', errors='ignore')
word_list = set(line.strip().lower() for line in text.splitlines())

found_words = []
for word in (w.lower() for w in words):

    # Keep the word as-is if it is already a valid word
    if word in word_list:
        found_words.append(word)
        continue

    i = 1
    current_word = word
    while i < len(current_word):

        # Duplicate character: remove it and check the shortened word
        if current_word[i] == current_word[i - 1]:
            current_word = current_word[:i] + current_word[i + 1:]
            if current_word in word_list:
                found_words.append(current_word)
                break

        # Run is down to one letter with no match: restore the word
        # and continue after that run
        else:
            i += len(word) - len(current_word) + 1
            current_word = word

print(found_words)
# Intended result (depends on the contents of the word list):
# ['why', 'hey', 'alright', 'cool', 'monday']
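To try the idea without downloading anything, here is a minimal sketch of the same approach as a function that takes the vocabulary as a parameter. The name squeeze_to_known_word and the tiny vocabulary set are hypothetical stand-ins for illustration; a real run would pass in a full word list. Like the description above, it only shortens one run of repeated letters at a time.

```python
import re

def squeeze_to_known_word(word, vocabulary):
    """Return a known word obtained by shortening one run of repeated
    letters in `word`, or None if no such word is in `vocabulary`."""
    word = word.lower()
    if word in vocabulary:
        return word
    # Each match is a run of two or more identical characters
    for run in re.finditer(r'(.)\1+', word):
        start, end = run.span()
        # Shorten the run one letter at a time, down to a single letter
        for keep in range(end - start - 1, 0, -1):
            candidate = word[:start] + word[start] * keep + word[end:]
            if candidate in vocabulary:
                return candidate
    return None

# Hypothetical miniature vocabulary standing in for a full word list
vocabulary = {'why', 'hey', 'alright', 'cool', 'monday'}
words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
cleaned_words = [squeeze_to_known_word(w, vocabulary) for w in words]
print(cleaned_words)
# ['why', 'hey', 'alright', 'cool', 'monday']
```

Returning None for unmatched words makes it easy to see which inputs need another strategy, rather than silently dropping them.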

Answer 3

Well, a crude way:

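words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

res = []
for word in words:
    # Strip repeated trailing letters (the length check avoids an IndexError)
    while len(word) > 1 and word[-2] == word[-1]:
        word = word[:-1]
    # Strip repeated leading letters
    while len(word) > 1 and word[0] == word[1]:
        word = word[1:]
    res.append(word)
print(res)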

Result: ['why', 'hey', 'alright', 'cool', 'monday']

3 answers

solution1
1 2016-02-05 14:18:24

solution2
1 ACCEPTED 2016-02-05 14:25:54

solution3
0 2016-02-05 16:47:56
