Regex: Efficiently matching words that are the same except for last character

How can I efficiently match words that are the same except for the last letter?

data = ['ades', 'adey', 'adhere', 'adherent', 'admin', 'admit', 'adverb', 'advert', 'adipocere', 'adipocerous', 'adjoining', 'adjoint', 'adjudicate', 'adjudication', 'adjunct']

The actual data is longer and my implementation below takes too long to run:

temp_data = data 
count = 0
matches = {}
while count < len(data):
    for word in data:
        if word[:-1] == data[count][:-1] and data.index(word) != count:
            matches[data[count]] = word
            temp_data.remove(data[count])
            temp_data.remove(word)
    count += 1
print(matches)

This correctly prints:

{'ades': 'adey', 'advert': 'adverb', 'admin': 'admit'}

I'm new to Python, so any suggestions would be appreciated :)

You're comparing every word against every word, and on top of that data.index(word) does a linear search through the list on every comparison just to make sure you're not comparing a word against itself. That makes the whole thing O(n³). You can get it down to O(n²) by keeping track of the index in the inner loop:

for j, word in enumerate(data):
    if word[:-1] == data[count][:-1] and j != count:
        matches[data[count]] = word
        temp_data.remove(data[count])
        temp_data.remove(word)

and then get it to O(n) by grouping the words by everything except their last letter:

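from collections import defaultdict

# Key: the word without its last letter; value: every word sharing that key
groups = defaultdict(list)

for word in data:
    groups[word[:-1]].append(word)

print(list(groups.values()))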
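To get from those groups back to the kind of mapping printed in the question, one option is to keep only the groups that contain more than one word. This is a minimal sketch: the helper name pair_by_stem is made up for illustration, and mapping the first word of each group to a list of the others is my assumption about the desired shape.

```python
from collections import defaultdict

def pair_by_stem(words):
    """Map the first word of each group to the other words sharing its stem."""
    groups = defaultdict(list)
    for word in words:
        # The "stem" here is simply the word without its last character
        groups[word[:-1]].append(word)
    # Drop groups of one: those words have no partner
    return {g[0]: g[1:] for g in groups.values() if len(g) > 1}

data = ['ades', 'adey', 'adhere', 'adherent', 'admin', 'admit', 'adverb', 'advert']
print(pair_by_stem(data))
# {'ades': ['adey'], 'admin': ['admit'], 'adverb': ['advert']}
```

Using a list as the value means three or more words with the same stem are all kept instead of overwriting each other.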

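which can also be done using groupby, provided words that share everything but the last letter are adjacent in the list (sorting with the same key function guarantees this):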

import itertools

def init(word):
    return word[:-1]

print([list(words) for key, words in itertools.groupby(data, init)])
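One thing to watch with groupby is that it only merges neighbouring items. A plain alphabetical sort can put a longer word such as 'admins' between 'admin' and 'admit', which would split that group. A sketch that sorts by the same key first (the function name stem and the sample list are just for illustration):

```python
import itertools

def stem(word):
    # Everything except the last character
    return word[:-1]

data = ['admit', 'admins', 'ades', 'advert', 'admin', 'adey', 'adverb']

# Sorting by stem puts equal stems next to each other, which groupby requires
ordered = sorted(data, key=stem)
groups = [list(words) for _, words in itertools.groupby(ordered, key=stem)]
pairs = [g for g in groups if len(g) > 1]
print(pairs)
# [['ades', 'adey'], ['admit', 'admin'], ['advert', 'adverb']]
```

The sort makes this O(n log n) rather than O(n), so the defaultdict version is still the cheaper choice for unsorted input.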

Assuming the list is sorted so that matching words are adjacent (otherwise sort it first), and that no more than two words share the same stem, you can get the result with a dictionary comprehension over zip:

>>> data = ['ades', 'adey', 'adhere', 'adherent', 'admin', 'admit', 'adverb', 'advert', 'adipocere', 'adipocerous', 'adjoining', 'adjoint', 'adjudicate', 'adjudication', 'adjunct']

# data.sort()  --> if data is not already sorted
>>> {i: j for i, j in zip(data, data[1:]) if i[:-1]==j[:-1]}
{'admin': 'admit', 'adverb': 'advert', 'ades': 'adey'}
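If the data might contain something like 'admins' alongside 'admin' and 'admit', a plain sort would place it between the two and the neighbour comparison would miss the pair. A hedged variant that sorts on the stem first (adjacent_pairs is a hypothetical helper name, not from the answer above):

```python
def adjacent_pairs(words):
    """Pair up neighbouring words that differ only in their last character."""
    # Sort by (stem, word) so that words with the same stem end up side by side
    ordered = sorted(words, key=lambda w: (w[:-1], w))
    return {a: b for a, b in zip(ordered, ordered[1:]) if a[:-1] == b[:-1]}

data = ['admit', 'admins', 'admin', 'adey', 'ades', 'adjunct']
print(adjacent_pairs(data))
# {'ades': 'adey', 'admin': 'admit'}
```

When three or more words share a stem this yields a chain (first to second, second to third) rather than one group, so the grouping approach is the safer choice in that case.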

PS: I do not think regex is the right tool for achieving the desired result.
