
Python - Remove all words that contain other words in a list

I have a list populated with words from a dictionary. I want to find a way to remove all words that contain a shorter root word, considering only root words that appear at the beginning of the target word.

For example, the word "rodeo" would be removed from the list because it contains the English-valid word "rode." "Typewriter" would be removed because it contains the English-valid word "type." However, the word "snicker" is still valid even if it contains the word "nick" because "nick" is in the middle and not at the beginning of the word.

I was thinking something like this:

 for line in wordlist:
        if line.find(...) --

but I want that "if" statement to then run through every single word in the list, checking to see if it is found and, if so, removing it from the list so that only root words remain. Do I have to create a copy of wordlist to traverse?
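Something like this is closer to what I mean, iterating over a copy so that removing items does not skip anything (the sample list here is made up):

wordlist = ["rode", "rodeo", "type", "typewriter", "snicker", "nick"]  # made-up sample

for word in wordlist[:]:                      # iterate over a copy...
    for other in wordlist:
        if word != other and word.startswith(other):
            wordlist.remove(word)             # ...so removing from the original list is safe
            break

print(wordlist)   # ['rode', 'type', 'snicker', 'nick']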

So you have two lists: the list of words you want to check and possibly remove, and a list of valid words. If you like, you can use the same list for both purposes, but I'll assume you have two lists.

For speed, you should turn your list of valid words into a set. Then you can very quickly check whether any particular word is in that set. Then, take each word and check whether any of its prefixes exists in the set of valid words. Since "a" and "I" are valid words in English, will you remove all valid words starting with 'a', or will you have a rule that sets a minimum length for the prefix?
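For instance, a minimal sketch of that idea, with a hypothetical min_len rule and a tiny stand-in for the valid-word set (the full version that reads /usr/share/dict/words follows below):

valid = {"rode", "type", "nick"}          # stand-in for the real valid-word set

def has_valid_prefix(word, min_len=3):
    # check every proper prefix of word that is at least min_len letters long
    return any(word[:i] in valid for i in range(min_len, len(word)))

print(has_valid_prefix("rodeo"))     # True:  "rode" is a valid prefix
print(has_valid_prefix("snicker"))   # False: "nick" is not at the beginning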

I am using the file /usr/share/dict/words from my Ubuntu install. This file has all sorts of odd things in it; for example, it seems to contain every letter by itself as a word. Thus "k" is in there, "q", "z", etc. None of these are words as far as I know, but they are probably in there for some technical reason. Anyway, I decided to simply exclude anything shorter than three letters from my valid words list.

Here is what I came up with:

# build the valid-word set from /usr/share/dict/words (3-letter words and longer)
wfile = "/usr/share/dict/words"
valid = set(line.strip() for line in open(wfile) if len(line.strip()) >= 3)

lst = ["ark", "booze", "kite", "live", "rodeo"]

def subwords(word):
    # yield every proper prefix of word, from longest to shortest
    for i in range(len(word) - 1, 0, -1):
        w = word[:i]
        yield w

newlst = []
for word in lst:
    # uncomment these for debugging to make sure it works
    # print("subwords", [w for w in subwords(word)])
    # print("valid subwords", [w for w in subwords(word) if w in valid])
    if not any(w in valid for w in subwords(word)):
        newlst.append(word)

print(newlst)

If you are a fan of one-liners, you could do away with the for loop and use a list comprehension:

newlst = [word for word in lst if not any(w in valid for w in subwords(word))]

I think that's more terse than it should be, and I like being able to put in the print statements to debug.

Hmm, come to think of it, it's not too terse if you just add another function:

def keep(word):
    return not any(w in valid for w in subwords(word))

newlst = [word for word in lst if keep(word)]

Python can be easy to read and understand if you make functions like this, and give them good names.

I'm assuming that you only have one list from which you want to remove any elements that have prefixes in that same list.

# Important assumption here... wordlist is sorted

base = wordlist[0]                    # consider the first word in the list
for word in wordlist:                 # loop through the entire list, checking whether
    if not word.startswith(base):     #   the word we're considering starts with the base
        print(base)                   # if not... we have a new base: print the current
        base = word                   #   one and move on to this new one
    # else: word starts with base, so don't output it
    #   and go on to the next item in the list
print(base)                           # finish by printing the last base

EDIT: Added some comments to make the logic more obvious.

I find jkerian's answer to be the best (assuming only one list) and I would like to explain why.

Here is my version of the code (as a function):

wordlist = ["a","arc","arcane","apple","car","carpenter","cat","zebra"];

def root_words(wordlist):
    result = []
    base = wordlist[0]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            base=word
    result.append(base)
    return result;

print root_words(wordlist);

As long as the word list is sorted (you could do this in the function if you wanted to), this will get the result in a single pass. This is because when you sort the list, every word made up of another word in the list comes directly after that root word. For example, anything that falls between "arc" and "arcane" in your particular list will also be eliminated because of the root word "arc".
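For example, a quick check using the root_words function above (the unsorted sample list is my own):

unsorted_words = ["carpenter", "cat", "arc", "car", "apple", "arcane", "zebra", "a"]
print(root_words(sorted(unsorted_words)))   # should print ['a', 'car', 'cat', 'zebra']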

You should use the built-in lambda function for this. I think it'll make your life a lot easier.

words = ['rode', 'nick'] # this is the list of all the words that you have.
                         # I'm using 'rode' and 'nick' as they're in your example
listOfWordsToTry = ['rodeo', 'snicker']
def validate(w):
    for word in words:
        if w.startswith(word):
            return False
    return True

wordsThatDontStartWithValidEnglishWords = \
    list(filter(lambda x: validate(x), listOfWordsToTry))

This should work for your purposes, unless I misunderstand your question.
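With the sample data above, a quick check should leave only the word that does not start with one of the listed words:

print(wordsThatDontStartWithValidEnglishWords)   # expected: ['snicker']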

Hope this helps.

I wrote an answer that assumes two lists, the list to be pruned and the list of valid words. In the discussion around my answer, I commented that maybe a trie solution would be good.

What the heck, I went ahead and wrote it.

You can read about a trie here:

http://en.wikipedia.org/wiki/Trie

For my Python solution, I basically used dictionaries. A key is a sequence of symbols, and each symbol goes into a dict, with another Trie instance as the data. A second dictionary stores "terminal" symbols, which mark the end of a "word" in the Trie. For this example, the "words" are actually words, but in principle the words could be any sequence of hashable Python objects.

The Wikipedia example shows a trie where the keys are letters, but keys can be more than a single letter; they can be a sequence of multiple letters. For simplicity, my code uses only a single symbol at a time as a key.

If you add both the word "cat" and the word "catch" to the trie, then there will be nodes for 'c', 'a', and 't' (and also the second 'c' in "catch"). At the node level for 'a', the dictionary of "terminals" will have 't' in it (thus completing the coding for "cat"), and likewise at the deeper node level of the second 'c' the dictionary of terminals will have 'h' in it (completing "catch"). So, adding "catch" after "cat" just means a couple of additional nodes and one more entry in the terminals dictionary. The trie structure makes a very efficient way to store and index a really large list of words.

def _pad(n):
    return " " * n

class Trie(object):
    def __init__(self):
        self.t = {}  # dict mapping symbols to sub-tries
        self.w = {}  # dict listing terminal symbols at this level

    def add(self, word):
        if 0 == len(word):
            return
        cur = self
        for ch in word[:-1]: # add all symbols but terminal
            if ch not in cur.t:
                cur.t[ch] = Trie()
            cur = cur.t[ch]
        ch = word[-1]
        cur.w[ch] = True  # add terminal

    def prefix_match(self, word):
        if 0 == len(word):
            return False
        cur = self
        for ch in word[:-1]: # check all symbols but last one
            # If you check the last one, you are not checking a prefix,
            # you are checking whether the whole word is in the trie.
            if ch in cur.w:
                return True
            if ch not in cur.t:
                return False
            cur = cur.t[ch]  # walk down the trie to next level
        return False

    def debug_str(self, nest, s=None):
        "print trie in a convenient nested format"
        lst = []
        s_term = "".join(ch for ch in self.w)
        if 0 == nest:
            lst.append(object.__str__(self))
            lst.append("--top--: " + s_term)
        else:
            tup = (_pad(nest), s, s_term)
            lst.append("%s%s: %s" % tup)
        for ch, d in self.t.items():
            lst.append(d.debug_str(nest+1, ch))
        return "\n".join(lst)

    def __str__(self):
        return self.debug_str(0)



t = Trie()


# Build the valid-word list from /usr/share/dict/words, which has every letter
# of the alphabet as a "word"!  Only take 2-letter words and longer.

wfile = "/usr/share/dict/words"
for line in open(wfile):
    word = line.strip()
    if len(word) >= 2:
        t.add(word)

# add valid 1-letter English words
t.add("a")
t.add("I")



lst = ["ark", "booze", "kite", "live", "rodeo"]
# "ark" starts with "a"
# "booze" starts with "boo"
# "kite" starts with "kit"
# "live" is good: "l", "li", "liv" are not words
# "rodeo" starts with "rode"

newlst = [w for w in lst if not t.prefix_match(w)]

print(newlst)  # prints: ['live']
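As a small check of the "cat"/"catch" description above (the t2 name and sample words are just for illustration):

t2 = Trie()
t2.add("cat")
t2.add("catch")
print(t2.prefix_match("catch"))   # True: "cat" is stored and is a proper prefix of "catch"
print(t2.prefix_match("cat"))     # False: no stored word is a proper prefix of "cat"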

I only had one list - and I wanted to remove any word from it that was a prefix of another.

Here is a solution that should run in O(N log N) time and O(M) space, where M is the size of the returned list. The runtime is dominated by the sorting; a short worked example follows the notes below.

l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
  • If the list is sorted, then the item at index N is a prefix if it begins the item at index N+1.

  • At the end it appends the last item of the original sorted list, since by definition it is not a prefix. Handling it last also allows us to iterate over an arbitrary number of indexes without going out of range.
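Here is the promised quick example, using the words from the question as a hypothetical your_list (note that, matching this answer's goal, it removes the prefixes and keeps the longer words):

your_list = ["rodeo", "rode", "typewriter", "type", "snicker"]
l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
print(removed_prefixes)   # ['rodeo', 'snicker', 'typewriter']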

If you have the banned list hardcoded in another list:

banned = tuple(banned_prefixes)
removed_prefixes = [i for i in your_list if not i.startswith(banned)]

This relies on the fact that startswith accepts a tuple. It probably runs in something close to N * M, where N is the number of elements in your_list and M is the number of elements in banned. Python could conceivably be doing some smart things to make it a bit quicker. If you are like the OP and want to disregard case, you will need .lower() calls in places.
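For instance, with some hypothetical sample data for banned_prefixes and your_list:

banned_prefixes = ["rode", "type"]
your_list = ["rodeo", "snicker", "typewriter"]

banned = tuple(banned_prefixes)
print([w for w in your_list if not w.startswith(banned)])   # ['snicker']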

I don't want to provide an exact solution, but I think there are two key functions in Python that will help you greatly here.

The first, jkerian mentioned: string.startswith() http://docs.python.org/library/stdtypes.html#str.startswith

The second: filter() http://docs.python.org/library/functions.html#filter

With filter, you could write a conditional function that will check to see if a word is the base of another word and return true if so.

For each word in the list, you would need to iterate over all of the other words and evaluate the conditional using filter, which could return the proper subset of root words.
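A rough sketch of that approach (my own illustration; the helper name has_root and the sample data are made up):

words = ["rode", "rodeo", "type", "typewriter", "snicker", "nick"]

def has_root(word, wordlist):
    # True if some *other* word in the list is a prefix of this word
    return any(word != other and word.startswith(other) for other in wordlist)

root_words_only = list(filter(lambda w: not has_root(w, words), words))
print(root_words_only)   # ['rode', 'type', 'snicker', 'nick']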
