Fast way to search for a set of words in a list of words in Python

I have a fixed set of 20 words and a large file of 20,000 records, where each record contains a string. For each string, I want to find out whether any word from the fixed set is present and, if so, the index of that word.

Example:

import nltk

s1 = set(["barely", "rarely", "hardly"])  # actual size: 20

l2 = ["i hardly visit", "i do not visit", "i can barely talk"]  # actual size: 20,000

def get_token_index(token,indx):
    if token in s1:
        return indx
    else:
        return -1


def find_word(text):
    tokens=nltk.word_tokenize(text)
    indexlist=[]
    for i in range(0,len(tokens)):
        indexlist.append(i)
    word_indx=map(get_token_index,tokens,indexlist)    
    for indx in word_indx:
        if indx !=-1:
            pass  # Do something with tokens[indx]

I want to know if there is a better/faster way to do it.

You can use a list comprehension with a double for loop:

s1=set(["barely","rarely", "hardly"])

l2 = ["i hardly visit", "i do not visit", "i can barely talk"]

locations = [c for c, b in enumerate(l2) for a in s1 if a in b]

In this example, the output would be:

[0, 2]

However, if you would like a way of accessing the indexes at which a certain word appears:

from collections import defaultdict

d = defaultdict(list)

for word in s1:
   for index, sentence in enumerate(l2):
       if word in sentence:
           d[word].append(index)
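
One caveat on the `a in b` and `word in sentence` tests above: on strings, `in` is a substring test, so a word can also match inside a longer word. A minimal sketch of a whole-word variant follows, assuming whitespace splitting is an acceptable stand-in for a real tokenizer. The names `substring_hit`, `token_hit` and the `isdisjoint` approach are illustrative and not part of the original answer.

```python
s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]

# Substring test: also matches inside longer words.
substring_hit = "hard" in "i hardly visit"        # True
# Token test: only matches a complete whitespace-separated token.
token_hit = "hard" in "i hardly visit".split()    # False

# Indices of sentences that share at least one whole token with s1.
# Each index appears once, even if several fixed words occur in a sentence.
locations = [i for i, sentence in enumerate(l2)
             if not s1.isdisjoint(sentence.split())]
```

Note that `str.split()` leaves punctuation attached to tokens, so real text may still need a proper tokenizer.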

This suggestion only removes some glaring inefficiencies; it won't affect the overall complexity of your solution:

def find_word(text, s1=s1): # micro-optimization, make s1 local
    tokens = nltk.word_tokenize(text)
    for i, word in enumerate(tokens):
        if word in s1:
            pass  # Do something with `word` and `i`

Essentially, you are slowing things down by using map when all you really need is a condition inside your loop body. So just get rid of get_token_index; it is over-engineered.
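
For reference, here is a self-contained sketch of the same single-pass idea that runs without NLTK. It assumes `str.split` as a stand-in tokenizer, and the `words` and `tokenize` parameters are hypothetical additions for illustration. Passing `nltk.word_tokenize` as `tokenize` should bring it back in line with the code above.

```python
def find_word(text, words, tokenize=str.split):
    """Yield (index, token) for each token of `text` that is in `words`."""
    # `tokenize` defaults to str.split as a simplification (assumption);
    # the original code uses nltk.word_tokenize instead.
    for i, token in enumerate(tokenize(text)):
        if token in words:  # set membership, O(1) on average
            yield i, token


s1 = {"barely", "rarely", "hardly"}
l2 = ["i hardly visit", "i do not visit", "i can barely talk"]

# One list of (token index, token) pairs per record.
hits = [list(find_word(text, s1)) for text in l2]
```

Writing it as a generator leaves the "do something" step to the caller and avoids building the intermediate index list.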

This should work:

for string in l2:
    words = string.split(' ')
    for s in s1:
        if s in words:
            # words.index(s) gives the first occurrence only
            print("%s at index %d" % (s, words.index(s)))

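The easiest and slightly more efficient way would be to use a Python generator expression:

index_tuple = list(l2.index(i) for i in s1 if i in l2)

This compares each word against whole elements of l2, so it applies when l2 is a list of individual words rather than sentences. You can time it to check how well it performs for your requirements.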
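
Timing the candidates can be sketched with the standard `timeit` module. The scaled-up `l2`, the two function names, and the repeat count below are assumptions made for illustration; they are not the asker's data, and the measured numbers will vary by machine.

```python
import timeit

s1 = {"barely", "rarely", "hardly"}
# Hypothetical scale-up of the three sample records (assumption).
l2 = ["i hardly visit", "i do not visit", "i can barely talk"] * 1000


def substring_scan():
    # Substring test of every fixed word against every record.
    return [c for c, b in enumerate(l2) for a in s1 if a in b]


def token_scan():
    # Split each record once, then test for any shared whole token.
    return [c for c, b in enumerate(l2) if not s1.isdisjoint(b.split())]


# Total seconds for 20 runs of each candidate.
t_substring = timeit.timeit(substring_scan, number=20)
t_token = timeit.timeit(token_scan, number=20)
print("substring: %.4fs  token: %.4fs" % (t_substring, t_token))
```

Measure on a sample of the real records, since the relative cost depends on record length and on how many fixed words there are.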
