Finding the number of characters between two words in a text in Python

How can I find the nearest distance between two words in a text, or across a large collection of text files?

For example, I want to find the nearest distance between two words, such as "is" and "are", in a text. Here is what I have:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."

def dis_words_text(text, word1, word2):
    # str.find returns the index of the first substring match, or -1 if absent
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    if -1 in (ind1, ind2):
        return "at least one of the words not in text"
    return abs(ind1 - ind2)

dis_words_text(text, "is", "are")
Output: 29

dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"

It looks like the above code considers only the first match of each word rather than the nearest pair, and str.find also matches substrings (for example, the "are" inside "nearest"). The nearest distance should be 7 characters, measured between the start indices of "is" and "are". Please see also Finding the position of a word in a string and How to find index of an exact word in a string in Python as references. My question is: 1) how can I find the closest distance between two words (the number of characters between them) if the words are repeated in the text, and 2) how can this be done quickly, since it will be applied to a large number of texts?

Here is a solution that finds the closest distance between two words in a text, measured in characters between their start positions:

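import re

def nearest_values_twolist(list1, list2):
    # Brute-force comparison of every pair of indices; returns the closest pair,
    # or None if either list is empty.
    best = None
    min_val = float("inf")
    for row1 in list1:
        for row2 in list2:
            t = abs(row1 - row2)
            if t < min_val:
                min_val = t
                best = (row1, row2)
    return best

def closest_distance_words(text, w1, w2):
    # Start index of every whole-word match; re.escape protects regex metacharacters.
    ind1 = [m.start() for m in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    ind2 = [m.start() for m in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    pair = nearest_values_twolist(ind1, ind2)
    if pair is None:
        return None  # at least one of the words is not in the text
    i1, i2 = pair
    return abs(i2 - i1)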

Test:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
closest_distance_words(text, "is", "are")

Output: 7
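
Since the question also asks about speed, here is a minimal sketch of a faster variant. It relies on the fact that re.finditer yields matches in left-to-right order, so both index lists are already sorted and can be walked with two pointers in linear time, rather than comparing every pair. The names word_starts and closest_distance_linear are hypothetical and not part of the original answer, and returning None for a missing word is an assumption about the desired behaviour.

```python
import re

def word_starts(text, word):
    """Start index of every whole-word occurrence of `word` in `text` (hypothetical helper)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.start() for m in pattern.finditer(text)]

def closest_distance_linear(text, w1, w2):
    """Smallest gap between start indices of w1 and w2, or None if either word is missing."""
    starts1 = word_starts(text, w1)
    starts2 = word_starts(text, w2)
    if not starts1 or not starts2:
        return None
    i = j = 0
    best = abs(starts1[0] - starts2[0])
    while i < len(starts1) and j < len(starts2):
        best = min(best, abs(starts1[i] - starts2[j]))
        # Advance the pointer at the smaller index; advancing the other one
        # could only widen the gap.
        if starts1[i] < starts2[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(closest_distance_linear(text, "is", "are"))   # 7
print(closest_distance_linear(text, "why", "are"))  # None
```

The regex scan is still linear in the length of each text, so this mainly helps when both words occur many times. For a large collection of texts, compiling the two patterns once outside the loop would likely save some further overhead.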
