
Finding the number of characters between two words in a text in Python

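How can I find the nearest distance between two words in a text, or in a large number of text files?

For example, I want to find the nearest distance between two words (such as "is" and "are") in a text. Here is what I have: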

text = "is there a way to find the nearest distance of two words - like is and are - from each other."

def dis_words_text(text, word1, word2):
    # str.find returns only the index of the first match (-1 if absent),
    # and it matches substrings too, e.g. "are" inside "nearest"
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    if -1 in (ind1, ind2):
        return "at least one of the words not in text"
    return abs(ind1 - ind2)

dis_words_text(text, "is", "are")
Output: 29

dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"

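It looks like the code above measures the distance between the first matches of "is" and "are" rather than the nearest distance, which should be 7 characters. For reference, see also "Find the position of a word in a string" and "How to find the index of an exact word in a string in Python". My questions are: 1) if the words are repeated in the text, how can I find the nearest distance between the two words (the number of characters between them), and 2) speed also matters for large amounts of text.

Here is a solution that finds the nearest distance between two words in a text, measured in characters: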

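import re

def nearest_values_twolist(list1, list2):
    # Brute-force search over every pair of positions: O(len(list1) * len(list2))
    r1, r2 = list1[0], list2[0]
    min_val = abs(r1 - r2)
    for row1 in list1:
        for row2 in list2:
            t = abs(row1 - row2)
            if t < min_val:
                min_val = t
                r1, r2 = row1, row2
    return r1, r2

def closest_distance_words(text, w1, w2):
    # Start offsets of every whole-word match; re.escape protects against
    # regex metacharacters in the search words
    ind1 = [m.start() for m in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    ind2 = [m.start() for m in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    if not ind1 or not ind2:
        return None  # at least one of the words does not occur in the text
    i1, i2 = nearest_values_twolist(ind1, ind2)
    return abs(i2 - i1)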

Test:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
closest_distance_words(text, "is", "are")

Output: 7
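On the speed question: the nested loop above compares every position of one word with every position of the other, which can get slow when both words are frequent. Since re.finditer yields matches in increasing order, one possible alternative is to walk both position lists with two pointers, which takes time linear in the number of matches. The following is only a sketch under that assumption; the names word_starts and closest_distance_linear are made up for this example and are not part of the solution above.

```python
import re

def word_starts(text, word):
    """Start offsets of whole-word matches, in increasing order."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return [m.start() for m in pattern.finditer(text)]

def closest_distance_linear(text, w1, w2):
    """Smallest start-to-start distance between w1 and w2, or None if either is absent."""
    a = word_starts(text, w1)
    b = word_starts(text, w2)
    if not a or not b:
        return None
    i = j = 0
    best = abs(a[0] - b[0])
    while i < len(a) and j < len(b):
        best = min(best, abs(a[i] - b[j]))
        # Advance the pointer at the smaller offset: pairing that offset
        # with any later entry of the other list can only be farther away.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(closest_distance_linear(text, "is", "are"))  # 7
```

Like the solution above, this measures the distance from the start of one word to the start of the other, so "is and are" counts as 7. For many files, the same function could be applied to each file's contents in turn.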

