Finding the number of characters between two words in a text in Python

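How can I find the nearest distance between two words in a text, or across a large number of text files?

For example, I want to find the nearest distance between two words (such as "is" and "are") in a text. Here is what I have: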

text = "is there a way to find the nearest distance of two words - like is and are - from each other."

def dis_words_text(text, word1, word2):
    # str.find returns the index of the first occurrence, or -1 if absent
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    if -1 in (ind1, ind2):
        return "at least one of the words not in text"
    return abs(ind1 - ind2)

dis_words_text(text, "is", "are")
Output: 29  # find matches the "are" inside "nearest" (index 29), not the standalone word

dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"

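It looks like the code above measures the distance between the first "is" and the first "are" rather than the nearest distance, which should be 7 characters. See also "Find the position of a word in a string" and "How to find the index of an exact word in a string in Python" for reference. My questions are: 1) how can I find the nearest distance between two words (the number of characters between them) when the words are repeated in the text, and 2) speed also matters for large amounts of text.

Here is a solution that finds the nearest distance between two words in a text, measured in characters: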

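import re

def nearest_values_twolist(list1, list2):
    # Brute force: compare every pair of positions and keep the closest pair
    r1, r2 = list1[0], list2[0]
    min_val = float("inf")
    for row1 in list1:
        for row2 in list2:
            t = abs(row1 - row2)
            if t < min_val:
                min_val = t
                r1, r2 = row1, row2
    return r1, r2

def closest_distance_words(text, w1, w2):
    # Start offsets of whole-word matches; re.escape keeps w1 and w2 literal
    ind1 = [m.start() for m in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    ind2 = [m.start() for m in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    if not ind1 or not ind2:
        return None  # at least one of the words is not in the text
    i1, i2 = nearest_values_twolist(ind1, ind2)
    return abs(i2 - i1)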

Test:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
closest_distance_words(text, "is", "are")

Output: 7
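
On the speed question, the nested loops above compare every position of one word with every position of the other, which can get slow when both words occur many times. A possible alternative, sketched here under the assumption that both words consist of ordinary word characters (so that `\b` behaves as expected), walks the two position lists with two pointers. It relies on `re.finditer` scanning left to right, so both lists are already sorted. The function name `closest_distance_sorted` is made up for this illustration and is not part of the original answer.

```python
import re

def closest_distance_sorted(text, w1, w2):
    """Smallest gap between start offsets of whole-word matches, or None."""
    # re.finditer scans left to right, so both lists come out sorted.
    starts1 = [m.start() for m in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    starts2 = [m.start() for m in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    if not starts1 or not starts2:
        return None  # at least one of the words is not in the text

    i = j = 0
    best = abs(starts1[0] - starts2[0])
    while i < len(starts1) and j < len(starts2):
        best = min(best, abs(starts1[i] - starts2[j]))
        # Advance the pointer that sits at the smaller offset: moving the
        # other one could only widen the gap to the current position.
        if starts1[i] < starts2[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(closest_distance_sorted(text, "is", "are"))   # 7
print(closest_distance_sorted(text, "why", "are"))  # None
```

After the two regex scans, the merge step visits each position once, so its cost grows with the total number of matches rather than with their product.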
