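How can I find the closest distance between two words in a text, or across a large number of text files?
For example, I want to find the closest distance between two words (such as "is" and "are") in a text. This is what I have: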
text = "is there a way to find the nearest distance of two words - like is and are - from each other."

def dis_words_text(text, word1, word2):
    import numpy as np
    # str.find returns the index of the first occurrence, or -1 if absent
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    dis = "at least one of the words not in text" if -1 in (ind1, ind2) else np.abs(ind1 - ind2)
    return dis

dis_words_text(text, "is", "are")
Output: 25
dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"
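It looks like the code above takes the distance between the first "is" and "are" rather than the closest distance, which should be 7 characters. For reference, see also the related questions on finding the position of a word in a string and on finding the index of an exact word in a string in Python. My questions are: 1) if the words are repeated in the text, how do I find the closest distance between the two words (the number of characters between them), and 2) speed also matters for large amounts of text.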
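As a small illustration of the first point (a sketch, not part of the original post), the snippet below contrasts `str.find`, which returns a single index and matches inside longer words, with `re.finditer` using `\b` word boundaries, which yields every whole-word occurrence. The variable names are chosen here for illustration only.

```python
import re

text = ("is there a way to find the nearest distance of two words "
        "- like is and are - from each other.")

# str.find reports only the first match, and it matches substrings:
# the first "are" it sees is the one inside "nearest".
first_are = text.find("are")
print(text[first_are - 2:first_are + 5])  # the word surrounding that match

# Start offsets of every whole-word occurrence.
is_positions = [m.start() for m in re.finditer(r"\bis\b", text)]
are_positions = [m.start() for m in re.finditer(r"\bare\b", text)]
print(is_positions, are_positions)
```

Collecting all whole-word positions first is what makes a "closest pair" search possible at all; a single `find` call per word cannot see the later occurrences.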
Here is a solution that finds the closest distance between two words in a text, measured in characters:
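import re

def nearest_values_twolist(list1, list2):
    # Brute force: compare every pair of positions and keep the closest pair.
    r1 = list1[0]
    r2 = list2[0]
    min_val = abs(r1 - r2)
    for row1 in list1:
        for row2 in list2:
            t = abs(row1 - row2)
            if t < min_val:
                min_val = t
                r1 = row1
                r2 = row2
    return (r1, r2)

def closest_distance_words(text, w1, w2):
    # Start offsets of every whole-word occurrence of each word.
    ind1 = [w.start(0) for w in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    ind2 = [w.start(0) for w in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    if not ind1 or not ind2:
        return None  # at least one of the words is not in the text
    i1, i2 = nearest_values_twolist(ind1, ind2)
    return abs(i2 - i1)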
Test:
text = "is there a way to find the nearest distance of two words - like is and are - from each other."
closest_distance_words(text, "is", "are")
Output: 7
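On the speed question: comparing every pair of positions costs time proportional to the product of the two occurrence counts. Because `re.finditer` yields matches in increasing order, one possible alternative is a two-pointer sweep over the two sorted position lists, which is linear in their combined length. The sketch below is not from the original answer. The names `word_starts` and `closest_distance_linear` are hypothetical, and it assumes both words consist of word characters so that `\b` behaves as expected.

```python
import re

def word_starts(text, word):
    # Start offsets of whole-word matches, already in ascending order.
    pattern = r"\b" + re.escape(word) + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]

def closest_distance_linear(text, w1, w2):
    a = word_starts(text, w1)
    b = word_starts(text, w2)
    if not a or not b:
        return None  # at least one word does not occur
    i = j = 0
    best = abs(a[0] - b[0])
    # Advance whichever pointer sits at the smaller offset; moving the
    # larger one could only widen the gap.
    while i < len(a) and j < len(b):
        best = min(best, abs(a[i] - b[j]))
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best

text = ("is there a way to find the nearest distance of two words "
        "- like is and are - from each other.")
print(closest_distance_linear(text, "is", "are"))
print(closest_distance_linear(text, "why", "are"))
```

Like the answer above, this measures the distance between the start offsets of the two words. For very frequent words in large texts the regex scan will likely dominate the running time, so the gain comes mainly from avoiding the nested loop.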