Finding the number of characters between two words in a text in Python
How can I find the nearest distance between two words in a text, or in a large collection of text files?
For example, I want to find the nearest distance between two words, like "is" and "are", in a text. Here is what I have:
    import numpy as np

    text = "is there a way to find the nearest distance of two words - like is and are - from each other."

    def dis_words_text(text, word1, word2):
        # str.find returns the index of the first occurrence only, or -1 if absent
        ind1 = text.find(word1)
        ind2 = text.find(word2)
        dis = "at least one of the words not in text" if -1 in (ind1, ind2) else np.abs(ind1 - ind2)
        return dis

    dis_words_text(text, "is", "are")
Output: 25
    dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"
It looks like the above code considers the distance between the first "is" and "are", not the nearest distance, which should be 7 characters. Please see also "Finding the position of a word in a string" and "How to find index of an exact word in a string in Python" as references.
My question here is: 1) how can I find the closest distance between two words (the number of characters between them) if the words are repeated in the text, and 2) how can this be done quickly, since it will be applied to a large number of texts?
Here is a solution that finds the closest distance between two words in a text, measured in characters:
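    import re

    def nearest_values_twolist(list1, list2):
        # Brute force: compare every pair of offsets and keep the closest pair
        r1 = list1[0]
        r2 = list2[0]
        min_val = abs(r1 - r2)
        for row1 in list1:
            for row2 in list2:
                t = abs(row1 - row2)
                if t < min_val:
                    min_val = t
                    r1 = row1
                    r2 = row2
        return r1, r2

    def closest_distance_words(text, w1, w2):
        # Start offsets of whole-word matches; re.escape guards against regex metacharacters
        ind1 = [w.start(0) for w in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
        ind2 = [w.start(0) for w in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
        if not ind1 or not ind2:
            return None  # at least one of the words is not in the text
        i1, i2 = nearest_values_twolist(ind1, ind2)
        return abs(i2 - i1)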
Test:
    text = "is there a way to find the nearest distance of two words - like is and are - from each other."
    closest_distance_words(text, "is", "are")
Output: 7
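On the speed part of the question: the nested loop above compares every pair of positions, which may get slow when both words occur many times. Since re.finditer yields matches in ascending order, one possible alternative is to walk the two position lists with two pointers, which visits each position once. The following is only a sketch under that assumption; the names word_starts and closest_distance_linear are made up for illustration and are not part of the original answer.

```python
import re

def word_starts(text, word):
    """Start offsets of whole-word matches, in ascending order."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return [m.start() for m in pattern.finditer(text)]

def closest_distance_linear(text, w1, w2):
    """Smallest start-to-start distance, or None if either word is missing."""
    a = word_starts(text, w1)
    b = word_starts(text, w2)
    i = j = 0
    best = None
    while i < len(a) and j < len(b):
        d = abs(a[i] - b[j])
        if best is None or d < best:
            best = d
        # Advance the pointer sitting at the smaller offset; moving the
        # other one could only increase the distance.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(closest_distance_linear(text, "is", "are"))   # 7
print(closest_distance_linear(text, "why", "are"))  # None
```

Like the answer above, this measures the distance between the starting positions of the two words. For a large collection of texts, the same function could simply be called once per text.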