
Finding the number of characters between two words in a text in Python

How can I find the nearest distance between two words in a text, or in a large collection of text files?

For example, I want to find the nearest distance between two words, such as "is" and "are", in a text. Here is what I have:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."

def dis_words_text(text, word1, word2):
    # str.find returns the index of the first occurrence, or -1 if absent
    ind1 = text.find(word1)
    ind2 = text.find(word2)
    if -1 in (ind1, ind2):
        return "at least one of the words not in text"
    return abs(ind1 - ind2)

dis_words_text(text, "is", "are")
Output: 25

dis_words_text(text, "why", "are")
Output: "at least one of the words not in text"

It looks like the code above measures the distance between the first "is" and the first "are" rather than the nearest pair, which should be 7 characters apart. See also "Finding the position of a word in a string" and "How to find index of an exact word in a string in Python" for reference.

My questions are: 1) how can I find the closest distance between two words (the number of characters between them) when the words are repeated in the text, and 2) how can I do this quickly, since it will be applied to a large number of texts?
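To make the problem concrete, here is a small illustration (not part of the original post) of how `str.find` behaves on the sample text. It stops at the first hit and also matches inside longer words, so the offsets it returns need not belong to the standalone words at all:

```python
text = "is there a way to find the nearest distance of two words - like is and are - from each other."

first_is = text.find("is")
first_are = text.find("are")

# str.find stops at the first hit, even when that hit is inside a longer word.
print(text[first_are - 2:first_are + 5])  # nearest

# Later occurrences are reachable only by passing a start offset explicitly,
# and the next hit for "is" is again a substring match.
print(text.find("is", first_is + 1))  # 36, inside "distance"
```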

Here is a solution that finds the closest distance between two words in a text, measured in characters:

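import re

def nearest_values_twolist(list1, list2):
    # Brute-force search over every pair of positions for the smallest gap
    r1, r2 = list1[0], list2[0]
    min_val = abs(r1 - r2)
    for row1 in list1:
        for row2 in list2:
            t = abs(row1 - row2)
            if t < min_val:
                min_val = t
                r1, r2 = row1, row2
    return r1, r2

def closest_distance_words(text, w1, w2):
    # Start offsets of whole-word matches; re.escape guards against regex metacharacters
    ind1 = [m.start() for m in re.finditer(r'\b' + re.escape(w1) + r'\b', text)]
    ind2 = [m.start() for m in re.finditer(r'\b' + re.escape(w2) + r'\b', text)]
    if not ind1 or not ind2:
        return None  # at least one of the words is not in the text
    i1, i2 = nearest_values_twolist(ind1, ind2)
    return abs(i2 - i1)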

Test:

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
closest_distance_words(text, "is", "are")

Output: 7
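Since speed matters for a large number of texts, the pairwise comparison of every "is" position with every "are" position could be replaced by a single merged pass. `re.finditer` yields matches from left to right, so both offset lists are already sorted. The following is only a sketch of that idea, not part of the original answer; the function name `closest_distance_sorted` is made up for illustration, and it returns `None` when either word is missing.

```python
import re

def closest_distance_sorted(text, w1, w2):
    """Smallest start-to-start distance between whole-word matches, or None."""
    ind1 = [m.start() for m in re.finditer(r"\b" + re.escape(w1) + r"\b", text)]
    ind2 = [m.start() for m in re.finditer(r"\b" + re.escape(w2) + r"\b", text)]

    best = None
    i = j = 0
    # Walk both sorted offset lists once instead of comparing every pair.
    while i < len(ind1) and j < len(ind2):
        gap = abs(ind1[i] - ind2[j])
        if best is None or gap < best:
            best = gap
        # Advance the pointer at the smaller offset; moving the other
        # one could only widen the gap.
        if ind1[i] < ind2[j]:
            i += 1
        else:
            j += 1
    return best

text = "is there a way to find the nearest distance of two words - like is and are - from each other."
print(closest_distance_sorted(text, "is", "are"))   # 7
print(closest_distance_sorted(text, "why", "are"))  # None
```

For words that each occur n and m times, this takes on the order of n + m comparisons rather than n * m. The regex scan over the text is unchanged, so the gain shows up mainly when both words are frequent.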

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this post, please credit this site's URL or the original address.
