Find all occurrences of a string in an imperfect text

Question

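I am trying to find a string in a long text extracted from a PDF file, get the string's position in the text, and then return the 100 words before it and the 100 words after it. The problem is that the extraction is not perfect, so I run into issues like this:

The query string is "test text"

The text might look like this:

This is atest textwith a problem

As you can see, the word "test" is joined to the letter "a", and the word "text" is joined to the word "with".

So the only thing that works for me is __contains__, which does not return the position of the match.

Any ideas on how to find all occurrences of the query in a text like this, together with their positions?

Thank you very much.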

Answer 1

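You have not specified all of your requirements, but this solves the immediate problem. The program prints 9 and 42, the start positions of the two occurrences of test text.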

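import re

# Plain substring search: the pattern does not require word boundaries,
# so it also matches when neighbouring words are glued to the query.
filt = re.compile("test text")

for match in filt.finditer('This is atest textwith a problem. another test text'):
    print(match.start())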
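
The question also asks for the words on either side of each match. A minimal sketch of one way to get them is shown below; it is not part of the original answer. The function name occurrences_with_context and the n_words parameter are hypothetical, and the sketch assumes that "words" are simply whitespace-separated tokens, so fragments glued to the match (such as "a" and "with") are reported as words of their own.

```python
import re

def occurrences_with_context(text, query, n_words=100):
    """Yield (start, words_before, words_after) for each exact match of query.

    Hypothetical helper: words are whitespace-separated tokens.
    """
    for match in re.finditer(re.escape(query), text):
        before = text[:match.start()].split()
        after = text[match.end():].split()
        # Keep only the last n_words before and the first n_words after.
        before = before[max(len(before) - n_words, 0):]
        yield match.start(), before, after[:n_words]

sample = 'This is atest textwith a problem. another test text'
results = list(occurrences_with_context(sample, 'test text', n_words=2))
for start, before, after in results:
    print(start, before, after)
```

Splitting the whole prefix and suffix for every match is wasteful on very long texts; slicing a bounded window of characters around the match first would be one way to limit that cost.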

Answer 2

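You could use the following approach. It first splits the whole text into words and records the start index of each word.

Next, it iterates over the text looking for test text with zero or more whitespace characters between the two words. For each match it notes the start and end, then uses Python's bisect module to locate the corresponding entries in the words list and build the lists of words found before and after the match.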

import bisect
import re

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"

# (start index, word) for every word in the text, sorted by start index
words = [(w.start(), w.group(0)) for w in re.finditer(r'(\b\w+?\b)', test)]

adjacent_words = 2

for match in re.finditer(r'(test\s*?text)', test):
    start, end = match.span()

    # Positions in the words list just before and just after the match
    words_start = bisect.bisect_left(words, (start, ''))
    words_end = bisect.bisect_right(words, (end, ''))

    # Clamp the window so a match near the start of the text still works
    first = max(words_start - adjacent_words, 0)
    last = words_end + adjacent_words

    words_before = [w for i, w in words[first:words_start]]
    words_after = [w for i, w in words[words_end:last]]

    # Adjacent words as a list
    print(words_before, match.group(0), words_after)

    # Or, surrounding text as is (up to the end of the text if needed)
    snippet_end = words[last][0] if last < len(words) else len(test)
    print(test[words[first][0]:snippet_end])

    print()

So, for this example with two adjacent words, you get the following output:

['cc', 'dd'] test text ['ee', 'ff']
cc dd test text ee ff 

['ff', 'gg'] testtext ['hh', 'ii']
ff gg testtextwith hh ii
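
The pattern above is written by hand for one query. A minimal sketch of how such a pattern could be built for an arbitrary query follows. The helper name gap_tolerant_pattern is hypothetical and not part of the original answer, and the sketch assumes the only extraction damage is missing or extra whitespace between the query's words.

```python
import re

def gap_tolerant_pattern(query):
    """Compile a regex that matches the words of query with any amount of
    whitespace, including none, between them (hypothetical helper)."""
    parts = [re.escape(word) for word in query.split()]
    return re.compile(r'\s*'.join(parts))

sample_text = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"
pattern = gap_tolerant_pattern('test text')
spans = [m.span() for m in pattern.finditer(sample_text)]
print(spans)
```

re.escape keeps punctuation in the query from being interpreted as regex syntax. The compiled pattern can be used with finditer in the loop above in place of the hard-coded one.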

Answer 3

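If you want the position of the text within a string, you can use str.find(). It returns the index of the first occurrence, or -1 if the text is not found.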

>>> query_string = 'test text'
>>> text = 'This is atest textwith a problem'
>>> if query_string in text:
...     print(text.find(query_string))
...
9
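
Since str.find() reports only the first occurrence, finding every occurrence means calling it repeatedly with a start offset. A minimal sketch, assuming non-overlapping matches are wanted; the function name find_all is hypothetical:

```python
def find_all(text, query):
    """Return the start index of every non-overlapping occurrence of query."""
    if not query:
        return []  # an empty query would otherwise loop forever
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        # Resume the search just past the end of the previous match.
        start = text.find(query, start + len(query))
    return positions

text = 'This is atest textwith a problem. another test text'
positions = find_all(text, 'test text')
print(positions)  # [9, 42]
```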

Answer 4

You might look at the regex module, which allows "fuzzy" matching:

>>> import regex
>>> s='This is atest textwith a problem'
>>> regex.search(r'(?:text with){e<2}', s)
<regex.Match object; span=(14, 22), match='textwith', fuzzy_counts=(0, 0, 1)>
>>> regex.search(r'(?:test text){e<2}', s)
<regex.Match object; span=(8, 18), match='atest text', fuzzy_counts=(0, 1, 0)>

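You can match text with insertions, deletions, and substitutions. The returned match object carries the span, that is, the start and end indices of the match.

You can use regex.findall to find all potential matches, or regex.finditer if you also need their positions.

It is a good fit for what you are describing.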

4 answers

Answer 1: 4, accepted, 2016-10-12 14:44:28
Answer 2: 3, 2016-10-12 15:45:56
Answer 3: 2, 2016-10-12 14:41:27
Answer 4: 1, 2016-10-12 14:57:00
