
For each word in the text file, extract surrounding 5 words

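For each occurrence of a certain word, I need to show its context by displaying roughly the 5 words before and the 5 words after that occurrence.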

Example: for the word 'stranger' in the text file movie.txt, the call is occurs('stranger', 'movie.txt').

My code so far:

def occurs(word, filename):

    infile = open(filename,'r')
    lines = infile.read().splitlines()
    infile.close()

    wordsString = ''.join(lines)
    words = wordsString.split()
    print(words)

    for i in range(len(words)):
        if words[i].find(word):
            #stuck here

I would suggest slicing words around i:

print(words[i-5:i+6])

(This goes where your comment is.)

Or, to print it the way your example shows:

print("...", " ".join(words[i-5:i+6]), "...")

To account for the word being among the first 5 words, do this:

if i > 5:
    print("...", " ".join(words[i-5:i+6]), "...")
else:
    print("...", " ".join(words[0:i+6]), "...")
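The reason for the special case is that a negative slice start counts from the end of the list, so words[i-5:i+6] can come back empty for a match near the beginning. A minimal sketch of the same idea, clamping the start with max() instead of branching (the helper name context_window and its width parameter are only illustrative, not part of the question):

```python
def context_window(words, i, width=5):
    """Return up to `width` words on each side of words[i], inclusive."""
    start = max(0, i - width)          # clamp so the start never goes negative
    return words[start:i + width + 1]  # slicing past the end is safe


words = "a b c d e f g h i j k l".split()
print(words[2 - 5:2 + 6])        # negative start: prints []
print(context_window(words, 2))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```

No clamp is needed at the other end, because a slice that runs past the end of the list simply stops at the last element.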

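Also, find doesn't do what you think it does. find() returns -1 when it does not find the string, and -1 evaluates to True in an if statement; it returns 0, which evaluates to False, when the match starts at the very first character. Try: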

if word in words[i].lower():
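To make the difference concrete, here is a small sketch comparing the truthiness of find() with the in operator. The three candidate strings are made up for illustration:

```python
word = "stranger"
for candidate in ("stranger", "a stranger", "friend"):
    position = candidate.find(word)
    # bool(position) is what a bare `if candidate.find(word):` would test
    print(f"{candidate!r}: find={position}, "
          f"truthy={bool(position)}, in={word in candidate}")
```

The first string matches at index 0, so the bare find() test is False even though the word is there; the last string does not match at all, yet -1 makes the test True.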

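This retrieves the index of every occurrence of the word in words, which is a list of all the words in the file. Slicing is then used to get a list containing the matching word plus the 5 words before and after it.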

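def occurs(word, filename):
    # split() breaks on any whitespace, newlines included, so the words at a
    # line break are not fused together the way ''.join(lines) would fuse them
    with open(filename, 'r') as infile:
        words = infile.read().split()

    # Indices of every word that contains the search term (case-insensitive)
    target = word.lower()
    matches = [i for i, w in enumerate(words) if target in w.lower()]

    for m in matches:
        # Clamp the start so a match within the first 5 words does not
        # produce a negative index
        l = " ".join(words[max(0, m - 5):m + 6])
        print(f"... {l} ...")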
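Note that a substring test also matches longer words such as 'strangers'. If only whole-word matches are wanted, one option is to strip surrounding punctuation and compare for equality. This is a rough sketch under that assumption; the function name find_contexts, its width parameter, and the sample sentence are hypothetical, and it takes the text as a string rather than a filename so it is easy to try out:

```python
import string


def find_contexts(text, word, width=5):
    """Return the context string for each whole-word match of `word`."""
    words = text.split()
    target = word.lower()
    contexts = []
    for i, w in enumerate(words):
        # Drop leading/trailing punctuation so "stranger," still matches
        if w.strip(string.punctuation).lower() == target:
            start = max(0, i - width)
            contexts.append(" ".join(words[start:i + width + 1]))
    return contexts


sample = "He was a stranger, and we were shy of strangers."
print(find_contexts(sample, "stranger", width=2))
# ['was a stranger, and we']
```

Returning a list instead of printing keeps the matching logic separate from the output format.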

Consider the more_itertools.adjacent tool.

Given

import more_itertools as mit


s = """\
But we did not answer him, for he was a stranger and we were not used to, strangers and were shy of them.
We were simple folk, in our village, and when a stranger was a pleasant person we were soon friends.
"""

word, distance = "stranger", 5
words = s.splitlines()[0].split()

Demo

neighbors = list(mit.adjacent(lambda x: x == word, words, distance))

" ".join(word for bool_, word in neighbors if bool_)
# 'him, for he was a stranger and we were not used'

Details

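more_itertools.adjacent returns an iterable of tuples, i.e. (bool, item) pairs. The boolean is True for words that satisfy the predicate and for words within distance of such a word. Example: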

>>> neighbors
[(False, 'But'),
 ...
 (True, 'a'),
 (True, 'stranger'),
 (True, 'and'),
 ...
 (False, 'to,')]

Filtering on the boolean then keeps only the target word and the neighboring words that lie within the given distance of it.

Note: more_itertools is a third-party library. Install it with pip install more_itertools.
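When the text contains several matches, joining every flagged word into one string runs the separate contexts together. One possible way to keep them apart is to group consecutive flagged pairs with itertools.groupby. The sketch below uses only the standard library; flag_neighbors is a hypothetical stand-in that produces the same (bool, item) shape described above, not the library's own code, and split_contexts is likewise an illustrative name:

```python
from itertools import groupby


def flag_neighbors(words, target, distance):
    """Pair each word with True if it lies within `distance` of a match."""
    hits = [i for i, w in enumerate(words) if w == target]
    return [(any(abs(i - h) <= distance for h in hits), w)
            for i, w in enumerate(words)]


def split_contexts(flagged):
    """Join each consecutive run of flagged words into one string."""
    return [" ".join(w for _, w in run)
            for is_near, run in groupby(flagged, key=lambda pair: pair[0])
            if is_near]


words = "a stranger came by and later on another stranger left town".split()
print(split_contexts(flag_neighbors(words, "stranger", 1)))
# ['a stranger came', 'another stranger left']
```

Two matches whose windows touch or overlap end up in a single run, so they are reported as one longer context rather than two.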

Whenever I see a rolling view over a file, I think of collections.deque:

import collections

def occurs(needle, fname):
    with open(fname) as f:
        lines = f.readlines()

    words = iter(''.join(lines).split())

    view = collections.deque(maxlen=11)
    # prime the deque
    for _ in range(10):  # leaves an 11-length deque with 10 elements
        view.append(next(words, ""))
    for w in words:
        view.append(w)
        if view[5] == needle:
            yield list(view.copy())

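Note that this approach intentionally does not handle any edge cases where needle is among the first 5 or the last 5 words of the file. The question is ambiguous about whether a match on the third word should give the first through ninth words or something else.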
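If matches near either end of the file should be reported with a shortened context, one possible variation is to pad the word stream with a sentinel on both sides and drop the sentinel before yielding. This is only a sketch of that choice, not the approach above; the name occurs_padded, the width parameter, and taking a list of words instead of a filename are all assumptions made for illustration:

```python
import collections
from itertools import chain, repeat


def occurs_padded(needle, words, width=5):
    """Yield a context list for every word equal to `needle`, edges included."""
    pad = object()  # sentinel that cannot collide with a real word
    size = 2 * width + 1
    view = collections.deque(maxlen=size)
    padded = chain(repeat(pad, width), words, repeat(pad, width))
    for item in padded:
        view.append(item)
        # Once the deque is full, its middle slot is the word being tested
        if len(view) == size and view[width] == needle:
            yield [w for w in view if w is not pad]


words = "a stranger came by and later on".split()
print(list(occurs_padded("stranger", words, width=2)))
# [['a', 'stranger', 'came', 'by']]
```

With the padding in place, every real word passes through the middle slot exactly once, so the first and last words are tested like any other.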
