For each word in the text file, extract surrounding 5 words
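For every occurrence of a given word, I need to show its context by displaying roughly 5 words before and 5 words after the occurrence.
Example call, for the word 'stranger' in the text file movie.txt: occurs('stranger', 'movie.txt')
My code so far: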
def occurs(word, filename):
    infile = open(filename,'r')
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ''.join(lines)
    words = wordsString.split()
    print(words)
    for i in range(len(words)):
        if words[i].find(word):
            #stuck here
I would suggest slicing words based on i:
print(words[i-5:i+6])
(This goes where your comment is.)
Or, to print it the way your example shows:
print("...", " ".join(words[i-5:i+6]), "...")
To account for matches within the first 5 words, do the following:
if i > 5:
    print("...", " ".join(words[i-5:i+6]), "...")
else:
    print("...", " ".join(words[0:i+6]), "...")
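The reason the guard is needed is that a negative start index counts from the end of the list, so the slice silently returns the wrong words (often none at all). As a minimal sketch, the same clipping can be done with max(); the helper name context_window and its distance parameter are made up for illustration and are not part of the original answer:

```python
def context_window(words, i, distance=5):
    """Return the words within `distance` positions of index i, clipped at the start."""
    start = max(0, i - distance)  # a negative start would count from the end of the list
    return words[start:i + distance + 1]

words = "a b c d e f g h i j k l".split()
print(context_window(words, 2))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(words[2 - 5:2 + 6])        # [] because the start index -3 resolves to position 9
```

No clipping is needed at the end of the list, because a slice whose stop index runs past the end is simply truncated.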
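Also, find does not do what you think it does. If find() does not find the string, it returns -1, which evaluates to True when used in an if statement. It also returns 0, which evaluates to False, when the match is at the very start of the string. Try: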
if word in words[i].lower():
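To make the difference concrete, here is a small sketch of what find() returns compared with the in operator. The helper name matches is hypothetical and only wraps the membership test suggested above:

```python
def matches(word, token):
    """Hypothetical helper: case-insensitive substring test."""
    return word.lower() in token.lower()

word = "stranger"
print("strangers".find(word))      # 0: found at index 0, yet falsy in an if statement
print("movie".find(word))          # -1: not found, yet truthy in an if statement
print(bool(0), bool(-1))           # False True
print(matches(word, "Strangers"))  # True
print(matches(word, "movie"))      # False
```

Note that a substring test also matches longer words such as 'strangers'; whether that is wanted depends on the use case.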
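This retrieves the index of every occurrence of the word in words, which is the list of all words in the file. It then uses slicing to get the matching word together with the 5 words before and after it.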
def occurs(word, filename):
    # Read the whole file and split on any whitespace, including newlines.
    with open(filename, 'r') as infile:
        words = infile.read().split()
    # Indices of every word that contains the search term (case-insensitive).
    matches = [i for i, w in enumerate(words) if word.lower() in w.lower()]
    for m in matches:
        # Clamp the start so that a negative index does not wrap around.
        l = " ".join(words[max(0, m-5):m+6])
        print(f"... {l} ...")
Given
import more_itertools as mit
s = """\
But we did not answer him, for he was a stranger and we were not used to,
strangers and were shy of them.
We were simple folk, in our village, and when a stranger was a pleasant person we were soon friends.
"""
word, distance = "stranger", 5
words = s.splitlines()[0].split()
Demo
neighbors = list(mit.adjacent(lambda x: x == word, words, distance))
" ".join(word for bool_, word in neighbors if bool_)
# 'him, for he was a stranger and we were not used'
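Details
more_itertools.adjacent returns an iterable of tuples, i.e. (bool, item) pairs. The boolean is True for words that satisfy the predicate and for words within distance of such a word. Example: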
>>> neighbors
[(False, 'But'),
...
(True, 'a'),
(True, 'stranger'),
(True, 'and'),
...
(False, 'to,')]
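Filtering the results on that boolean keeps only the target word and the neighbors within the given distance of it.
Note: more_itertools is a third-party library. Install it via pip install more_itertools.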
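If installing a third-party package is not an option, the (bool, item) pairs described above can be approximated with the standard library alone. The following is only a rough sketch of that idea, not the library's implementation; the function name mark_adjacent is made up for illustration:

```python
def mark_adjacent(predicate, items, distance=1):
    """Return (bool, item) pairs; the bool is True when the item satisfies
    `predicate` or lies within `distance` positions of an item that does."""
    items = list(items)
    near = set()
    for i, item in enumerate(items):
        if predicate(item):
            # Mark the match and its neighbors, clipped to the list bounds.
            near.update(range(max(0, i - distance), min(len(items), i + distance + 1)))
    return [(i in near, item) for i, item in enumerate(items)]

line = "But we did not answer him, for he was a stranger and we were not used to,"
pairs = mark_adjacent(lambda x: x == "stranger", line.split(), 5)
print(" ".join(w for flag, w in pairs if flag))
# him, for he was a stranger and we were not used
```

Unlike the library function, this sketch materializes the whole input in a list first, so it is not suitable for very large or unbounded iterables.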
Whenever I see a rolling view of a file, I think of collections.deque:
import collections

def occurs(needle, fname):
    with open(fname) as f:
        lines = f.readlines()
    words = iter(''.join(lines).split())
    view = collections.deque(maxlen=11)
    # prime the deque
    for _ in range(10):  # leaves an 11-length deque with 10 elements
        view.append(next(words, ""))
    for w in words:
        view.append(w)
        if view[5] == needle:
            yield list(view.copy())
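Note that this approach deliberately does not handle the edge cases where needle is among the first 5 or the last 5 words of the file. The question is ambiguous about whether matching the third word should give the first through ninth words or something else.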
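If matches near the edges should be reported with a clipped window, one possible reading of the question, the rolling view can be padded with sentinels on both sides. This is a sketch under that assumption only: the name occurs_with_edges is hypothetical, and it takes an iterable of words rather than a file name to keep the example self-contained.

```python
import collections
import itertools

def occurs_with_edges(needle, words, distance=5):
    """Yield the window around each exact match, clipped at both ends of `words`."""
    pad = [None] * distance  # sentinels, removed again before yielding
    view = collections.deque(maxlen=2 * distance + 1)
    for w in itertools.chain(pad, words, pad):
        view.append(w)
        # Once the deque is full, its middle element is the word being tested.
        if len(view) == view.maxlen and view[distance] == needle:
            yield [x for x in view if x is not None]

print(list(occurs_with_edges("stranger", ["a", "b", "stranger", "c"])))
# [['a', 'b', 'stranger', 'c']]
```

With this padding, a match on the third word yields the first through eighth words, since only two real words precede it.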