
For each word in the text file, extract surrounding 5 words

For each occurrence of a certain word, I need to display the context by showing about 5 words before and after that occurrence.

For example, calling occurs('stranger', 'movie.txt') should print the context of every occurrence of the word 'stranger' in the text file movie.txt.

My code so far:

def occurs(word, filename):

    infile = open(filename,'r')
    lines = infile.read().splitlines()
    infile.close()

    wordsString = ''.join(lines)
    words = wordsString.split()
    print(words)

    for i in range(len(words)):
        if words[i].find(word):
            #stuck here

I suggest slicing words based on i:

print(words[i-5:i+6])

(This goes where your comment is.)

Or, to print it the way your example shows:

print("...", " ".join(words[i-5:i+6]), "...")

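To account for a match within the first 5 words, where i-5 would be negative and the slice would wrap around to the end of the list, do the following: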

if i > 5:
    print("...", " ".join(words[i-5:i+6]), "...")
else:
    print("...", " ".join(words[0:i+6]), "...")
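The two branches above can also be collapsed by clamping the start index at zero. A minimal sketch, assuming a hypothetical helper named context_window that is not part of the original answer:

```python
def context_window(words, i, size=5):
    """Return the words within `size` positions of index i."""
    # A negative start would wrap around to the end of the list, so clamp it.
    start = max(0, i - size)
    # Slicing past the end of a list is safe in Python, so no clamp is needed there.
    return words[start:i + size + 1]


words = "one two three four five six seven eight".split()
print(" ".join(context_window(words, 1)))
# one two three four five six seven
```

No matching clamp is needed at the end of the list, because a slice that runs past the last element simply stops there.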

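Also, find does not do what you think it does. find() returns -1 when it cannot find the string, and -1 evaluates to True in an if statement. It also returns 0 when the match starts at the first character, which evaluates to False. Try this instead: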

if word in words[i].lower():
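To see why the find() test misbehaves, it may help to print what find() returns next to its truth value. The sample tokens below are made up for illustration:

```python
word = "stranger"
results = []
for token in ["stranger", "a-stranger", "movie"]:
    index = token.find(word)
    # Record: token, index from find(), truthiness of that index, result of `in`.
    results.append((token, index, bool(index), word in token))

for row in results:
    print(row)
# ('stranger', 0, False, True)
# ('a-stranger', 2, True, True)
# ('movie', -1, True, False)
```

The truth value of find() is wrong in the first and last rows, while the in operator gives the intended answer in all three.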

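This collects the index of every occurrence of the search word in words, which is the list of all words in the file. It then uses slicing to get the matched word together with the 5 words before and after it.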

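def occurs(word, filename):
    # Read the whole file and split on any whitespace, including newlines,
    # so the last word of one line is not glued to the first word of the next.
    with open(filename, 'r') as infile:
        words = infile.read().split()

    # Indices of every word that contains the search term.
    matches = [i for i, w in enumerate(words) if word in w.lower()]

    for m in matches:
        # max() keeps a negative start from wrapping around to the end of the list.
        l = " ".join(words[max(0, m-5):m+6])
        print(f"... {l} ...")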
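Because the test above is a substring match, 'stranger' also matches 'strangers'. If only whole words should count, one option is to strip punctuation and compare for equality. This is a sketch rather than part of the original answer: find_contexts is a hypothetical name, and it takes a list of words instead of a filename so that it runs without a file.

```python
import string


def find_contexts(word, words, size=5):
    """Return one context string per whole-word, case-insensitive match."""
    target = word.lower()
    contexts = []
    for i, w in enumerate(words):
        # Strip leading/trailing punctuation so 'stranger,' still equals 'stranger'.
        if w.strip(string.punctuation).lower() == target:
            contexts.append(" ".join(words[max(0, i - size):i + size + 1]))
    return contexts


text = "He was a stranger, and we were shy of strangers."
print(find_contexts("stranger", text.split(), size=2))
# ['was a stranger, and we']
```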

Consider the more_itertools.adjacent tool.

Given

import more_itertools as mit


s = """\
But we did not answer him, for he was a stranger and we were not used to, strangers and were shy of them.
We were simple folk, in our village, and when a stranger was a pleasant person we were soon friends.
"""

word, distance = "stranger", 5
words = s.splitlines()[0].split()

Demo

neighbors = list(mit.adjacent(lambda x: x == word, words, distance))

" ".join(word for bool_, word in neighbors if bool_)
# 'him, for he was a stranger and we were not used'

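Details

more_itertools.adjacent returns an iterable of tuples, i.e. (bool, item) pairs. The boolean is True for words that satisfy the predicate and for words within distance of such a word. Example: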

>>> neighbors
[(False, 'But'),
 ...
 (True, 'a'),
 (True, 'stranger'),
 (True, 'and'),
 ...
 (False, 'to,')]

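The result is then filtered on that boolean, which keeps only the words within distance of the target word.

Note: more_itertools is a third-party library. Install it via pip install more_itertools.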
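If installing a third-party package is not an option, the same flag-and-filter idea can be approximated with the standard library. The sketch below is not the library's implementation: mark_adjacent is a hypothetical name, and it builds a full list instead of returning a lazy iterable.

```python
def mark_adjacent(predicate, items, distance=1):
    """Pair each item with True if it, or an item within `distance`, satisfies predicate."""
    items = list(items)
    flags = [False] * len(items)
    for i, item in enumerate(items):
        if predicate(item):
            # Flag the match and its neighbors, clamped to the list bounds.
            for j in range(max(0, i - distance), min(len(items), i + distance + 1)):
                flags[j] = True
    return list(zip(flags, items))


line = "But we did not answer him, for he was a stranger and we were not used to, strangers and were shy of them."
pairs = mark_adjacent(lambda x: x == "stranger", line.split(), 5)
print(" ".join(w for flag, w in pairs if flag))
# him, for he was a stranger and we were not used
```

With this flagging approach, two matches that sit close together produce one merged run of words rather than two separate windows.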

Whenever I see a rolling view over a file, I think of collections.deque.

import collections

def occurs(needle, fname):
    with open(fname) as f:
        lines = f.readlines()

    words = iter(''.join(lines).split())

    view = collections.deque(maxlen=11)
    # prime the deque
    for _ in range(10):  # leaves an 11-length deque with 10 elements
        view.append(next(words, ""))
    for w in words:
        view.append(w)
        if view[5] == needle:
            yield list(view.copy())

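Note that this approach intentionally does not handle the edge cases where needle falls within the first 5 or last 5 words of the file. The question is ambiguous about whether a match on the third word should give the first through ninth words or something else.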
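If matches near either end of the file should be reported with a shortened context, one possible reading of the ambiguous requirement is to pad the word stream with sentinels and drop them from the output. This is a sketch under that assumption: occurs_padded is a hypothetical name, and it takes an iterable of words rather than a filename.

```python
import collections
import itertools


def occurs_padded(needle, words, distance=5):
    """Yield the context of each match, including matches near either end."""
    pad = [None] * distance            # sentinels; real words are never None
    size = 2 * distance + 1
    view = collections.deque(maxlen=size)
    for w in itertools.chain(pad, words, pad):
        view.append(w)
        # Once the window is full, its middle slot is the word being tested.
        if len(view) == size and view[distance] == needle:
            yield [x for x in view if x is not None]


words = "a stranger came to our small village".split()
print(list(occurs_padded("stranger", words, distance=2)))
# [['a', 'stranger', 'came', 'to']]
```

The padding lets every word pass through the middle slot exactly once, so the first and last words are tested like any other.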


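Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.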

 