Get the 10 words before a specific word in a file in Python

Question

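I have a file that contains sentences, one per line. I need to get the 10 words before a specific word (case-insensitive), but some of them may be on the previous line. For example: if I want the word ball and it is in the fourth position of the second line, then I need the 3 words before it on that line and 7 words from the previous line, or from even earlier lines. I also cannot figure out how to get exactly 10 words from the previous lines. This is what I have so far: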


for line in file:
            # reading each word        
            for words in line.split():
                y = 'myword'.lower
                if y = words.lower:
                    index = words.index(y)
                    i = 0, z = 0
                    for words in line[i]:
                        sentence += words
                        if str(len(sentence.split()) != 10:
                        i--
                    
                    print(sentence)

Answer 1

Converting the whole file into a list of words works:

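target = 'even'  # word to search for, compared case-insensitively

# Read the whole file as one flat list of words, so line breaks do not matter.
with open('text.txt', 'r') as f:
    words_list = f.read().split()

ret = str()
for i, word in enumerate(words_list):
    if word.lower() == target.lower():
        # Clamp at 0 so a match near the start does not wrap around to the end.
        start_index = max(0, i - 10)
        # Up to 10 preceding words, followed by the matched word itself.
        ret = ' '.join(words_list[start_index : i + 1])

print(ret)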
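
As a rough sketch of the same flattened-list idea, the lookup can be wrapped in a function that ignores case, reports every occurrence, and never reaches past the start of the text. The name words_before, its parameters, and the sample text are made up for this illustration and are not part of the answer above.

```python
def words_before(text, target, n=10):
    """Return, for each occurrence of target, the up-to-n words preceding it."""
    # Hypothetical helper, for illustration only.
    words = text.split()  # split() with no argument also splits on newlines
    target = target.lower()
    results = []
    for i, word in enumerate(words):
        if word.lower() == target:
            # max(0, ...) keeps the slice from wrapping around to the end
            results.append(words[max(0, i - n):i])
    return results


# "Ball" is the fourth word of the second line, as in the question.
sample = ("the quick brown fox jumps over the lazy dog\n"
          "and then the Ball bounced")
print(words_before(sample, "ball"))
```

With this sample, the result holds 3 words from the second line and 7 from the first, which is the situation described in the question.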

Answer 2

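Your code probably does not work because lower() is a method, not an attribute. Also, consider defining your word outside the loop so that it is not re-created on every iteration.

If your code still does not work, I wrote the following, which should work: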

import re

# s is assumed to hold the full text of the file as one string
myword = "myword"
sentence = ""

# split on any whitespace, so words on different lines stay separate
split_sentence = s.split()

for index, word in enumerate(split_sentence):
    # remove special characters
    if re.sub("[.!?,'@#$%^&*()\n]", "", word).lower() == myword:
        # make sure the start index is in bounds
        start_index = max(0, index - 10)
        # collect the (up to) 10 words that precede the match
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)

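This should build a sentence from the 10 words leading up to the word you are looking for, punctuation included. If you only want the words without the punctuation, this should do the trick: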

import re

# s is assumed to hold the full text of the file as one string
myword = "myword"
sentence = ""

# remove special characters, then split on any whitespace (including newlines)
split_sentence = re.sub("[.!?,'@#$%^&*()]", "", s).split()

for index, word in enumerate(split_sentence):
    if word.lower() == myword:
        # make sure the start index is in bounds
        start_index = max(0, index - 10)
        # collect the (up to) 10 words that precede the match
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)
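
Another way to deal with punctuation, sketched here under the assumption that a "word" is a run of letters, digits, or apostrophes, is to extract the words with re.findall rather than deleting the unwanted characters. The function name words_before_regex and the sample string are illustrative only and do not come from the answer above.

```python
import re


def words_before_regex(text, target, n=10):
    # Hypothetical helper: treats runs of letters, digits and apostrophes as words.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    target = target.lower()
    for index, token in enumerate(tokens):
        if token.lower() == target:
            # first occurrence only; the slice is clamped at the start of the text
            return tokens[max(0, index - n):index]
    return []


s = "Hello, world! This is (really) a test.\nFind myword, please."
print(words_before_regex(s, "myword"))
```

Because fewer than 10 words precede the match in this sample, the function simply returns all of them.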

Answer 3

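I don't know what your file looks like, so I used a string to simulate it. My version takes the 10 preceding words, or all of the preceding words if there are fewer than 10, and gives you a final list containing those words for every line that contains the word.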

def get_10_words(file, word_to_find):
    file_10_words_list = []
    cont = 0
    for line in file.lower().split('\n'):
        new_line = line.split(' ')
        # pad the front with 10 empty strings so the lookup below never wraps around
        for c in range(10):
            new_line.insert(0, '')
        try:
            word_index = new_line.index(word_to_find.lower())
        except ValueError:
            print(f"Line {cont + 1} hasn't got {word_to_find.title()}")
        else:
            words_before_list = [new_line[element + word_index] for element in range(-10, 0)]
            # drop the padding again
            words_before_list = [element for element in words_before_list if element != '']
            file_10_words_list.append(words_before_list)
        cont += 1
    return file_10_words_list

if __name__ == '__main__':
    words = get_10_words('This is the line one This is the line one This is the line one Haha\n'
                         'This is the line two This is the line two This is the line two How\n'
                         'This is the line three Haha', 'Haha')

    print(words)

If anything in my code is unclear, feel free to ask me here!
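
The function above looks at each line on its own. If the preceding words should also be taken from earlier lines, as the question asks, one possible variation is to keep a sliding window of the most recent words while reading line by line. This is a sketch built on collections.deque, not the approach of the answer above; the function name words_before_streaming and the sample text are hypothetical.

```python
from collections import deque


def words_before_streaming(lines, word_to_find, n=10):
    # Hypothetical helper: 'lines' is any iterable of strings, e.g. an open file.
    window = deque(maxlen=n)  # holds at most the n most recently seen words
    target = word_to_find.lower()
    found = []
    for line in lines:
        for word in line.split():
            if word.lower() == target:
                # snapshot of the words seen just before this occurrence
                found.append(list(window))
            window.append(word)
    return found


text = ("This is the line one This is the line one Haha\n"
        "This is the line two How\n"
        "This is the line three Haha")
for before in words_before_streaming(text.splitlines(), "Haha"):
    print(before)
```

Since the window carries over from one line to the next, the second match in this sample collects words from both the second and the third line. Passing an open file object instead of text.splitlines() should work the same way without loading the whole file into memory.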

Answer 4

Since you tagged nlp, here is a suggestion using spacy.

# pip install spacy
# python -m spacy download en_core_web_sm
import spacy

with open("file.txt", "r") as f:
    text = f.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

searchedWord = "StackOverflow"

# token positions of every occurrence of the searched word
occu = [i for i, word in enumerate(doc) if word.text == searchedWord]

out = []
for i in occu:
    # if the token right before the match is punctuation or whitespace, step over it
    if i > 0 and (doc[i - 1].is_punct or doc[i - 1].is_space):
        i -= 1
    # clamp at 0 so a match near the start of the text does not wrap around
    w = [token.text for token in doc[max(0, i - 4):i]]
    out.append(w)

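Note: in this example, we target the 4 words before the searched word (while skipping punctuation and whitespace). The result is a nested list, to handle the case where the word occurs more than once in the text file. We are using the English model, but many other languages are of course available; check the list here.

Output: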

print(out)

#[['A', 'question', 'from', 'Whichman'], ['An', 'answer', 'from', 'Timeless']]

Input/text file used:

4 answers

Answer 1: 0, 2023-01-27 14:08:20
Answer 2: 0, 2023-01-27 14:10:48
Answer 3: 0, accepted, 2023-01-27 14:44:48
Answer 4: 0, 2023-01-27 14:56:01
