
Get 10 words before a particular word in a file in Python

I have a file which contains sentences line by line. I need to get the 10 words before a particular word (case-insensitive), but some of them may be in the previous line as well. For example: if I want the word ball and it is in the fourth place of the second line, then I need the 3 words before it in that line and 7 from the previous line, or even from lines before that. I can't figure out how to get exactly 10 words when they span the previous lines. Here is what I have so far:


for line in file:
    # reading each word
    for words in line.split():
        y = 'myword'.lower
        if y = words.lower:
            index = words.index(y)
            i = 0, z = 0
            for words in line[i]:
                sentence += words
                if str(len(sentence.split()) != 10:
                i--

            print(sentence)

Converting the whole file into a list of words turns out to work:

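with open('text.txt', 'r') as f:
    words_list = f.read().split()

target = 'even'
ret = ''
for i, word in enumerate(words_list):
    # case-insensitive match, as the question asks
    if word.lower() == target:
        # clamp at 0 so a negative index does not wrap around to the end of the list
        start_index = max(0, i - 10)
        # the slice ends at i + 1, so the result includes the matched word itself
        ret = ' '.join(words_list[start_index:i + 1])
        break  # stop at the first occurrence

print(ret)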
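
If the word can occur more than once, the same word-list idea can be wrapped in a function that returns the preceding words for every match. This is only a minimal sketch: the name `words_before`, its parameter `n`, and the sample text are made up for illustration and are not part of the answer above.

```python
def words_before(words, target, n=10):
    """Return, for each case-insensitive match of target, up to n preceding words."""
    target = target.lower()
    contexts = []
    for i, word in enumerate(words):
        if word.lower() == target:
            # max(0, ...) keeps the slice from wrapping around via a negative index
            contexts.append(words[max(0, i - n):i])
    return contexts


# Hypothetical sample standing in for f.read() on the real file
text = "one two three\nfour Ball five six\nseven eight nine ten eleven twelve ball"
print(words_before(text.split(), "ball", n=3))
# [['two', 'three', 'four'], ['ten', 'eleven', 'twelve']]
```

Because `split()` with no argument splits on any whitespace, line breaks do not matter here, so the preceding words can come from earlier lines.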

Your code may not work because lower() is a method, not an attribute, so it has to be called with parentheses. Also, consider defining your word outside the loop so it does not get created on every single iteration.

If your code still does not work, I've created the following, which should work:

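import re

myword = "myword"
sentence = ""

# assumed: s holds the full text of the file (the file name is a placeholder)
with open("text.txt") as f:
    s = f.read()

# split() with no argument also splits on newlines, so words can span lines
split_sentence = s.split()

for index, word in enumerate(split_sentence):
    # remove special characters
    if re.sub("[.!?,'@#$%^&*()]", "", word).lower() == myword:
        # make sure the start index is inbounds
        start_index = max(index - 10, 0)
        # stop at index so only the words before the match are collected
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)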

This should create a sentence with the 10 words leading up to the word you're looking for, including punctuation. If you only need the words and not the punctuation, this should do the trick:

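import re

myword = "myword"
sentence = ""

# assumed: s holds the full text of the file (the file name is a placeholder)
with open("text.txt") as f:
    s = f.read()

# remove special characters; newlines are kept so that split() can still
# separate the last word of one line from the first word of the next
split_sentence = re.sub("[.!?,'@#$%^&*()]", "", s).split()

for index, word in enumerate(split_sentence):
    if word.lower() == myword:
        # make sure the start index is inbounds
        start_index = max(index - 10, 0)
        # stop at index so only the words before the match are collected
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)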
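
For a large file, the same punctuation-insensitive comparison can be done while reading line by line instead of loading everything into memory. The sketch below swaps in a different technique from the code above, a fixed-size `collections.deque` used as a sliding window; the function name `stream_words_before` and the sample lines are hypothetical.

```python
import re
from collections import deque


def stream_words_before(lines, target, n=10):
    """Yield the (up to) n words preceding each match of target, line by line."""
    target = target.lower()
    window = deque(maxlen=n)  # keeps only the n most recent words
    for line in lines:
        for word in line.split():
            # compare with surrounding punctuation stripped: "ball," matches "ball"
            if re.sub(r"^\W+|\W+$", "", word).lower() == target:
                yield list(window)
            window.append(word)


# Hypothetical sample; an open file object can be passed in the same way
sample = ["This is line one\n", "and the ball, here\n"]
print(list(stream_words_before(sample, "ball", n=3)))
# [['one', 'and', 'the']]
```

The window carries over from one line to the next, so the preceding words may come from earlier lines, and the returned words keep their original punctuation.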

I don't know what your file looks like, so I used a string to simulate it. My version takes the 10 words before the target word on each line; if there are fewer than 10, it takes all the words before it. It returns a final list holding the words found for every line that contains the word.

def get_10_words(file, word_to_find):
    file_10_words_list = []
    cont = 0
    for line in file.lower().split('\n'):
        new_line = line.split(' ')
        # pad the front with 10 empty strings so the window never runs out of bounds
        for c in range(10):
            new_line.insert(0, '')
        try:
            word_index = new_line.index(word_to_find.lower())
        except ValueError:
            print(f"Line {cont + 1} hasn't got {word_to_find.title()}")
        else:
            words_before_list = [new_line[element + word_index] for element in range(-10, 0)]
            # drop the padding again
            words_before_list = [element for element in words_before_list if element != '']
            file_10_words_list.append(words_before_list)
        cont += 1
    return file_10_words_list

if __name__ == '__main__':
    words = get_10_words('This is the line one This is the line one This is the line one Haha\n'
                         'This is the line two This is the line two This is the line two How\n'
                         'This is the line tree Haha', 'Haha')

    print(words)

If anything in my code is unclear, you can ask me here!

Based on the tags on your question, here is a proposition with spaCy.

#pip install spacy
#python -m spacy download en_core_web_sm
import spacy

with open("file.txt", "r") as f:
    text = f.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

searchedWord = "StackOverflow"

# token positions of every occurrence of the searched word
occu = [i for i, word in enumerate(doc) if word.text == searchedWord]

out = []
for i in occu:
    w = []
    # walk backwards from the match, skipping punctuation and whitespace tokens
    for j in range(i - 1, -1, -1):
        token = doc[j]
        if token.is_punct or token.is_space:
            continue
        w.append(token.text)
        if len(w) == 4:
            break
    out.append(w[::-1])  # restore reading order

NB: In this example, we target the 4 words (skipping punctuation and whitespace) before the searched one. The result is a nested list, to handle the case where the word occurs more than once in the text file. We're using the English model, but of course many other languages are available; check the list of spaCy models.

Output:

print(out)

#[['A', 'question', 'from', 'Whichman'], ['An', 'answer', 'from', 'Timeless']]

Input/Text-file used: (image not available)

