
Get 10 words before a particular word in a file in python

I have a file which contains sentences, line by line. I need to get the 10 words before a particular word (case-insensitive), but those words can be on the previous line as well. For example: if I am looking for the word "ball" and it is the fourth word of the second line, then I need the 3 words before it on that line and 7 from the previous line (or even earlier). I can't figure out how to get exactly 10 words when they span previous lines. Here is what I have so far:
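One way to frame the requirement (a sketch of my own, not code from the answers below): split the whole file into a single word list so line boundaries stop mattering, then slice backwards from each match.

```python
def words_before(text, target, n=10):
    """Return the n words preceding each case-insensitive match of target."""
    words = text.split()  # splits on any whitespace, including newlines
    hits = []
    for i, w in enumerate(words):
        if w.lower() == target.lower():
            hits.append(words[max(0, i - n):i])
    return hits

# "ball" is the 4th word of the second line; the slice reaches back into line 1
sample = "one two three four five six seven\neight nine ten ball rolls"
print(words_before(sample, "ball"))
```

Because `str.split()` with no argument splits on newlines as well as spaces, the "previous line" case needs no special handling.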


for line in file:
    # reading each word
    for words in line.split():
        y = 'myword'.lower
        if y = words.lower:
            index = words.index(y)
            i = 0, z = 0
            for words in line[i]:
                sentence += words
                if str(len(sentence.split()) != 10:
                i--
            print(sentence)

Converting the whole file into a list of words turns out to work:

words_list = []
with open('text.txt', 'r') as f:
    words_list = f.read().split()

ret = str()
for word in words_list:
    if 'even' == word:
        start_index = words_list.index(word) - 10
        ret = ' '.join(words_list[start_index:words_list.index(word) + 1])

print(ret)
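One caveat with the snippet above (my observation, not the answerer's): `list.index` always returns the first occurrence, so if the target word appears more than once, every match maps back to the first position. `enumerate` gives the true index of each occurrence:

```python
words_list = "a b even c d even".split()

# list.index() finds only the FIRST "even", even while iterating past the second
print(words_list.index("even"))

# enumerate() yields the real position of every occurrence
positions = [i for i, w in enumerate(words_list) if w == "even"]
print(positions)
```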

Your code may not work because lower() is a method, not an attribute, so it has to be called with parentheses. Also, consider defining your target word outside the loop so it does not get created on every single iteration.

If your code still does not work, I've created the following which should work:

import re

myword = "myword"
sentence = ""

# s is your input text read from the file
split_sentence = s.split(" ")

for index, word in enumerate(split_sentence):
    # remove special characters before comparing
    if re.sub("[.!?,'@#$%^&*()\n]", "", word).lower() == myword:
        # make sure the start index stays in bounds
        start_index = max(0, index - 10)
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)

This should create a sentence with 10 words leading up to the word you're looking for, including punctuation. If you only need the words and not the punctuation, this should do the trick:

import re

myword = "myword"
sentence = ""

# remove special characters from the whole text first (s is your input text)
split_sentence = re.sub("[.!?,'@#$%^&*()\n]", "", s).split(" ")

for index, word in enumerate(split_sentence):
    if word.lower() == myword:
        # make sure the start index stays in bounds
        start_index = max(0, index - 10)
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)
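As an aside, `re.findall` can tokenize and strip the punctuation in a single pass; this is a self-contained sketch of the same idea (the sample text `s` is my own):

```python
import re

s = "The match was close. In the end, the striker kicked the winning goal with the Ball!"
myword = "ball"

# \w+ keeps only word characters, dropping punctuation and newlines in one pass
tokens = re.findall(r"\w+", s.lower())

before = []
for i, tok in enumerate(tokens):
    if tok == myword:
        start = max(0, i - 10)
        before.append(" ".join(tokens[start:i]))

print(before)
```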

I don't know what your file looks like, so I used a string to simulate it. My version takes the 10 words before the target; if there aren't 10, it takes all the words before it, and it returns a final list with the preceding words from every line that contains the target.

def get_10_words(file, word_to_find):
    file_10_words_list = []
    cont = 0
    for line in file.lower().split('\n'):
        new_line = line.split(' ')
        # pad the front so there are always 10 slots before any match
        for c in range(10):
            new_line.insert(0, '')
        try:
            word_index = new_line.index(word_to_find.lower())
        except ValueError:
            print(f"Line {cont + 1} hasn't got {word_to_find.title()}")
        else:
            words_before_list = [new_line[element + word_index] for element in range(-10, 0)]
            words_before_list = [element for element in words_before_list if element != '']
            file_10_words_list.append(words_before_list)
        cont += 1
    return file_10_words_list


if __name__ == '__main__':
    words = get_10_words('This is the line one This is the line one This is the line one Haha\n'
                         'This is the line two This is the line two This is the line two How\n'
                         'This is the line tree Haha', 'Haha')
    print(words)

If there is something not clear in my code, you can ask me here!

Here is a proposition using spaCy.

# pip install spacy
# python -m spacy download en_core_web_sm
import spacy

with open("file.txt", "r") as f:
    text = f.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

searchedWord = "StackOverflow"

occu = [i for i, word in enumerate(doc) if word.text == searchedWord]

out = []
for i in occu:
    w = []
    j = i - 1
    # walk backwards, skipping punctuation and whitespace tokens,
    # until 4 real words have been collected
    while j >= 0 and len(w) < 4:
        token = doc[j]
        if not (token.is_punct or token.is_space):
            w.insert(0, token.text)
        j -= 1
    out.append(w)

NB: In this example, we target the 4 words (while skipping punctuation and whitespace) before the searched one. The result is a nested list, to handle the case where the word occurs more than once in the text file. We're using the English model here, but of course many other languages are available; check the list of models in the spaCy documentation.
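The backwards walk that skips punctuation can be illustrated without spaCy; in this plain-Python sketch, the hand-made token list and the `string.punctuation` membership check are my own stand-ins for spaCy's tokenizer and `is_punct`:

```python
import string

tokens = ["An", "answer", ",", "from", "Timeless", ":", "StackOverflow"]
target = "StackOverflow"
n = 4

out = []
for i, tok in enumerate(tokens):
    if tok == target:
        w = []
        j = i - 1
        # walk backwards, skipping punctuation, until n real words are collected
        while j >= 0 and len(w) < n:
            if tokens[j] not in string.punctuation:
                w.insert(0, tokens[j])
            j -= 1
        out.append(w)

print(out)
```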

Output:

print(out)

#[['A', 'question', 'from', 'Whichman'], ['An', 'answer', 'from', 'Timeless']]

Input/Text-file used:

[image of the input text file]
