Get 10 words before a particular word in a file in Python
I have a file which contains sentences, line by line. I need to get the 10 words before a particular word (case-insensitive), but they can be on the previous line as well. For example: if I want the word ball and it is in the fourth position of the second line, then I need the 3 words before it on that line and 7 from the previous line, or even earlier. I can't figure out how to get exactly 10 words when some of them fall on previous lines. Here is what I have so far:
for line in file:
    # reading each word
    for words in line.split():
        y = 'myword'.lower
        if y = words.lower:
            index = words.index(y)
            i = 0, z = 0
            for words in line[i]:
                sentence += words
                if str(len(sentence.split()) != 10:
                    i--
print(sentence)
Converting the whole file into a list of words turns out to work:
words_list = []
with open('text.txt', 'r') as f:
    # split() with no argument splits on any whitespace, including newlines
    words_list = f.read().split()

ret = ''
for index, word in enumerate(words_list):
    if word == 'even':
        # clamp so the slice doesn't wrap around near the start of the file
        start_index = max(index - 10, 0)
        ret = ' '.join(words_list[start_index:index + 1])
print(ret)
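A quick way to see why this works across line breaks: str.split() with no argument splits on any whitespace, including newlines, so the ten preceding words can come from earlier lines. A minimal sketch with an inline string standing in for the file contents (the words and target are made up for illustration):

```python
text = "one two three four five six seven eight nine ten eleven target rest"
words_list = text.split()

for index, word in enumerate(words_list):
    if word == 'target':
        # clamp so the slice stays in bounds near the start
        start_index = max(index - 10, 0)
        print(' '.join(words_list[start_index:index + 1]))
```

This prints the target word together with the ten words before it.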
Your code may not work because lower() is a method, not an attribute. Also, consider defining your word outside the loop so it does not get recreated on every iteration.
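To see the difference, compare the method object itself with the result of actually calling it:

```python
word = "Ball"

# word.lower (no parentheses) is a bound method object,
# so comparing it to a string is always False
print(word.lower == "ball")    # False

# word.lower() calls the method and returns the lowercased string
print(word.lower() == "ball")  # True
```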
If your code still does not work, I've created the following, which should:
import re

myword = "myword"
sentence = ""
# s is assumed to hold the file contents; splitting with no argument
# handles newlines as well as spaces
split_sentence = s.split()
for index, word in enumerate(split_sentence):
    # remove special characters before comparing
    if re.sub(r"[.!?,'@#$%^&*()\n]", "", word).lower() == myword:
        # make sure the start index is in bounds
        start_index = max(index - 10, 0)
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "
print(sentence)
This builds a sentence of up to 10 words leading up to the word you're looking for, punctuation included. If you only need the words without the punctuation, this should do the trick:
import re

myword = "myword"
sentence = ""
# remove special characters; replace newlines with spaces first so words
# on adjacent lines don't get glued together
split_sentence = re.sub(r"[.!?,'@#$%^&*()]", "", s.replace("\n", " ")).split()
for index, word in enumerate(split_sentence):
    if word.lower() == myword:
        # make sure the start index is in bounds
        start_index = max(index - 10, 0)
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "
print(sentence)
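As a quick check of the punctuation-stripping variant, here is the same idea run on a hypothetical two-line string standing in for the file (the contents of s and the target word are made up for illustration):

```python
import re

s = "one two, three four five! six seven eight nine ten\neleven Myword rest"
myword = "myword"

# strip punctuation, turn newlines into spaces, then split on whitespace
split_sentence = re.sub(r"[.!?,'@#$%^&*()]", "", s.replace("\n", " ")).split()
for index, word in enumerate(split_sentence):
    if word.lower() == myword:
        start_index = max(index - 10, 0)
        # the 10 words before the match, crossing the line break
        print(" ".join(split_sentence[start_index:index]))
```

Here the match sits on the second line, and the window reaches back into the first line, which is exactly the behaviour the question asks for.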
I don't know what your file looks like, so I've used a string to simulate it. My version takes the 10 words before the match on each line (or all the preceding words if there are fewer than 10) and returns a final list covering every line that contains the word.
def get_10_words(file, word_to_find):
    file_10_words_list = []
    cont = 0
    for line in file.lower().split('\n'):
        new_line = line.split(' ')
        # pad the front so there are always 10 slots before the match
        for c in range(10):
            new_line.insert(0, '')
        try:
            word_index = new_line.index(word_to_find.lower())
        except ValueError:
            print(f"Line {cont + 1} hasn't got {word_to_find.title()}")
        else:
            words_before_list = [new_line[element + word_index] for element in range(-10, 0)]
            # drop the padding entries
            words_before_list = [element for element in words_before_list if element != '']
            file_10_words_list.append(words_before_list)
        cont += 1
    return file_10_words_list


if __name__ == '__main__':
    words = get_10_words('This is the line one This is the line one This is the line one Haha\n'
                         'This is the line two This is the line two This is the line two How\n'
                         'This is the line tree Haha', 'Haha')
    print(words)
If there is anything unclear in my code, you can ask me here!
Since you tagged nlp, here is a suggestion using spacy.
# pip install spacy
# python -m spacy download en_core_web_sm
import spacy

with open("file.txt", "r") as f:
    text = f.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

searchedWord = "StackOverflow"
occu = [i for i, word in enumerate(doc) if word.text == searchedWord]

out = []
for i in occu:
    # walk backwards, collecting the 4 preceding words while
    # skipping punctuation and whitespace tokens
    w = []
    j = i - 1
    while j >= 0 and len(w) < 4:
        token = doc[j]
        if not (token.is_punct or token.is_space):
            w.insert(0, token.text)
        j -= 1
    out.append(w)
NB: In this example, we target the 4 words before the searched one (while skipping punctuation and whitespace). The result is a nested list, to handle the case where the word occurs more than once in the text file. We're using the English model, but of course there are many other languages available; check the list here.
Output:
print(out)
#[['A', 'question', 'from', 'Whichman'], ['An', 'answer', 'from', 'Timeless']]
Input/text file used: