Extracting paragraphs from a text file with Python while excluding the table of contents and headings

Question

I downloaded a paper from a website and would like to use NLTK to do topic modeling on complete sentences. To that end, I am trying to exclude irrelevant words and incomplete sentences from the text file, but I still can't remove the lines that contain only a single word.

For example, the format of the text file:

I only want to keep the last sentence. The following code divides the text into a list of sentences.

import nltk

# data is assumed to hold the full text of the file as a single string
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n'.join(tokenizer.tokenize(data)))

But how can I exclude those single-word lines, line by line? Thank you.

Answer 1

This can be done by calling the split method on each line of the text file and keeping only the lines that contain more than one word.

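file_list = []
# Keep only lines with more than one word; single-word lines are dropped
with open('Your Text File.txt', 'r') as file:
    for line in file:
        # split() with no argument splits on any whitespace and drops the trailing newline
        words = line.split()
        if len(words) > 1:
            file_list.append(' '.join(words))

# Write the kept lines back out, one per line
outfile = '\n'.join(file_list)
with open('outfile.txt', 'w') as file_out:
    file_out.write(outfile)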
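
The same filter can also be applied to a string that is already in memory, which may be handy when the text is passed straight on to a tokenizer. The following is only a minimal sketch: the function name drop_short_lines, its min_words parameter, and the sample text are made up for illustration and do not come from the question.

```python
def drop_short_lines(text, min_words=2):
    """Return text with every line of fewer than min_words words removed."""
    kept = []
    for line in text.splitlines():
        words = line.split()  # splits on any run of whitespace
        if len(words) >= min_words:
            kept.append(' '.join(words))
    return '\n'.join(kept)


# Hypothetical sample: two single-word heading lines, a blank line, one paragraph line
sample = "Contents\nIntroduction\n\nThis paper studies topic models. It uses NLTK.\n"
cleaned = drop_short_lines(sample)
print(cleaned)
```

The resulting string could then be given to tokenizer.tokenize from the question. Note that a word-count threshold is only a heuristic: headings made of several words still pass the filter, so min_words may need tuning for a particular document.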

Answer 1 posted 2018-11-15 11:30:18
