Using Python to extract paragraphs from a text file while excluding the table of contents and titles

I downloaded a paper from a website and would like to use NLTK to do topic modeling on complete sentences. To do that, I am trying to exclude irrelevant words and incomplete sentences from the text file, but I still can't remove the lines that contain only a single word.

For example, the format of the text file is:

I only want to keep the last sentence. The following code divides the text into a list of sentences:

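import nltk
# Load the pre-trained Punkt sentence tokenizer for English.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# data is assumed to hold the text of the file as a single string.
sentences = tokenizer.tokenize(data)
print('\n'.join(sentences))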

But how can I exclude those single-word lines, line by line? Thank you.

This can be done by calling the split method on each line of the text file and keeping only the lines that contain more than one word.

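kept_lines = []
# Read the input line by line and keep only lines with more than one word.
with open('Your Text File.txt', 'r') as infile:
    for line in infile:
        # split() with no argument splits on any whitespace, so a trailing
        # space or newline does not count as an extra word.
        words = line.split()
        if len(words) > 1:
            kept_lines.append(' '.join(words))

# Write the remaining lines back out, one per line.
with open('outfile.txt', 'w') as outfile:
    outfile.write('\n'.join(kept_lines))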
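
Note that a word-count check alone still keeps multi-word headings and table-of-contents entries. One possible refinement, sketched here, also requires a line to end in sentence punctuation. The function name keep_sentence_lines, the min_words threshold, and the set of end characters are assumptions made for illustration rather than part of the answer above, and they may need tuning for a particular paper.

```python
def keep_sentence_lines(lines, min_words=2, end_chars='.!?'):
    """Return the lines that look like complete sentences.

    A line is kept if it has at least `min_words` words and its last
    non-whitespace character is one of `end_chars`. Both rules are
    assumed heuristics, not a general definition of a sentence.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        has_enough_words = len(stripped.split()) >= min_words
        ends_like_sentence = stripped.endswith(tuple(end_chars))
        if has_enough_words and ends_like_sentence:
            kept.append(stripped)
    return kept


# Hypothetical sample mimicking a paper with a contents page and a title.
sample = [
    "Contents\n",
    "1 Introduction 3\n",
    "Topic Modeling of Papers\n",
    "\n",
    "This is the only complete sentence in the sample.\n",
]
print(keep_sentence_lines(sample))
# → ['This is the only complete sentence in the sample.']
```

The kept lines can then be joined into one string and passed to tokenizer.tokenize. One limitation of this heuristic: if a sentence is wrapped across several lines in the file, only its final line ends in punctuation, so the earlier lines would be dropped.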
