
python extract sentences containing keyword(s)

I am writing a script to extract from a text file any sentence containing any one of several keywords.

The first version of the script is:

    from nltk import word_tokenize

    keywords = ['coal', 'solar']

    fileinE = ["We provide detailed guidance on our equity coal capital raising plans",
               "First we are seizing the issuance of new shares under the DRIP program with immediate effect",
               "Resulting in a total of about $160 million of new share solar issued under the program in 2020"]

    fileinF = []

    for sent in fileinE:
        tokenized_sent = [word.lower() for word in word_tokenize(sent)]
        # keep the tokenized sentence if any keyword appears among its tokens
        if any(keyw in tokenized_sent for keyw in keywords):
            fileinF.append(tokenized_sent)
    print(fileinF)
    
    [['we', 'provide', 'detailed', 'guidance', 'on', 'our', 'equity', 'coal', 
    'capital', 'raising', 'plans'], ['resulting', 'in', 'a', 'total', 'of', 
    'about', '$', '160', 'million', 'of', 'new', 'share', 'solar', 'issued', 
    'under', 'the', 'program', 'in', '2020']]

The script performed as intended.

I then changed the script to read the keywords in from a file:

    with open('KeywordsEDF A.txt', 'r') as filein:
        keywords = filein.read()

    fileinF = []

    print(keywords)

    for sent in fileinE:
        tokenized_sent = [word.lower() for word in word_tokenize(sent)]
        if any(keyw in tokenized_sent for keyw in keywords):
            fileinF.append(tokenized_sent)
    print(fileinF)
        
        ['coal','solar']


        [['resulting', 'in', 'a', 'total', 'of', 'about', '$', '160',
        'million', 'of', 'new', 'share', 'solar', 'issued', 'under', 'the',
        'program', 'in', '2020']]

There is a problem. The output (fileinF) does not contain the sentence ['we', 'provide', 'detailed', 'guidance', 'on', 'our', 'equity', 'coal', 'capital', 'raising', 'plans'], and the only difference I can see between the two scripts is that in the first the keywords are defined within the script, while in the second they are read in from a file.

Advice or insight on how to correct the problem would be appreciated.

Based on your provided code, I was able to produce a working output. Make sure to format your code correctly when you ask a question, as issues may be due to whitespace or other factors (the quotes around a list item were being broken by the apostrophe in "we're").

from nltk import word_tokenize

'''
with open ('KeywordsEDF A.txt','r') as filein:
    keywords=filein.read()
'''

keywords = ['coal', 'solar']

fileinE = ["We provide detailed guidance on our equity coal capital raising plans",
           "First, we’re seizing the issuance of new shares under the DRIP program with immediate effect",
           "Resulting in a total of about $160 million of new share solar issued under the program in 2020"]

# extract sentences containing keywords
fileinF = []
for sent in fileinE:
    tokenized_sent = [word.lower() for word in word_tokenize(sent)]
    if any(keyw in tokenized_sent for keyw in keywords):
        fileinF.append(sent)
print(fileinF)
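
If the keywords really do need to come from a file, note that `filein.read()` returns the entire file as a single string, and iterating over a string yields individual characters; `any(keyw in tokenized_sent for keyw in keywords)` then tests one-character "keywords" such as 'a' against the tokens, which would explain why only the sentence containing the token 'a' survived. A minimal sketch of loading the file into a proper list, assuming KeywordsEDF A.txt holds one keyword per line:

    with open('KeywordsEDF A.txt', 'r') as filein:
        # build a list such as ['coal', 'solar'] instead of the
        # single string that filein.read() would return
        keywords = [line.strip().lower() for line in filein if line.strip()]
    print(keywords)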

Assuming you want the original sentence and not a tokenized sentence, the output will be as below:

['We provide detailed guidance on our equity coal capital raising plans', 'Resulting in a total of about $160 million of new share solar issued under the program in 2020']

This could also help:

    # read the file and split it into a list of sentences (one per line)
    with open('your_file_path') as f:
        sentences = f.read().lower().split('\n')

    keywords = ['coal', 'solar']

    # every sentence that contains at least one keyword ends up in result
    result = [sen for sen in sentences if any(key in sen for key in keywords)]
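
Note that `key in sen` here is a substring test on the whole sentence, so 'coal' would also match inside a longer word such as 'charcoal'. If whole-word matches are needed, a tokenized variant along the lines of the first answer (assuming NLTK is available, as in the question) might look like this:

    from nltk import word_tokenize

    # whole-word variant: 'coal' matches the token 'coal' but not 'charcoal'
    result = [sen for sen in sentences
              if any(key in word_tokenize(sen) for key in keywords)]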
