
Using a keyword to print a sentence in Python

Hello, I am writing a Python program that reads through a given .txt file and looks for keywords. Once I have found my keyword (for example 'data'), I would like to print out the entire sentence the word appears in.

I have read in my input file and used the split() method to get rid of spaces, tabs and newlines and put all the words into a list.

Here is the code I have so far.

text_file = open("file.txt", "r")
lines = text_file.read().split()
text_file.close()
keyword = 'data'

for token in lines:
    if token == keyword:
        # I have found my keyword. What methods can I use to
        # print out the words before and after the keyword?
        # I have a feeling I want to use '.' as a marker for sentences.
        print(sentence)  # placeholder: should print the entire sentence (sentence is not defined yet)

file.txt reads as follows:

Welcome to SOF! This website securely stores data for the user.

Desired output:

This website securely stores data for the user.

We can just split the text on the characters that mark the end of a line (here, a sentence), then loop through those lines and print the ones that contain our keyword.

To split text on multiple characters (for example, a line ending can be marked with !, ? or .) we can use a regex:

import re

keyword = "data"
line_end_chars = "!", "?", "."
example = "Welcome to SOF! This website securely stores data for the user?"
regexPattern = '|'.join(map(re.escape, line_end_chars))
line_list = re.split(regexPattern, example)

# line_list looks like this:
# ['Welcome to SOF', ' This website securely stores data for the user', '']

# Now we just need to see which lines have our keyword
for line in line_list:
    if keyword in line:
        print(line)

But keep in mind that if keyword in line: matches a sequence of characters, not necessarily a whole word - for example, 'data' in 'datamine' is True. If you only want to match whole words, you ought to use regular expressions: source explanation with example
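A minimal sketch of the whole-word check mentioned above, using the standard re module. The helper name contains_word is hypothetical and not part of the original answer, and the sketch assumes the keyword starts and ends with a word character (letter, digit or underscore), since \b only marks boundaries next to such characters.

```python
import re

def contains_word(line, keyword):
    """Return True if keyword occurs in line as a whole word."""
    # \b is a word boundary; re.escape makes the keyword match literally
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, line) is not None

print(contains_word("This website securely stores data for the user", "data"))  # True
print(contains_word("Welcome to the datamine", "data"))                         # False
```

Replacing the if keyword in line: test with a call like this keeps the rest of the loop unchanged.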

Source for regex delimiters

My approach is similar to Alberto Poljak's but a little more explicit.

The motivation is the realisation that splitting into words is unnecessary: Python's in operator will happily find a word in a sentence. What is necessary is splitting the text into sentences. Unfortunately, sentences can end with ., ? or !, and Python's str.split method accepts only a single separator. So we have to get a little more complicated and use re.

re requires us to put a | between each delimiter and to escape some of them, because both . and ? have special meanings by default. Alberto's solution used re itself (re.escape) to do all this, which is definitely the way to go. But if you're new to re, my hard-coded version might be clearer.

The other addition I made was to put each sentence's trailing delimiter back on the sentence it belongs to. To do this I wrapped the delimiters in (), which captures them so that they appear in the output of re.split. I then used zip to put them back on the sentences they came from. The 0::2 and 1::2 slices take every even index (the sentences) and every odd index (the delimiters), and each sentence is concatenated with its delimiter. Uncomment the print statement to see what's happening.

import re

lines = "Welcome to SOF! This website securely stores data for the user. Another sentence."
keyword = "data"

# Raw string so that \. and \? reach the regex engine unchanged
sentences = re.split(r'(\.|!|\?)', lines)

sentences_terminated = [a + b for a,b in zip(sentences[0::2], sentences[1::2])]

# print(sentences_terminated)

for sentence in sentences_terminated:
    if keyword in sentence:
        print(sentence)
        break

Output:

 This website securely stores data for the user.
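One thing to watch with the zip approach: if the text ends without a terminator, the last sentence has no delimiter to pair with and zip silently drops it. The sketch below is one possible variation, not the original answer's code. It swaps in itertools.zip_longest to keep that final fragment, strips the leading space, and returns every matching sentence instead of stopping at the first. The function name sentences_with is hypothetical.

```python
import re
from itertools import zip_longest

def sentences_with(text, keyword):
    """Return every sentence in text that contains keyword, terminator included."""
    parts = re.split(r'([.!?])', text)
    # Even indices hold sentence bodies, odd indices hold their terminators.
    # zip_longest keeps a final body that has no terminator (plain zip drops it).
    pairs = zip_longest(parts[0::2], parts[1::2], fillvalue="")
    sentences = [(body + end).strip() for body, end in pairs]
    return [s for s in sentences if keyword in s]

text = "Welcome to SOF! This website securely stores data for the user. More data here"
print(sentences_with(text, "data"))
# ['This website securely stores data for the user.', 'More data here']
```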

This solution uses a fairly simple regex to find your keyword in a sentence, together with any words before and after it and one final character. That final . in the pattern is unescaped, so it matches any character, which in the example is the closing period. It copes with spaces, and it needs only one call to re.search().

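import re

with open("file.txt", "r") as text_file:
    text = text_file.read()

keyword = 'data'

# Optional words before the keyword, the keyword itself,
# optional words after it, then one final character.
match = re.search(r"\s?(\w+\s)*" + re.escape(keyword) + r"\s?(\w+\s?)*.", text)
if match:  # re.search returns None when the keyword is not found
    print(match.group().strip())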
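Because the pattern above only walks over word characters and whitespace, a comma or hyphen inside the sentence cuts the match short. A possible alternative, offered as a sketch rather than as part of the original answer, is to match everything that is not a sentence terminator on either side of the keyword. The function name find_sentences is hypothetical, and the sketch assumes that sentences end only with ., ! or ?, so abbreviations such as "e.g." would be treated as sentence ends.

```python
import re

def find_sentences(text, keyword):
    """Return all sentences in text that contain keyword as a whole word."""
    # [^.!?]* stays inside one sentence; the final [.!?] is its terminator.
    pattern = r"[^.!?]*\b" + re.escape(keyword) + r"\b[^.!?]*[.!?]"
    return [m.strip() for m in re.findall(pattern, text)]

text = "Welcome to SOF! This website, securely, stores data for the user."
print(find_sentences(text, "data"))
# ['This website, securely, stores data for the user.']
```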

Another solution:

def check_for_stop_punctuation(token):
    """Return True if the token contains sentence-ending punctuation."""
    stop_punctuation = ['.', '?', '!']
    return any(mark in token for mark in stop_punctuation)

with open("file.txt", "r") as text_file:
    lines = text_file.read().split()
keyword = 'data'

sentence = []  # tokens of the sentence currently being read

i = 0
while i < len(lines):
    token = lines[i]
    sentence.append(token)
    # Compare without trailing punctuation so that 'data.' also matches
    if token.rstrip('.?!') == keyword:
        found_stop_punctuation = check_for_stop_punctuation(token)
        # Keep reading tokens until the sentence ends or the input runs out
        while not found_stop_punctuation and i + 1 < len(lines):
            i += 1
            token = lines[i]
            sentence.append(token)
            found_stop_punctuation = check_for_stop_punctuation(token)
        # Join the tokens so a sentence is printed rather than a list
        print(' '.join(sentence))
        sentence = []
    elif check_for_stop_punctuation(token):
        # Sentence ended without the keyword: start collecting a new one
        sentence = []
    i += 1

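Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you reproduce this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.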

 