Find sentence containing certain expression with regex

This is for a school programming project, and I'm supposed to use only the re import.

I am trying to find all sentences in a text file that contain a certain expression, given as a parameter, and extract them into a list. Searching other posts got me halfway there by finding the dots that start and end the sentence, but if the sentence contains a number with a dot in it, the result is ruined.

If I have a txt: This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working.

search = re.findall(r"([^.]*?" + expression + r"[^.]*)\.", txt)

The result I'm getting is ['576, I want to extract the phrase with this expression']

The result I want is ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

I'm still a beginner at this. Any help?

If I am not mistaken, you want to split the text into sentences. A regex for that is:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', txt)

If this doesn't work, you can first replace the decimal points inside numbers with commas using this regex:

txt = re.sub(r'(\d*)\.(\d+)', r'\1,\2', txt)
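Putting the two steps together, a minimal sketch might look like the following. It assumes the sample text from the question and a plain substring check for the expression; the variable names protected, sentences and matches are only illustrative. Note that this approach changes 990.576 into 990,576 and that the split consumes the closing punctuation, so the output is not identical to the result asked for.

```python
import re

txt = ("This is a text. I dont want for the result to stop in the number "
       "990.576, I want to extract the phrase with this expression. "
       "Its not working.")

# Step 1: turn the dot inside a number into a comma so it no longer
# looks like the end of a sentence.
protected = re.sub(r'(\d*)\.(\d+)', r'\1,\2', txt)

# Step 2: split on sentence punctuation. The punctuation itself is consumed,
# and the trailing empty string left by the final dot is filtered out.
sentences = [s for s in re.split(r' *[.?!][\'")\]]* *', protected) if s]

# Step 3: keep only the sentences that contain the expression.
expression = "expression"
matches = [s for s in sentences if expression in s]
print(matches)
```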

Tokenize the text into sentences with NLTK, and then use a whole word search or a regular substring check.

Example with a whole word search:

import nltk, re
text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."
sentences = nltk.sent_tokenize(text)
word = "expression"
print([sent for sent in sentences if re.search(r'\b{}\b'.format(word), sent)])
# => ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

If you do not need a whole word search, replace if re.search(r'\b{}\b'.format(word), sent) with if word in sent.
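To see the difference between the two checks, here is a small sketch. The sentence list is hard-coded (an assumption, so the snippet runs without NLTK), and re.escape is added as a precaution in case the search word contains regex metacharacters; neither is part of the answer above.

```python
import re

# Hypothetical, already-split sentences used only for illustration.
sentences = [
    "I want to extract the phrase with this expression.",
    "Regular expressions are powerful.",
]
word = "expression"

# Whole word search: "expressions" does not count as a match.
whole_word = [s for s in sentences
              if re.search(r'\b{}\b'.format(re.escape(word)), s)]

# Substring check: any occurrence counts, including inside "expressions".
substring = [s for s in sentences if word in s]

print(whole_word)
print(substring)
```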

Maybe not the best solution, but you can split the text into sentences and then look for the expression in each one, like this:

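import re

text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."

# Split on whitespace that follows '.' or '?', skipping abbreviation-like patterns
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

# Keep only the sentences that contain the expression
matching = [s for s in sentences if "I want to extract the phrase with this expression" in s]

print(matching)

# Result:
# ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']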

Hope it helps!
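Another option that stays close to the pattern in the question is to keep using re.findall but allow a dot when it sits between two digits. The sketch below is one possible way to do that, not the only one; the function name find_sentences and the helper variable piece are hypothetical, and it assumes sentences end in '.', '?' or '!'. Abbreviations such as "e.g." would still be treated as sentence ends.

```python
import re

def find_sentences(txt, expression):
    # A run of "sentence characters": anything except . ? !,
    # or a dot that sits directly between two digits (as in 990.576).
    piece = r'(?:[^.?!]|(?<=\d)\.(?=\d))*'
    # re.escape keeps regex metacharacters in the expression literal.
    pattern = piece + re.escape(expression) + piece + r'[.?!]'
    # Matches start at the space after the previous sentence, so strip them.
    return [m.strip() for m in re.findall(pattern, txt)]

txt = ("This is a text. I dont want for the result to stop in the number "
       "990.576, I want to extract the phrase with this expression. "
       "Its not working.")

result = find_sentences(txt, "expression")
print(result)
# ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']
```

Because the two alternatives inside piece can never match the same character, the pattern does not backtrack excessively on longer texts.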
