Python remove sentence if it is at start of string and starts with specific words?
My strings look like:
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
If the first sentence of a string starts with "Hi" or "Hello", I want to remove it.
Desired output:
docs = ['Are you blue?',
'What is your name?',
'This is a great idea. I would love to go.',
'What is your name?',
"Let's go to the mall.",
'I am ready to go. Mom says hello.']
The regex I have is:
re.match('.*?[a-z0-9][.?!](?= )', x)
But this only gives me the first sentence, in a strange format such as:
<re.Match object; span=(0, 41), match='Hi, my name is Eric.'>
What do I need to do to get my desired output?
You can use
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
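See the regex demo. Details:
^ - start of the string
H(?:ello|i)\b - the word Hello or Hi (\b is a word boundary)
.*? - any zero or more characters other than line break characters, as few as possible
[.?!] - a ., ? or !
\s+ - one or more whitespace characters.
See the Python demo: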
import re
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
print(docs)
Output:
[
'Are you blue?',
'What is your name?',
'This is a great idea. I would love to go.',
'What is your name?',
"Let's go to the mall.",
'I am ready to go. Mom says hello.'
]
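As a side note on the "strange format" in the question: re.match returns a Match object rather than a string, and the matched text is read from it with .group(). The sketch below shows that, and also wraps the substitution above in a small helper. The names GREETING and strip_greeting are illustrative assumptions, not part of the original answer.

```python
import re

# Precompiled form of the pattern from the answer above
# (the constant name GREETING is an assumption for this sketch).
GREETING = re.compile(r'^H(?:ello|i)\b.*?[.?!]\s+')

def strip_greeting(doc):
    """Remove a leading sentence that starts with the word Hi or Hello."""
    return GREETING.sub('', doc)

text = 'Hi, my name is Eric. Are you blue?'

# re.match returns a Match object (or None), not a string;
# .group() extracts the matched text.
m = re.match(r'.*?[a-z0-9][.?!](?= )', text)
if m:
    print(m.group())         # Hi, my name is Eric.

print(strip_greeting(text))  # Are you blue?
```

Because of the \b word boundary, a string such as 'History is fun. Yes.' is left unchanged: "Hi" there is only the beginning of a longer word.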
You first have to split each string into sentences
splitted_docs = []
for doc in docs:  # avoid naming the variable `str`, which shadows the built-in
    splitted_docs.append(doc.split('.'))
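Then you want to check each sentence for Hi or Hello using the regex and add it to the final array
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if not re.match('.*?[a-z0-9][.?!](?= )', sentence):
            final_sentence.append(sentence)
    # join is a method of the separator string, not of the list
    final_docs.append('.'.join(final_sentence))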
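In practice your regex does not work here, so I changed the code to make it work, as shown below:
final_docs = []  # start again with an empty result list
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        # keep only sentences that contain neither greeting
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))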
Finally, filter your array to remove any empty strings that may have been created by the join:
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
Output:
[' Are you blue?', 'This is a great idea. I would love to go.', ' What is your name?', 'I am ready to go. Mom says hello.']
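Note that this output has four items rather than six, and two of them keep a leading space. Splitting on '.' alone never separates sentences that end in '!' or '?', so those documents are dropped entirely. The following is a minimal sketch of the same split-then-check idea that keeps every document. It assumes sentences end in '.', '?' or '!' followed by whitespace, and the function name drop_leading_greeting is hypothetical.

```python
import re

def drop_leading_greeting(doc):
    # Split off only the first sentence: break at the first run of
    # whitespace that follows '.', '?' or '!'.
    parts = re.split(r'(?<=[.?!])\s+', doc, maxsplit=1)
    # Drop the first sentence only if it begins with the word Hi or Hello
    # and some text remains after it.
    if len(parts) == 2 and re.match(r'H(?:ello|i)\b', parts[0]):
        return parts[1]
    return doc

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

result = [drop_leading_greeting(doc) for doc in docs]
print(result)
```

Using maxsplit=1 means only the first sentence boundary is considered, so the rest of each document is returned exactly as written.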
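I'll leave the full code here. Any suggestions are welcome; I'm sure this can be solved with a more functional approach that is easier to understand, but I'm not very familiar with that style.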
import re
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
# Split every document into sentences on '.'
splitted_docs = []
for doc in docs:
    splitted_docs.append(doc.split('.'))
# Keep only the sentences that contain neither greeting
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)