Use Regex re.sub to remove everything before and including a specified word

I've got a string that looks like "Blah blah blah, Updated: Aug. 23, 2012", from which I want to use regex to extract just the date, Aug. 23, 2012. I found an article in the stacks that covers something similar, "regex to remove all text before a character", but that did not work either when I tried:

date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^[^Updated]*',"", date_div)

How can I remove everything up to and including Updated, so that only Aug. 23, 2012 is left over?

Thanks!

In this case, you can do it without regex, e.g.:

>>> date_div = "Blah blah blah, Updated: Aug. 23, 2012"
>>> date_div.split('Updated: ')
['Blah blah blah, ', 'Aug. 23, 2012']
>>> date_div.split('Updated: ')[-1]
'Aug. 23, 2012'
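
A variation on the same idea, offered as a sketch rather than as part of the original answer: str.partition splits at the first occurrence of the separator and returns a 3-tuple, which makes a missing marker easy to detect. The helper name text_after is hypothetical.

```python
def text_after(text, marker):
    """Return the part of text after the first marker, or None if it is absent."""
    # str.partition returns (head, separator, tail); separator is '' when not found.
    _head, sep, tail = text.partition(marker)
    return tail if sep else None


date_div = "Blah blah blah, Updated: Aug. 23, 2012"
print(text_after(date_div, "Updated: "))  # Aug. 23, 2012
print(text_after(date_div, "Created: "))  # None
```

By contrast, split('Updated: ')[-1] returns the whole string unchanged when the marker is missing, and it keeps the text after the last occurrence rather than the first. str.rpartition would be the partition counterpart for the last occurrence.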

You can use a lookahead:

import re
date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^(.*)(?=Updated)',"", date_div)
print(extracted_date)

OUTPUT

Updated: Aug. 23, 2012

EDIT
If MattDMo's comment below is correct and you want to remove the "Updated: " as well, you can do:

extracted_date = re.sub('^(.*Updated: )',"", date_div)
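
For context, here is a short sketch of why the pattern from the question leaves most of the string in place, next to the consuming pattern above. The variable names attempt and fixed are only illustrative.

```python
import re

date_div = "Blah blah blah, Updated: Aug. 23, 2012"

# [^Updated] is a negated character class: it matches one character that is
# not 'U', 'p', 'd', 'a', 't' or 'e'. It does not mean "anything but the word".
# The match therefore stops at the first 'a' in "Blah".
attempt = re.sub(r'^[^Updated]*', '', date_div)
print(attempt)  # ah blah blah, Updated: Aug. 23, 2012

# Matching the literal word (plus the colon and space) consumes it as well.
fixed = re.sub(r'^.*Updated: ', '', date_div)
print(fixed)  # Aug. 23, 2012
```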

With a regex, you may use one of two patterns, depending on which occurrence of the word you want to match up to:

# Remove all up to the first occurrence of the word including it (non-greedy):
^.*?word
# Remove all up to the last occurrence of the word including it (greedy):
^.*word

See the non-greedy regex demo and the greedy regex demo.

The ^ matches the start-of-string position. .*? matches any 0+ chars, as few as possible (mind the use of the re.DOTALL flag so that . can match newlines), whereas .* matches as many as possible. Then word matches and consumes the word (i.e. adds it to the match and advances the regex index).
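
The effect of the re.DOTALL flag can be seen in a minimal sketch. The sample string is an assumption chosen so that a newline sits before the word.

```python
import re

text = "Blah blah\nblah, Updated: Aug. 23, 2012"
pattern = r'^.*?Updated:'

# Without re.DOTALL, '.' cannot cross the newline, and '^' only anchors at the
# very start of the string, so nothing matches and the text is unchanged.
without_flag = re.sub(pattern, '', text)
print(without_flag == text)  # True

# With re.DOTALL, '.' also matches '\n', so the whole prefix is removed.
with_flag = re.sub(pattern, '', text, flags=re.DOTALL).strip()
print(with_flag)  # Aug. 23, 2012
```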

Note the use of re.escape(up_to_word) in the demo below: if your up_to_word does not consist solely of alphanumeric and underscore chars, it is safer to use re.escape so that special chars like (, [, ?, etc. cannot prevent the regex from finding a valid match.

See the Python demo:

import re

date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"

up_to_word = "Updated:"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))

print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())

Output:

Remove all up to the first occurrence of the word including it:
Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
Remove all up to the last occurrence of the word including it:
Feb. 13, 2019
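
To illustrate the re.escape note, here is a small sketch with a marker that contains regex metacharacters. The marker "(last updated):" and the sample text are hypothetical and were picked only to trigger the problem.

```python
import re

text = "Blah blah (last updated): Aug. 23, 2012"
up_to_word = "(last updated):"  # hypothetical marker containing metacharacters

# Unescaped, the parentheses form a capturing group instead of matching
# literal '(' and ')', so the pattern never finds "(last updated):".
raw = re.sub(r'^.*?' + up_to_word, '', text, flags=re.DOTALL)
print(raw == text)  # True: nothing was removed

# re.escape backslash-escapes the metacharacters so they match literally.
safe = re.sub(r'^.*?' + re.escape(up_to_word), '', text, flags=re.DOTALL).strip()
print(safe)  # Aug. 23, 2012
```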
