Using regex re.sub to remove everything before and including a specified word

Question

I've got a string that looks like "Blah blah blah, Updated: Aug. 23, 2012", from which I want to use regex to extract just the date, Aug. 23, 2012. I found a post in the stacks with something similar, "regex to remove all text before a character", but that's not working either when I try:

date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^[^Updated]*',"", date_div)

How can I remove everything up to and including "Updated: ", so that only Aug. 23, 2012 is left over?

Thanks!

Answer 1

In this case, you can do it without regex, e.g.:

>>> date_div = "Blah blah blah, Updated: Aug. 23, 2012"
>>> date_div.split('Updated: ')
['Blah blah blah, ', 'Aug. 23, 2012']
>>> date_div.split('Updated: ')[-1]
'Aug. 23, 2012'
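
One caveat with the split approach: if the marker is missing, split(...)[-1] silently returns the whole string. A minimal sketch of a variant that makes that case explicit, using str.partition; the helper name text_after is hypothetical and not part of the original answer:

```python
def text_after(text, marker):
    """Return the text after the first occurrence of marker, or None if absent."""
    # partition splits at the first occurrence and returns (before, sep, after);
    # sep is an empty string when the marker is not found.
    _before, sep, after = text.partition(marker)
    return after if sep else None


date_div = "Blah blah blah, Updated: Aug. 23, 2012"
print(text_after(date_div, "Updated: "))          # Aug. 23, 2012
print(text_after("no marker here", "Updated: "))  # None
```

Note that partition cuts at the first occurrence of the marker, whereas split(...)[-1] keeps what follows the last one; str.rpartition would give the latter behaviour.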

Answer 2

You can use a lookahead:

import re
date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^(.*)(?=Updated)',"", date_div)
print(extracted_date)

Output:

Updated: Aug. 23, 2012

EDIT
If MattDMo's comment below is correct and you want to remove the "Updated: " as well, you can do:

extracted_date = re.sub('^(.*Updated: )',"", date_div)
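
For comparison, here is a small sketch (Python 3) that runs the pattern from the question next to the two patterns above. The likely reason the original attempt fails is that [^Updated] is a character class: it matches any single character other than U, p, d, a, t or e, not "anything up to the word Updated". The variable names are only illustrative:

```python
import re

date_div = "Blah blah blah, Updated: Aug. 23, 2012"

# Character class from the question: matching stops at the first 'a',
# so only the leading "Bl" is removed.
attempt = re.sub(r'^[^Updated]*', '', date_div)

# Lookahead: matches everything before "Updated" without consuming the word.
kept = re.sub(r'^.*(?=Updated)', '', date_div)

# Consuming match: the marker itself is part of the match and is removed too.
removed = re.sub(r'^.*Updated: ', '', date_div)

print(attempt)  # ah blah blah, Updated: Aug. 23, 2012
print(kept)     # Updated: Aug. 23, 2012
print(removed)  # Aug. 23, 2012
```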

Answer 3

With a regex, you may use one of two patterns, depending on which occurrence of the word you want to cut at:

# Remove all up to the first occurrence of the word including it (non-greedy):
^.*?word
# Remove all up to the last occurrence of the word including it (greedy):
^.*word

See the non-greedy regex demo and the greedy regex demo.

The ^ matches the start-of-string position. .*? matches any zero or more characters, as few as possible (mind the use of the re.DOTALL flag so that . can also match newlines), while .* matches as many as possible. Then word matches and consumes the word, i.e. adds it to the match and advances the regex index.

Note the use of re.escape(up_to_word) in the demo below: if your up_to_word does not consist solely of alphanumeric and underscore characters, it is safer to use re.escape so that special characters like (, [, ?, etc. cannot prevent the regex from finding a valid match.

See the Python demo:

import re

date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"

up_to_word = "Updated:"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))

print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())

Output:

Remove all up to the first occurrence of the word including it:
Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
Remove all up to the last occurrence of the word including it:
Feb. 13, 2019
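
To illustrate why re.escape matters, here is a short sketch with a marker that contains regex metacharacters. The marker "Updated (UTC):" and the sample string are made up for this example and do not come from the question:

```python
import re

text = "Blah blah, Updated (UTC): Aug. 23, 2012"
up_to_word = "Updated (UTC):"  # hypothetical marker containing parentheses

# Unescaped: "(UTC)" is parsed as a capture group, so the literal
# parentheses in the text never match and nothing is removed.
unescaped = re.sub(r'^.*?' + up_to_word, '', text, flags=re.DOTALL).strip()

# Escaped: every metacharacter in the marker is treated literally.
escaped = re.sub(r'^.*?' + re.escape(up_to_word), '', text, flags=re.DOTALL).strip()

print(unescaped)  # Blah blah, Updated (UTC): Aug. 23, 2012
print(escaped)    # Aug. 23, 2012
```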

Answer 1: score 8, accepted, 2014-07-30 19:36:16
Answer 2: score 5, 2014-07-30 19:33:43
Answer 3: score 2, 2019-02-13 08:20:04
