regex extract different parts of a string in consistent order
I have a list of strings:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
I want to extract the date and the person's name from each string.
I could do:
import re
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'
matched = [re.match(pattern, x).groups() for x in my_strings]
but it fails because the pattern doesn't match "with Tom on 2015-06-30".
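The failure can be reproduced with a small check. This is only a sketch: the loop and the `results` name are illustrative additions, not part of the original attempt. `re.match` returns `None` when the pattern does not match, which is why calling `.groups()` on the result directly raises an `AttributeError` for the third string.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Same pattern as in the attempt above: the date must come before "with".
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'

# Guard against None instead of calling .groups() on every result.
results = []
for s in my_strings:
    m = re.match(pattern, s)
    results.append(m.groups() if m else None)

print(results)
```

The first two strings produce a tuple, while the third, where the date follows the name, produces `None`.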
How do I specify the regex pattern to be indifferent to the order in which the date and the person appear in the string?
and
How do I ensure that the groups() method returns them in the same order every time?
I expect the output to look like this:
[('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
What about doing it with two separate regexes?
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
pattern = r'.*(\d{4}-\d{2}-\d{2})'
dates = [re.match(pattern, x).groups()[0] for x in my_strings]
pattern = r'.*with (\w+).*'
persons = [re.match(pattern, x).groups()[0] for x in my_strings]
# zip() returns an iterator in Python 3, so build a list before printing
output = list(zip(dates, persons))
print(output)
## [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
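A slightly more defensive variant of the same idea, as a sketch: it uses `re.search` so no leading `.*` is needed, and it returns `None` for a missing item instead of raising. The names `DATE_RE`, `PERSON_RE`, and `extract` are hypothetical helpers introduced here for illustration.

```python
import re

# Each pattern looks for one item anywhere in the string.
DATE_RE = re.compile(r'\b(\d{4}-\d{2}-\d{2})\b')
PERSON_RE = re.compile(r'\bwith (\w+)')


def extract(text):
    """Return (date, person); either item is None if it is not found."""
    date = DATE_RE.search(text)
    person = PERSON_RE.search(text)
    return (
        date.group(1) if date else None,
        person.group(1) if person else None,
    )


my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

output = [extract(s) for s in my_strings]
print(output)
```

Because each item is searched for independently, the order in which they appear in the string does not matter, and the tuple order is fixed by the function.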
This should work:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
alternates = r"(?:\b(\d{4}-\d\d-\d\d)\b|with (\w+)|.)*"
for tc in my_strings:
    print(tc)
    m = re.match(alternates, tc)
    if m:
        print("\t", m.group(1))
        print("\t", m.group(2))
Output is:
$ python test.py
2002-03-04 with Matt
	 2002-03-04
	 Matt
Important: 2016-01-23 with Mary
	 2016-01-23
	 Mary
with Tom on 2015-06-30
	 2015-06-30
	 Tom
However, something like this is not totally intuitive. I encourage you to try using named groups if at all possible.
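As a sketch of what that could look like, here is the same alternation pattern with named groups. The group names `date` and `person` are chosen here for illustration. Looking the values up by name makes the result independent of both the order in the string and the numbering of the groups.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Same alternation as above, but each capture is given a name.
alternates = re.compile(
    r"(?:\b(?P<date>\d{4}-\d\d-\d\d)\b|with (?P<person>\w+)|.)*"
)

results = []
for tc in my_strings:
    m = alternates.match(tc)
    # A group that never matched is returned as None.
    results.append((m.group("date"), m.group("person")))

print(results)
```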
Just for educational purposes, a non-regex approach could involve using the dateutil parser in "fuzzy" mode to extract the dates and the nltk toolkit with named entity recognition to extract the names. Complete code:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from dateutil.parser import parse
def extract_names(text):
    tokenizer = SpaceTokenizer()
    toks = tokenizer.tokenize(text)
    pos = pos_tag(toks)
    chunked_nes = ne_chunk(pos)
    # Named entities come back as subtrees; join their tokens into one string
    return [' '.join(map(lambda x: x[0], ne.leaves()))
            for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30"
]
for s in my_strings:
    print(parse(s, fuzzy=True))
    print(extract_names(s))
Prints:
2002-03-04 00:00:00
['Matt']
2016-01-23 00:00:00
['Mary']
2015-06-30 00:00:00
['Tom']
That's probably an over-complication though.
If you use the newer third-party regex module for Python, you can use conditionals to get a guaranteed match on both items.
I think this is closer to a standard way of doing out-of-order matching.
(?:.*?(?:(?(1)(?!))\\b(\\d{4}-\\d\\d-\\d\\d)\\b|(?(2)(?!))with[ ](\\w+))){2}
Expanded:
(?:
     .*?
     (?:
          (?(1)(?!))
          \b
          ( \d{4} - \d\d - \d\d )   # (1)
          \b
       |  (?(2)(?!))
          with [ ]
          ( \w+ )                   # (2)
     )
){2}
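For readers who want the same guarantee (both items present, in any order) using only the built-in `re` module, one alternative is a pair of lookaheads. This is a different technique from the conditional pattern above, offered only as a sketch; the names `both` and `results` are illustrative. Each lookahead scans from the start of the string, so neither depends on where the other item sits, and the groups are numbered by their position in the pattern.

```python
import re

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# Two independent lookaheads: group 1 is always the date, group 2 the person.
both = re.compile(r"(?=.*?\b(\d{4}-\d\d-\d\d)\b)(?=.*?with (\w+))")

results = []
for s in my_strings:
    m = both.match(s)
    # The match fails as a whole if either item is missing.
    results.append(m.groups() if m else None)

print(results)
```

A string that contains only one of the two items, such as `"2002-03-04 only"`, yields no match at all rather than a partial result.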