How to extract unknown number of different parts from string with Python regex?
regex extract different parts of a string in consistent order
I have a list of strings:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
I want to extract the date and the name of the person that follows "with".
I can do:
import re
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'
matched = [re.match(pattern, x).groups() for x in my_strings]
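But it fails because the pattern does not match "with Tom on 2015-06-30", so re.match returns None for that string.
How do I write a regex pattern that does not care about the order in which the date and the person appear in the string?
And how do I make sure the groups() method returns them in the same order every time?
I would like the output to look like this: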
[('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
What about using two separate regular expressions?
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
pattern = r'.*(\d{4}-\d{2}-\d{2})'
dates = [re.match(pattern, x).groups()[0] for x in my_strings]
pattern = r'.*with (\w+).*'
persons = [re.match(pattern, x).groups()[0] for x in my_strings]
output = list(zip(dates, persons))  # materialize the pairs so they print as a list
print(output)
## [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
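A slightly more defensive variant of the same two-pattern idea, as a minimal sketch. It uses re.search, so no leading .* is needed, and it returns None instead of raising when either part is missing. The helper name extract_date_and_person and the added word boundaries are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Two independent patterns: one per piece of information.
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
PERSON_RE = re.compile(r"\bwith (\w+)")


def extract_date_and_person(text):
    """Return (date, person), or None if either part is missing.

    Hypothetical helper: the order of the tuple is fixed here,
    regardless of where each part appears in the string.
    """
    date = DATE_RE.search(text)
    person = PERSON_RE.search(text)
    if date is None or person is None:
        return None
    return date.group(1), person.group(1)


my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

results = [extract_date_and_person(s) for s in my_strings]
print(results)
```

Because each pattern is searched separately, the order of the parts in the input does not matter, and the output order is decided only by the return statement.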
This should work:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
alternates = r"(?:\b(\d{4}-\d\d-\d\d)\b|with (\w+)|.)*"
for tc in my_strings:
    print(tc)
    m = re.match(alternates, tc)
    if m:
        print("\t", m.group(1))
        print("\t", m.group(2))
The output is:
$ python test.py
2002-03-04 with Matt
2002-03-04
Matt
Important: 2016-01-23 with Mary
2016-01-23
Mary
with Tom on 2015-06-30
2015-06-30
Tom
However, something like this is not entirely intuitive. I encourage you to use named groups wherever possible.
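As a rough sketch of what the named-group version could look like: the pattern is the same alternation as above, with the two capturing groups given names. The group names date and person are assumptions chosen for illustration.

```python
import re

# Same alternation as above, but each capture has a name.
# On every repetition the pattern consumes a date, a "with <name>"
# phrase, or any single character; each named group keeps the text
# from the last repetition in which it took part.
alternates = re.compile(
    r"(?:\b(?P<date>\d{4}-\d\d-\d\d)\b|with (?P<person>\w+)|.)*"
)

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

results = []
for s in my_strings:
    m = alternates.match(s)
    # Look the groups up by name, so the output order does not
    # depend on the order of the groups inside the pattern.
    results.append((m.group("date"), m.group("person")))

print(results)
```

If a string contains no date or no "with" phrase, the corresponding group is None, so callers may want to check for that before using the tuple.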
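For educational purposes, a non-regex approach could use the dateutil parser in "fuzzy" mode to extract the dates, and named entity recognition from the nltk toolkit to extract the names. Full code: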
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from dateutil.parser import parse
def extract_names(text):
    tokenizer = SpaceTokenizer()
    toks = tokenizer.tokenize(text)
    pos = pos_tag(toks)
    chunked_nes = ne_chunk(pos)
    return [' '.join(map(lambda x: x[0], ne.leaves())) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30"
]
for s in my_strings:
    print(parse(s, fuzzy=True))
    print(extract_names(s))
Prints:
2002-03-04 00:00:00
['Matt']
2016-01-23 00:00:00
['Mary']
2015-06-30 00:00:00
['Tom']
But this probably overcomplicates the problem.
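If you use Python's newer regex module, you can use conditionals to guarantee that both items are matched.
I think this is closer to the standard way of doing an out-of-order match.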
(?:.*?(?:(?(1)(?!))\b(\d{4}-\d\d-\d\d)\b|(?(2)(?!))with[ ](\w+))){2}
Expanded:
(?:
    .*?
    (?:
        (?(1)(?!))
        \b
        ( \d{4} - \d\d - \d\d )   # (1)
        \b
      | (?(2)(?!))
        with [ ]
        ( \w+ )                   # (2)
    )
){2}
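For readers who only have the standard library re module, a similar "both items, in any order" guarantee can be sketched with two lookaheads anchored at the start of the string. This is a different technique from the conditional pattern above, not a translation of it, and the group names date and person are assumptions for illustration.

```python
import re

# Each lookahead scans forward from the start of the string without
# consuming anything, so the two parts can appear in either order.
# The match fails unless BOTH lookaheads succeed.
both = re.compile(
    r"""
    ^
    (?=.*?\b(?P<date>\d{4}-\d\d-\d\d)\b)   # a date somewhere in the string
    (?=.*?\bwith[ ](?P<person>\w+))        # 'with <name>' somewhere in the string
    """,
    re.VERBOSE,
)

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

results = []
for s in my_strings:
    m = both.match(s)
    if m:
        results.append((m.group("date"), m.group("person")))

print(results)
```

Unlike the alternation approach, a string that is missing either the date or the "with" phrase produces no match at all, which mirrors the guarantee the conditional version is aiming for.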