Best practices for grouping regular expressions in Python

Question

I have a list of strings containing arbitrary phone numbers in Python. The extension is an optional part.

st = ['(800) 555-1212',
'1-800-555-1212',
'800-555-1212x1234',
'800-555-1212 ext. 1234',
'work 1-(800) 555.1212 #1234']

My objective is to split each phone number so that I can isolate its individual groups, viz. '800', '555', '1212' and the optional '1234'.

I have tried the following code.

import re
p1 = re.compile(r'(\d{3}).*(\d{3}).*(\d{4}).*(\d{4})?')
step1 = [re.sub(r'\D','',p1.search(t).group()) for t in st]
p2 = re.compile(r'(\d{3})(\d{3})(\d{4})(\d{4})?')
step2 = [p2.search(t).groups() for t in step1]

Here p1 and p2 are the two patterns used to get the desired output: p1 finds the number, re.sub strips every non-digit from the match, and p2 splits the remaining digits into groups.

for groups in step2:
    print(groups)

The output was:

('800', '555', '1212', None)
('800', '555', '1212', None)
('800', '555', '1212', '1234')
('800', '555', '1212', '1234')
('800', '555', '1212', '1234')

Since I am a newbie, I would welcome suggestions on better ways to tackle such problems, or on best practices followed in the Python community. Thanks in advance.

Answer 1

I think re.findall and the similarity of the groups allow a simpler approach:

>>> import re
>>> from pprint import pprint
>>> res = [re.findall(r'\d{3,4}', s) for s in st]
>>> pprint(res)
[['800', '555', '1212'],
 ['800', '555', '1212'],
 ['800', '555', '1212', '1234'],
 ['800', '555', '1212', '1234'],
 ['800', '555', '1212', '1234']]
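
One caveat worth knowing: \d{3,4} also matches inside longer digit runs, so a number written without separators is split at arbitrary points. The sketch below compares the plain pattern with a stricter variant that uses lookarounds to accept only whole runs of 3 or 4 digits. The name WHOLE_RUN and the sample list are hypothetical, chosen for illustration only.

```python
import re

# Hypothetical stricter variant: a run of 3 or 4 digits that is not
# part of a longer digit run (no digit directly before or after it).
WHOLE_RUN = re.compile(r'(?<!\d)\d{3,4}(?!\d)')

samples = ['(800) 555-1212', '800-555-1212x1234', '8005551212']

# Plain pattern: happily cuts '8005551212' into '8005' and '5512'.
loose = [re.findall(r'\d{3,4}', s) for s in samples]

# Stricter pattern: finds nothing in the unseparated number.
strict = [WHOLE_RUN.findall(s) for s in samples]

print(loose)
print(strict)
```

Whether an empty result or a wrong split is preferable depends on the data; the point is only that the findall approach assumes the groups are already separated by non-digits.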

Answer 2

Instead of trying to match the entire string and capture the desired substrings, you can just match runs of digits of length 3 or 4.

Demo on Regex101: https://regex101.com/r/XNbb79/1

import re

st = ['(800) 555-1212',
'1-800-555-1212',
'800-555-1212x1234',
'800-555-1212 ext. 1234',
'work 1-(800) 555.1212 #1234']

for b in [re.findall(r'\d{3,4}', a) for a in st]:
    if len(b) == 3:
        print("number does not have extension")
        print(b)
    elif len(b) == 4:
        print("number has extension")
        print(b)

Output:

number does not have extension
['800', '555', '1212']
number does not have extension
['800', '555', '1212']
number has extension
['800', '555', '1212', '1234']
number has extension
['800', '555', '1212', '1234']
number has extension
['800', '555', '1212', '1234']
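
The length check can also be folded into a small helper that always returns the same shape, which makes the result easier to unpack. This is a minimal sketch, not part of the original answer: the function name split_phone and the choice to return None for unexpected input are assumptions.

```python
import re

def split_phone(text):
    """Return (area, prefix, line, ext) with ext set to None when absent.

    Returns None when the digit groups do not look like a
    3-3-4 number with an optional extension (hypothetical policy).
    """
    parts = re.findall(r'\d{3,4}', text)
    if len(parts) == 3:
        parts.append(None)          # no extension present
    if len(parts) != 4:
        return None                 # too few or too many groups
    area, prefix, line, ext = parts
    # The first three groups must have exactly 3, 3 and 4 digits.
    if (len(area), len(prefix), len(line)) != (3, 3, 4):
        return None
    return area, prefix, line, ext

st = ['(800) 555-1212',
      '1-800-555-1212',
      '800-555-1212x1234',
      '800-555-1212 ext. 1234',
      'work 1-(800) 555.1212 #1234']

for s in st:
    print(split_phone(s))
```

Returning a fixed-length tuple keeps the calling code free of length checks; callers only need to test the extension for None.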

Answer 3

One more (a modification of yours):

import re
pattern = re.compile(r'.*(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})[^\d]*(\d{4})?$')
print([[pattern.match(s).group(i) for i in range(1, 5)] for s in st])

#[['800', '555', '1212', None], ['800', '555', '1212', None], ['800', '555', '1212', '1234'], ['800', '555', '1212', '1234'], ['800', '555', '1212', '1234']]
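
Since the question asks about best practices, a single-pass pattern like this one is often easier to maintain when it is written with named groups and re.VERBOSE, so that each part documents itself. The following is a sketch under assumptions: the names PHONE, parse_phone, and the group names area, prefix, line and ext are hypothetical, and the extension is assumed to be exactly four digits, as in the sample data.

```python
import re

PHONE = re.compile(r'''
    (?<!\d)(?P<area>\d{3})     # area code, not preceded by another digit
    \D*(?P<prefix>\d{3})       # exchange prefix after any non-digit separators
    \D*(?P<line>\d{4})         # line number
    (?:\D*(?P<ext>\d{4}))?     # optional four-digit extension (assumed length)
''', re.VERBOSE)

def parse_phone(text):
    """Return a dict of the named groups, or None if nothing matches."""
    m = PHONE.search(text)
    return m.groupdict() if m else None

st = ['(800) 555-1212',
      '1-800-555-1212',
      '800-555-1212x1234',
      '800-555-1212 ext. 1234',
      'work 1-(800) 555.1212 #1234']

for s in st:
    print(parse_phone(s))
```

Using search instead of match removes the need for a leading .*, and checking the match object before calling groupdict avoids an AttributeError on strings that contain no phone number.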
