Python Regex for Phone Numbers is acting strangely

I've developed a Python regex that pulls phone numbers from text correctly around 90% of the time. However, there are sometimes weird anomalies. My code is as follows:

import re

phone_pattern = re.compile(r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
df['phone'] = df['text'].apply(lambda x: phone_pattern.findall(x))
df['phone'] = df['phone'].apply(lambda y: '' if len(y) == 0 else y)
df['phone'] = df['phone'].apply(', '.join)

This code extracts the phone numbers and adds a new column called "phone". If there are multiple numbers, they are separated by commas.

The following text, however, generates a weird output:

university of blah school of blah blah blah (jane doe doe) 1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or maybe Monday.

The output my current code gives me is:

890 1234

Rather than the desired number:

1234567890

This happens on a few examples. I've tried editing the regex, but that only makes it worse. Any help would be appreciated. Also, I think this question is useful, because a lot of the phone regexes offered on Stack Overflow haven't worked for me.

You may use

(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b

See the regex demo.

Note that a \b word boundary is added before the first and third alternatives only; the second one starts with \(, which matches a literal ( and needs no word boundary check. There is a word boundary at the end, too. Besides, the [-.\s] delimiter in the first alternative is made optional: the ? quantifier makes it match one or zero times. In the original pattern, the separators were mandatory and the third alternative could start in the middle of a digit run, which is why 890 1234 was picked out of 1234567890 1234.
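To see the difference, here is a minimal sketch that runs both patterns over the sample text from the question using plain re. The variable names (text, old_rx, new_rx, old_matches, new_matches) are only illustrative and not part of the original code.

```python
import re

# Sample text from the question.
text = ("university of blah school of blah blah blah (jane doe doe) "
        "1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode "
        "in the morning or maybe Monday.")

# Pattern from the question: separators are mandatory, no word boundaries.
old_rx = re.compile(
    r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})'
)

# Pattern from this answer: optional separators in the first alternative,
# plus \b before the first and third alternatives and at the end.
new_rx = re.compile(
    r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
)

old_matches = old_rx.findall(text)
new_matches = new_rx.findall(text)

print(old_matches)  # ['890 1234']
print(new_matches)  # ['1234567890']
```

With the old pattern, the only alternative that can match is the 7-digit one, and it is free to start at the 8 inside the 10-digit run. The leading \b rules that out, and the optional separators let the first alternative take the whole run.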

In pandas, just use

rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
df['phone'] = df['text'].str.findall(rx).apply(', '.join)
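As a rough illustration of what happens to each row, the sketch below applies the same pattern with plain re.findall, which is what Series.str.findall does per element. The helper extract_phones and the sample rows are hypothetical and only serve the example.

```python
import re

rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'

def extract_phones(text):
    """Hypothetical helper: join all phone-like matches in one string."""
    # ', '.join([]) is already '', so no special case is needed
    # for rows without any match.
    return ', '.join(re.findall(rx, text))

# Made-up rows covering each alternative and the no-match case.
rows = [
    "call (123) 456-7890 or 123-456-7890",
    "front desk: 456-7890",
    "no number here",
]

phones = [extract_phones(row) for row in rows]
for row, found in zip(rows, phones):
    print(repr(row), '->', repr(found))
```

Because joining an empty list already yields an empty string, the intermediate step in the question that replaces empty lists with '' is not required.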
