Python regex for phone numbers behaves unexpectedly

Question

I've developed a Python regex that pulls phone numbers from text around 90% of the time. However, there are sometimes weird anomalies. My code is as follows:

phone_pattern = re.compile(r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
df['phone'] = df['text'].apply(lambda x: phone_pattern.findall(x))
df['phone']=df['phone'].apply(lambda y: '' if len(y)==0 else y)
df['phone'] = df['phone'].apply(', '.join)

This code extracts the phone numbers and adds a new column called "phone". If there are multiple numbers, they are separated by a comma.

The following text, however, generates a weird output:

university of blah school of blah blah blah (jane doe doe) 1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or maybe Monday.

The output my current code gives me is:

890 1234

Rather than the number I actually want:

1234567890

This happens on a few examples. I've tried editing the regex, but it only makes it worse. Any help would be appreciated. Also, I think this question is useful, because a lot of the phone regexes offered on Stack Overflow haven't worked for me.

Answer 1

You may use

(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b

See the regex demo.

Note that a \b word boundary is added only before the first and third alternatives. The second one starts with \(, which matches a literal ( and needs no word boundary check. There is a word boundary at the end, too, so a match cannot stop in the middle of a longer run of digits. Besides, the [-.\s] delimiters in the first alternative are made optional: the ? quantifier makes each one match 1 or 0 times, which is what lets an unseparated number such as 1234567890 match as a whole.
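To see the difference, here is a minimal sketch that runs both patterns on the sample text from the question using only the standard re module. The names old_rx, new_rx, and extract_phones are made up for this illustration and do not come from the original post.

```python
import re

# Pattern from the question: separators are required and there are no word
# boundaries, so the third alternative can match inside a longer digit run.
old_rx = re.compile(
    r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})'
)

# Pattern from this answer: \b guards added, separators in the first
# alternative made optional.
new_rx = re.compile(
    r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
)


def extract_phones(text, pattern=new_rx):
    """Return every phone-like match in text as a list of strings (hypothetical helper)."""
    return pattern.findall(text)


sample = (
    "university of blah school of blah blah blah (jane doe doe) 1234567890 "
    "1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or "
    "maybe Monday."
)

print(extract_phones(sample, old_rx))  # old pattern grabs the tail of the number
print(extract_phones(sample, new_rx))  # new pattern matches the whole number
```

With the old pattern, the third alternative picks up the last three digits of the phone number, the space, and the following four-digit house number. With the new pattern, the leading \b prevents a match from starting in the middle of a digit run.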

In Pandas, just use

rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
df['phone'] = df['text'].str.findall(rx).apply(', '.join)
