
Python Regex for Phone Numbers is acting strangely

I've written a Python regex that extracts phone numbers from text correctly about 90% of the time, but it occasionally produces odd results. My code is as follows:

import re

# df is assumed to be an existing pandas DataFrame with a 'text' column
phone_pattern = re.compile(r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
df['phone'] = df['text'].apply(lambda x: phone_pattern.findall(x))  # list of matches per row
df['phone'] = df['phone'].apply(', '.join)  # an empty list becomes ''

This code extracts the phone numbers into a new column called "phone". If a row contains multiple numbers, they are separated by commas.

The following text, however, generates a weird output:

university of blah school of blah blah blah (jane doe doe) 1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or maybe Monday.

The output my current code gives me is:

890 1234

Rather than the desired number:

1234567890

This happens with a few examples. I've tried editing the regex, but that only makes it worse. Any help would be appreciated. I also think this question is useful, because many of the phone regexes offered on Stack Overflow haven't worked for me.

You may use

(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b

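Note that a \b word boundary is added before the first and third alternatives only; the second one starts with \(, which matches a literal ( and needs no word boundary check. There is a word boundary at the end, too. In addition, the [-.\s] delimiter in the first alternative is made optional: the ? quantifier makes it match one or zero times. Without the boundaries and the optional delimiters, the original pattern could not match 1234567890 as a whole, so its third alternative picked up 890 1234 from the middle of "1234567890 1234".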
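To make the difference concrete, here is a minimal sketch that runs both patterns on the sample text from the question using only the standard re module. The names OLD_PATTERN, NEW_PATTERN and SAMPLE are illustrative and not part of the original code.

```python
import re

# Pattern from the question: delimiters are mandatory, no word boundaries.
OLD_PATTERN = re.compile(
    r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})'
)

# Pattern from the answer: word boundaries added, delimiters in the
# first alternative made optional.
NEW_PATTERN = re.compile(
    r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
)

SAMPLE = (
    "university of blah school of blah blah blah (jane doe doe) 1234567890 "
    "1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or "
    "maybe Monday."
)

print(OLD_PATTERN.findall(SAMPLE))  # ['890 1234']
print(NEW_PATTERN.findall(SAMPLE))  # ['1234567890']
```

The old pattern starts matching in the middle of the digit run because nothing anchors it to the start of a number; the \b before the first and third alternatives rules that out.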

In Pandas, just use

rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
df['phone'] = df['text'].str.findall(rx).apply(', '.join)
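As a rough end-to-end check, the following sketch applies the same two lines to a small made-up DataFrame. The sample rows are hypothetical, and the fillna('') call is an added assumption to guard against missing text, since ', '.join cannot handle the NaN that str.findall returns for a missing value.

```python
import pandas as pd

rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'

# Hypothetical sample data: two numbers, no number, and a missing value.
df = pd.DataFrame({
    'text': [
        "call 555-123-4567 or (555) 987-6543",
        "no number here",
        None,
    ]
})

# fillna('') is an assumption added here so that missing rows yield an
# empty match list instead of NaN.
df['phone'] = df['text'].fillna('').str.findall(rx).apply(', '.join)

print(df['phone'].tolist())
# ['555-123-4567, (555) 987-6543', '', '']
```

Rows without a match end up as empty strings, which matches the behavior of the original code in the question.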
