简体   繁体   中英

Removing cells/rows that don't contain regex match with Python

I have a csv of a single column containing supposedly phone numbers. However, there are several cells containing the wrong number of digits, alphabet and symbols, etc. I just want to keep every row whose cell contains exactly ten numerical digits and delete all other rows.

I also have a file that includes a weird non-ascii character in the middle of every cell in the name column (after the first name). Is there a simple way to discard every non-ascii character in a file? I just want to get rid of it, though replacing it with a space would be fine, too.

I already have pandas imported in the code I'm running on my csvs, so if that helps, it's available.

First, let's generate some random phone numbers -- most will be good, some bad:

import random

li=[]
for i in range(50):
    li.append('{}-{}-{}'.format(random.randrange(1,999), random.randrange(1,999), random.randrange(100,9999))) 

Then filter those:

import re
new_li=[e for e in li if re.match(r'\d\d\d-\d\d\d-\d\d\d\d', e)]

print new_li
print set(li)-set(new_li)    # show the ones removed...

If you want to do it Pandas, you can use the same regex and use .apply to the DataFrame:

df=pd.DataFrame(li, columns=['Phone'])
pat=re.compile(r'^\d{3}-\d{3}-\d{4}$')
print df
print df[df['Phone'].apply(lambda s: bool(pat.search(s)))]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM