Removing cells/rows that don't contain regex match with Python

Question

I have a csv of a single column containing supposedly phone numbers. However, there are several cells containing the wrong number of digits, alphabet and symbols, etc. I just want to keep every row whose cell contains exactly ten numerical digits and delete all other rows.

I also have a file that includes a weird non-ascii character in the middle of every cell in the name column (after the first name). Is there a simple way to discard every non-ascii character in a file? I just want to get rid of it, though replacing it with a space would be fine, too.

I already have pandas imported in the code I'm running on my csvs, so if that helps, it's available.

Answer 1

First, let's generate some random phone numbers -- most will be good, some bad:

import random

li=[]
for i in range(50):
    li.append('{}-{}-{}'.format(random.randrange(1,999), random.randrange(1,999), random.randrange(100,9999)))

Then filter those:

import re
new_li=[e for e in li if re.match(r'\d\d\d-\d\d\d-\d\d\d\d', e)]

print new_li
print set(li)-set(new_li)    # show the ones removed...

If you want to do it Pandas, you can use the same regex and use .apply to the DataFrame:

df=pd.DataFrame(li, columns=['Phone'])
pat=re.compile(r'^\d{3}-\d{3}-\d{4}$')
print df
print df[df['Phone'].apply(lambda s: bool(pat.search(s)))]

Removing cells/rows that don't contain regex match with Python

Question

1 answers

solution1
3 ACCPTED 2014-01-07 22:53:06

Removing cells/rows that don't contain regex match with Python

Question

1 answers

solution1 3 ACCPTED 2014-01-07 22:53:06

solution1
3 ACCPTED 2014-01-07 22:53:06