I am trying to read a large log file that uses different delimiters between its columns (a legacy issue).
Code
import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        # join root and file name, since os.walk yields bare file names
        df = pd.read_csv(os.path.join(root, file), sep='\n', header=None,
                         skipinitialspace=True)
        df = df[0].str.split(r'[,|;: \t]+', n=1, expand=True).rename(
            columns={0: 'email', 1: 'data'})
        df.email = df.email.str.lower()
        print(df)
input-file
user1@email.com address1
User2@email.com address2
user3@email.com,address3
user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
single.col2 [spaces at the beginning of the row]
single.col3 [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22
Issues
Would appreciate any help.
df = pd.read_csv(file, sep='\n', header=None)
# remove leading/trailing whitespace and split into columns
df = df[0].str.strip().str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
# drop rows with non-ASCII characters (<32 or >255; adapt the second bound to your needs)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]
# drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]
The email regex was taken from here (just changed the parentheses to non-grouping). If you want to be comprehensive you can use this monster-regex as well.
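As a quick sanity check, the steps above can be run against the sample input from the question. This is a sketch: the DataFrame is built directly from an inline string rather than from files on disk, so the `os.walk`/`read_csv` part is skipped.

```python
import pandas as pd

# The question's sample input, inlined (leading spaces/tabs included on
# the single.col2/single.col3 rows to mimic the original file)
sample = """user1@email.com address1
User2@email.com address2
user3@email.com,address3
user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
   single.col2
\t\tsingle.col3
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22"""

df = pd.DataFrame(sample.splitlines())

# strip whitespace, then split on the first run of any delimiter
df = (df[0].str.strip()
          .str.split(r'[,|;: \t]+', n=1, expand=True)
          .rename(columns={0: 'email', 1: 'data'}))

# drop rows whose data contains characters outside the printable Latin-1 range
df = df[~df['data'].fillna('').str.contains('[^ -ÿ]')]

# drop rows whose first column is not a plausible email address
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df['email'].fillna('').str.contains(email_re)]

df = df.assign(email=df['email'].str.lower())
print(df)  # only the six user*@email.com rows survive
```

The single-column, non-ASCII, and non-email rows are all filtered out: rows without a valid address fail the anchored `email_re` match even when the delimiter split leaves them with an empty or missing `data` column.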