Python Pandas data cleaning

Question

I am trying to read a large log file whose rows use several different delimiters (a legacy issue).

Code

import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        # os.walk yields bare file names, so join them with their directory
        path = os.path.join(root, file)
        df = pd.read_csv(path, sep='\n', header=None, skipinitialspace=True)
        df = df[0].str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        df.email = df.email.str.lower()
        print(df)

Input file

user1@email.com         address1
User2@email.com    address2
 user3@email.com,address3
  user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
 single.col2                 [spaces at the beginning of the row]
    single.col3              [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22

Issues

Rows that contain any non-ASCII characters need to be removed from the DataFrame (the whole row must be dropped, not just the offending characters).
Rows with leading tabs or spaces need to be trimmed. I have 'skipinitialspace=True', but it does not seem to remove the tabs.
The 'df.email' column needs to be checked against a valid email regex. If a value does not match, the whole row needs to be dropped.
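
On the second point: the pandas documentation describes skipinitialspace as skipping spaces after the delimiter, so it is not expected to trim whitespace at the start of a row when each row is read as a single field. A minimal sketch of trimming with plain string methods follows; the sample rows are modelled on the input file above and the variable names are illustrative only.

```python
# Sample rows modelled on the input file: leading spaces and a leading tab.
raw_lines = [
    " user3@email.com,address3",
    "  user4@email.com;;addre'ss4",
    "\tsingle.col3",
]

# str.strip() with no arguments removes all leading and trailing whitespace,
# including tabs, so one call covers both cases.
trimmed = [line.strip() for line in raw_lines]
print(trimmed)
```

On a pandas column of strings the same trimming is available as Series.str.strip().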

Would appreciate any help.

Answer 1

# one row per line (recent pandas releases may reject sep='\n'; if so, read the lines with plain Python instead)
df = pd.read_csv(file, sep='\n', header=None)

# remove leading/trailing whitespace and split into columns
df = df[0].str.strip().str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})

# drop rows whose data contains characters below 32 or above 255 (you can adapt the upper bound to your needs)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

# drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]
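
Note that the character class [^ -ÿ] accepts everything up to U+00FF, which is wider than ASCII. The sketch below compares it with a stricter class that stops at '~' (U+007E), using the standard re module so it runs without pandas. The sample strings and variable names are assumptions made for illustration.

```python
import re

# The answer's pattern: flags anything outside U+0020..U+00FF.
latin1_filter = re.compile("[^ -\u00ff]")

# Stricter variant: flags anything outside printable ASCII, U+0020..U+007E.
ascii_filter = re.compile(r"[^ -~]")

samples = ["address1", "data.is.junk-\u0152\u0153", "caf\u00e9"]

# For each sample, record whether each filter would flag (and so drop) the row.
flags = [
    (bool(latin1_filter.search(text)), bool(ascii_filter.search(text)))
    for text in samples
]
print(flags)
```

The third sample contains U+00E9, which the wider class lets through and the stricter class flags. Either pattern can be passed to str.contains in the same way as in the answer's code.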

The email regex was taken from here (I only changed the parentheses to non-capturing groups). If you want to be comprehensive, you can use this monster regex as well.
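
For reference, the three requirements can also be sketched end to end with the standard library alone. This is an alternative to the pandas code above, not the answerer's method. The function name clean_lines and the sample rows are hypothetical. It assumes Python 3.7+ for str.isascii(), and it checks the whole line for non-ASCII characters rather than only the data column.

```python
import re

# Same delimiter set and email pattern as in the answer above.
SPLIT_RE = re.compile(r"[,|;: \t]+")
EMAIL_RE = re.compile(r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$")


def clean_lines(lines):
    """Return (email, data) pairs for rows that pass all three checks."""
    rows = []
    for line in lines:
        line = line.strip()  # trim leading/trailing spaces and tabs
        if not line or not line.isascii():
            continue  # drop empty rows and rows with any non-ASCII character
        parts = SPLIT_RE.split(line, maxsplit=1)  # split on the first delimiter run
        email = parts[0].lower()
        data = parts[1] if len(parts) > 1 else ""
        if EMAIL_RE.match(email):
            rows.append((email, data))  # keep only rows with a valid email
    return rows


# Hypothetical sample modelled on the input file in the question.
sample = [
    "user1@email.com         address1",
    " user3@email.com,address3",
    "\tUseR5@email.com,,address\"5",
    "single.col1;",
    "nonascii.row;data.is.junk-\u0152\u0153",
    "not.email;address11",
]
cleaned = clean_lines(sample)
print(cleaned)
```

The resulting list can be loaded into pandas with pd.DataFrame(cleaned, columns=['email', 'data']) if a DataFrame is needed afterwards.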

Solution 1
0 Accepted 2020-06-12 06:33:52
