
Python Pandas Data Cleaning

I am trying to read a large log file whose rows use several different delimiters (a legacy issue).

Code

import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        # os.walk yields bare file names, so join each one with its directory
        path = os.path.join(root, file)
        df = pd.read_csv(path, sep='\n', header=None, skipinitialspace=True)
        df = df[0].str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})
        df.email = df.email.str.lower()
        print(df)

Input file

user1@email.com         address1
User2@email.com    address2
 user3@email.com,address3
  user4@email.com;;addre'ss4
UseR5@email.com,,address"5
user6@email.com,,address;6
single.col1;
 single.col2                 [spaces at the beginning of the row]
    single.col3              [tabs at the beginning of the row]
nonascii.row;data.is.junk-Œœ
not.email;address11
not_email;address22

Issues

  • Rows that contain any non-ASCII characters need to be removed from the DataFrame (the full row must be excluded and purged).
  • Rows with tabs or spaces at the beginning need to be trimmed. I have 'skipinitialspace=True', but it seems this does not remove the tabs.
  • The 'df.email' column needs to be checked against a valid email regex format. If a value does not match, the full row needs to be purged.

I would appreciate any help.

import pandas as pd

# read each line as a single-column row; note that recent pandas versions may reject sep='\n'
df = pd.read_csv(file, sep='\n', header=None)

# remove leading/trailing whitespace (spaces and tabs), then split on the first run of delimiters
df = df[0].str.strip().str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})

# drop rows whose data contains characters outside code points 32-255
# (adapt the upper bound to your needs, e.g. '[^ -~]' for strict ASCII)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

# drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]
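
To see the three steps act on the question's data, here is a minimal, self-contained sketch. It makes a few assumptions that are not in the original post: the sample rows are hard-coded in a list (with the bracketed notes dropped and two literal tabs standing in for the "tabs at the beginning" row), the lines are loaded into a pandas Series directly instead of going through read_csv, and `clean_lines` is a hypothetical helper name used only for illustration.

```python
import pandas as pd

# Sample rows copied from the question's input file (bracketed notes omitted;
# the two leading tabs on the single.col3 row are an assumption).
SAMPLE_LINES = [
    "user1@email.com         address1",
    "User2@email.com    address2",
    " user3@email.com,address3",
    "  user4@email.com;;addre'ss4",
    'UseR5@email.com,,address"5',
    "user6@email.com,,address;6",
    "single.col1;",
    " single.col2",
    "\t\tsingle.col3",
    "nonascii.row;data.is.junk-Œœ",
    "not.email;address11",
    "not_email;address22",
]

# Same pattern as in the answer, written as a raw string.
EMAIL_RE = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"


def clean_lines(lines):
    """Hypothetical helper: apply the answer's three cleaning steps to raw lines."""
    raw = pd.Series(list(lines), dtype="object")
    # Strip leading/trailing spaces and tabs, then split on the first delimiter run.
    df = raw.str.strip().str.split("[,|;: \t]+", n=1, expand=True)
    df = df.rename(columns={0: "email", 1: "data"})
    df["email"] = df["email"].str.lower()
    # Drop rows whose data holds characters outside code points 32..255.
    df = df[~df["data"].fillna("").str.contains("[^ -ÿ]")]
    # Drop rows whose first field does not look like an email address.
    df = df[df["email"].fillna("").str.contains(EMAIL_RE)]
    return df.reset_index(drop=True)


cleaned = clean_lines(SAMPLE_LINES)
print(cleaned)
```

With this input, only the six user1 to user6 rows should survive: the single-column rows and the two "not email" rows fail the email check, and the junk row is removed by the character-range filter.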

The email regex was taken from here (I only changed the parentheses to non-capturing groups). If you want to be comprehensive, you can use this monster regex as well.
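
One caveat worth illustrating: the character-range filter in the answer inspects only the data column, and in Python 3 `\w` also matches non-ASCII letters, so an address with accented characters would pass the email check. Assuming the goal is to reject any row containing non-ASCII characters, one option is to compile the same pattern with the standard `re.ASCII` flag. The helper name `is_ascii_email` and the test addresses below are made up for illustration.

```python
import re

# Same pattern as in the answer; re.ASCII restricts \w to [A-Za-z0-9_].
EMAIL_RE = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
ASCII_EMAIL = re.compile(EMAIL_RE, re.ASCII)
UNICODE_EMAIL = re.compile(EMAIL_RE)  # default behaviour: \w is Unicode-aware


def is_ascii_email(value):
    """Hypothetical helper: True only for ASCII-only, email-shaped strings."""
    return ASCII_EMAIL.search(value) is not None


print(is_ascii_email("user1@email.com"))                      # plain ASCII address
print(is_ascii_email("usér@email.com"))                       # accented local part
print(UNICODE_EMAIL.search("usér@email.com") is not None)     # default flags
```

In pandas, the same flag can be passed through as `df.email.str.contains(email_re, flags=re.ASCII)`, since `str.contains` forwards `flags` to the `re` module.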
