Formatting invalid data from CSV file Python
For an assignment I need to validate a dataset (a CSV file) that I was given. It contains information about students: student number, first name, last name, date of birth, and study program. I have already done the validation itself (checking for valid and invalid data), but for clarity, these are the requirements for that:
Then I need to print the valid and corrupt lines in the following format:
### VALID LINES ###
0873226,Junette,Gurry,1987-12-05,CMD
0983960,Leoline,MacCaughen,1993-02-12,TINF
### CORRUPT LINES ###
0773226,Junette,Gur_ry,1995-12-05, => INVALID DATA: ['0773226', 'Gur_ry', '']
0795003,Edna,Douce,1957-06-23,INF => INVALID DATA: ['0795003', '1957-06-23']
Printing the valid lines works just fine. The problem is that I can't seem to print the invalid data from the corrupt lines correctly. I've been trying different things for hours now, but I cannot find a solution that works. Is there anyone who can help me out? I'll provide my code and a piece of the CSV file below.
My code:
import os
import sys
from datetime import datetime

valid_lines = []
corrupt_lines = []
tmp = []
n = 1

def validate_data(line):
    global n
    nr = False
    fn = False
    ln = False
    date = False
    prog = False
    line2 = line
    line = line.split(",")

    # Checking if the student number meets the requirements
    try:
        if line[0][0] == "0" and len(line[0]) == 7:
            if line[0][1] == "9" or line[0][1] == "8":
                nr = True
            else:
                pass
        else:
            pass
    except:
        pass

    # Checking if the first name meets the requirements
    try:
        if line[1] == '':
            pass
        elif line[1].isalpha:
            fn = True
            # print(True)
        else:
            pass
    except:
        pass

    # Checking if the last name meets the requirements
    try:
        if line[2] == '':
            pass
        elif line[2].isalpha and line[2] != "123124" and "^" not in line[2]:
            ln = True
        else:
            pass
    except:
        pass

    # Checking if the the date meets the requirements
    format = "%Y-%m-%d"
    try:
        date = bool(datetime.strptime(line[3], format))
        if date == True:
            year, month, day = line[3].split("-")
            if int(year) >=1960 and int(year) <=2004:
                date = True
            else:
                date = False
                pass
        else:
            pass
    except ValueError:
        pass

    # Checking if the study program meets the requirements
    list = ["INF", "TINF", "CMD", "AI"]
    try:
        if line[4] in list:
            prog = True
        else:
            pass
    except ValueError:
        pass

    if nr == True and fn == True and ln == True and date == True and prog == True:
        valid_lines.append(line2)
    else:
        # Trying to create a list with the invalid data.
        corrupt_lines.append(line2)
        if nr == False:
            tmp.append(line[0])
        if fn == False:
            tmp.append(line[1])
        if ln == False:
            tmp.append(line[2])
        if date == False:
            tmp.append(line[3])
        if prog == False:
            tmp.append(line[4])
        return tmp

def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='') as csv_file:
        # skip header line
        next(csv_file)
        for line in csv_file:
            validate_data(line.strip())
    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print(" => INVALID DATA [] \n".join(corrupt_lines))

if __name__ == "__main__":
    main('students.csv')
And a bit of the CSV file:
studentnumber,firstname,lastname,dateofbirth,studyprogram
0873226,Junette,Gurry,1987-12-05,CMD
0983960,Leoline,MacCaughen,1993-02-12,TINF
0875514,Derrik,Garnson,2007-06-23,CMD
0807295,Christy,Rodwell,1997-09-05,CMD
0844343,Frannie,555,1997-05-08,TINF
0798488,Darbie,Habbijam,1997-10-11,AI
0973065,Glory,McLernon,2007-07-20,AI
0803417,Selie,Gunter,1974-01-05,DS
0963866,Wyatan,Lidgey,1987-08-23,DS
0946101,Rubie,De Lorenzo,1972-01-20,CMD
0834576,Bendite,Jeenes,1974-12-10,DS
0982484,Terra,Eckert,1977-11-22,TINF
0755219,Jacky,Driuzzi,1980-07-27,CMD
0970338,Nariko,Blackley,2006-07-14,DS
0869610,,,,CMD
Thanks in advance, I really appreciate it!
The one big problem is that you're using global variables and having validate_data update those. Because tmp is a single list shared by every call, the invalid fields of all lines pile up in it, and main ignores the value that validate_data returns anyway. So while you can know which lines are bad, you cannot know which parts of each line are bad.
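To see why the shared list hides that information, here is a small sketch. The function name collect_bad and the "contains a digit" rule are made up for illustration; they are not part of the code in the question.

```python
tmp = []  # module-level list, shared by every call (as in the question)

def collect_bad(fields):
    # Illustrative rule only: treat any field containing a digit as invalid.
    for field in fields:
        if any(ch.isdigit() for ch in field):
            tmp.append(field)
    return tmp

first = collect_bad(["Frannie", "555"])
second = collect_bad(["Edna", "Dou9ce"])

print(second)           # ['555', 'Dou9ce']
print(first is second)  # True
```

The second call still carries the bad field from the first line, and both calls hand back the very same list object, so there is no way to tell which entry belongs to which line.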
I recommend a decent restructure:
Declare a local variable in validate_data, like bad_data = [], and return that instead of tmp:
    ...
    bad_data = []
    if nr == False:
        bad_data.append(line[0])
    if fn == False:
        bad_data.append(line[1])
    if ln == False:
        bad_data.append(line[2])
    if date == False:
        bad_data.append(line[3])
    if prog == False:
        bad_data.append(line[4])
    return bad_data
Now you don't need to check whether all the parts are True; just check each part and update bad_data accordingly. If bad_data is empty ([]) when validate_data returns, that means the line is valid.
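For reference, a compact version of validate_data along these lines might look like the sketch below. The rules are assumptions read off the code in the question (seven-character student number starting with 08 or 09, purely alphabetic names, birth year from 1960 to 2004, one of four study programs), since the actual requirements list is not shown. The digits-only check and the handling of lines with the wrong number of fields are also my own assumptions.

```python
from datetime import datetime

PROGRAMS = {"INF", "TINF", "CMD", "AI"}

def validate_data(line):
    """Return the invalid fields of one CSV line; an empty list means valid."""
    fields = line.split(",")
    if len(fields) != 5:
        # Assumption: a line with the wrong field count is reported whole.
        return fields
    nr, first, last, dob, prog = fields
    bad_data = []
    # Assumed rule: 7 digits, starting with 08 or 09.
    if not (len(nr) == 7 and nr.isdigit() and nr[:2] in ("08", "09")):
        bad_data.append(nr)
    # str.isalpha() is False for empty strings, so empty names are rejected too.
    if not first.isalpha():
        bad_data.append(first)
    if not last.isalpha():
        bad_data.append(last)
    try:
        year = datetime.strptime(dob, "%Y-%m-%d").year
        date_ok = 1960 <= year <= 2004
    except ValueError:
        date_ok = False
    if not date_ok:
        bad_data.append(dob)
    if prog not in PROGRAMS:
        bad_data.append(prog)
    return bad_data

print(validate_data("0873226,Junette,Gurry,1987-12-05,CMD"))  # []
print(validate_data("0773226,Junette,Gur_ry,1995-12-05,"))    # ['0773226', 'Gur_ry', '']
print(validate_data("0795003,Edna,Douce,1957-06-23,INF"))     # ['0795003', '1957-06-23']
```

Note that isalpha is called with parentheses here. In the question it is written as line[1].isalpha, which refers to the method object itself and is therefore always truthy. Whether a last name containing a space (such as "De Lorenzo") should count as valid depends on the assignment's requirements; this sketch rejects it.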
Next, declare valid_lines and corrupt_lines in main and update them in your line loop based on what validate_data returns:
def main(csv_file):
    valid_lines = []
    corrupt_lines = []
    ...
    for line in csv_file:
        line = line.strip()  # modify line to be its stripped self and use that going forward
        bad_data = validate_data(line)
        if bad_data:  # means if bad_data != []
            # str(bad_data) gives the ['a', 'b'] form shown in the required output
            corrupt_lines.append(line + " => INVALID DATA: " + str(bad_data))
        else:
            valid_lines.append(line)
    ...
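Putting the loop together with the printing step, a minimal sketch could look like this. The helper names split_lines, format_report and check_program are hypothetical, and check_program is only a stand-in validator for the demo (it looks at the study program and nothing else). The corrupt lines are built with the list's own string form so they match the format requested in the question.

```python
def split_lines(lines, validate):
    """Sort lines into valid ones and corrupt ones annotated with their bad fields."""
    valid_lines, corrupt_lines = [], []
    for line in lines:
        line = line.strip()
        bad_data = validate(line)
        if bad_data:
            corrupt_lines.append(f"{line} => INVALID DATA: {bad_data}")
        else:
            valid_lines.append(line)
    return valid_lines, corrupt_lines

def format_report(valid_lines, corrupt_lines):
    """Build the two-section report as a single string."""
    return "\n".join(
        ["### VALID LINES ###", *valid_lines,
         "### CORRUPT LINES ###", *corrupt_lines]
    )

def check_program(line):
    # Stand-in validator for this demo only: checks just the last field.
    prog = line.split(",")[-1]
    return [] if prog in {"INF", "TINF", "CMD", "AI"} else [prog]

sample = [
    "0873226,Junette,Gurry,1987-12-05,CMD\n",
    "0803417,Selie,Gunter,1974-01-05,DS\n",
]
valid, corrupt = split_lines(sample, check_program)
print(format_report(valid, corrupt))
# ### VALID LINES ###
# 0873226,Junette,Gurry,1987-12-05,CMD
# ### CORRUPT LINES ###
# 0803417,Selie,Gunter,1974-01-05,DS => INVALID DATA: ['DS']
```

Passing the validator in as an argument keeps the loop free of global state; in the real script you would pass validate_data and iterate over the opened file after skipping the header.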