Formatting invalid data from CSV file Python

For an assignment I need to validate a dataset (CSV file) that I was given. It contains information about students: student number, first name, last name, date of birth, study program. I have already done the checking for valid and invalid data, but for clarity, these are the requirements:

  • Student number has this format: 7 digits, starting with 0, and the second digit (from the left) is either 9 or 8. Example: 0212345 is not valid.
  • First name and last name contain only alphabetic characters.
  • Date of birth has this format: YYYY-MM-DD, with days between 1 and 31, months between 1 and 12, and years between 1960 and 2004.
  • Study program can have one of these values: INF, TINF, CMD, AI.

Then I need to print the valid and corrupt lines in the following format:

### VALID LINES ###
0873226,Junette,Gurry,1987-12-05,CMD
0983960,Leoline,MacCaughen,1993-02-12,TINF

### CORRUPT LINES ###
0773226,Junette,Gur_ry,1995-12-05, => INVALID DATA: ['0773226', 'Gur_ry', '']
0795003,Edna,Douce,1957-06-23,INF => INVALID DATA: ['0795003', '1957-06-23']

Printing the valid lines works just fine. The problem is that I can't seem to print the invalid data from the corrupt lines correctly. I've been trying different things for hours now, but I cannot find a solution that works. Can anyone help me out? My code and a piece of the CSV file are below.

My code:

import os
import sys
from datetime import datetime

valid_lines = []
corrupt_lines = []
tmp = []

n = 1
def validate_data(line):
    global n
    nr = False
    fn = False
    ln = False
    date = False
    prog = False
    line2 = line
    line = line.split(",")
    # Checking if the student number meets the requirements
    try:
      if line[0][0] == "0" and len(line[0]) == 7:
        if line[0][1] == "9" or line[0][1] == "8":
            nr = True
        else:
            pass
      else:
             pass
    except:
            pass
    # Checking if the first name meets the requirements
    try:
        if line[1] == '':
            pass
        elif line[1].isalpha:
            fn = True
            # print(True)
        else:
            pass
    except:
            pass
    # Checking if the last name meets the requirements
    try:
        if line[2] == '':
            pass
        elif line[2].isalpha and line[2] != "123124" and "^" not in line[2]:
            ln = True
        else:
            pass
    except:
            pass
    # Checking if the the date meets the requirements
    format = "%Y-%m-%d"
    try:
        date = bool(datetime.strptime(line[3], format))
        if date == True:
          year, month, day = line[3].split("-")
          if int(year) >=1960 and int(year) <=2004:
             date = True
          else:
              date = False
              pass
        else:
            pass
    except ValueError:
        pass
    # Checking if the study program meets the requirements
    list = ["INF", "TINF", "CMD", "AI"]
    try:
        if line[4] in list:
            prog = True
        else:
            pass
    except ValueError:
          pass
    if nr == True and fn == True and ln == True and date == True and prog == True:
        valid_lines.append(line2)
    else:
        # Trying to create a list with the invalid data.
        corrupt_lines.append(line2)
        if nr == False:
            tmp.append(line[0])
        if fn == False:
            tmp.append(line[1])
        if ln == False:
            tmp.append(line[2])
        if date == False:
            tmp.append(line[3])
        if prog == False:
            tmp.append(line[4])
    return tmp

def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='') as csv_file:
        # skip header line
        next(csv_file)

        for line in csv_file:
            validate_data(line.strip())

    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print(" => INVALID DATA [] \n".join(corrupt_lines))

if __name__ == "__main__":    
    main('students.csv')

And a bit of the CSV file:

studentnumber,firstname,lastname,dateofbirth,studyprogram
0873226,Junette,Gurry,1987-12-05,CMD
0983960,Leoline,MacCaughen,1993-02-12,TINF
0875514,Derrik,Garnson,2007-06-23,CMD
0807295,Christy,Rodwell,1997-09-05,CMD
0844343,Frannie,555,1997-05-08,TINF
0798488,Darbie,Habbijam,1997-10-11,AI
0973065,Glory,McLernon,2007-07-20,AI
0803417,Selie,Gunter,1974-01-05,DS
0963866,Wyatan,Lidgey,1987-08-23,DS
0946101,Rubie,De Lorenzo,1972-01-20,CMD
0834576,Bendite,Jeenes,1974-12-10,DS
0982484,Terra,Eckert,1977-11-22,TINF
0755219,Jacky,Driuzzi,1980-07-27,CMD
0970338,Nariko,Blackley,2006-07-14,DS
0869610,,,,CMD

Thanks in advance, I really appreciate it!

The one big problem is that you're using global variables and having validate_data update them. You can tell which lines are bad, but you cannot tell which parts of each line are bad: tmp is a single list shared by every call, so the bad fields of all corrupt lines pile up in it together, and main never uses the value that validate_data returns.
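
To see why the shared list is a problem, here is a minimal sketch. The helper names collect_global and collect_local are made up for this illustration and are not part of the original code; only tmp and bad_data come from the question and this answer.

```python
tmp = []  # module-level list, shared by every call (as in the question)


def collect_global(field):
    """Mimics the original pattern: append to the shared list and return it."""
    tmp.append(field)
    return tmp


def collect_local(field):
    """Builds a fresh list on every call, so results never mix."""
    bad_data = [field]
    return bad_data


first = collect_global("Gur_ry")
second = collect_global("1957-06-23")
print(second)           # holds both values: the second call still sees the first one
print(first is second)  # True: both names refer to the same shared list
print(collect_local("1957-06-23"))  # holds only the value from this call
```

With the global list, the bad fields of the first line are still there when the second line is validated, which is why they cannot be matched to a specific line afterwards.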

I recommend a decent restructure:

  • Get rid of the global variables and have validate_data return the parts of a line that are bad, if any. You're kind of already doing that, but you're using the global variable tmp, which has unwanted side effects right now.
  • Track the valid and corrupt lines in your main function, using whatever bad data validate_data returns to decide whether a line is bad.

Declare a local variable in validate_data, like bad_data = [], and return that instead of tmp:

...
bad_data = []  # local to validate_data, so every call starts with an empty list
if not nr:
    bad_data.append(line[0])
if not fn:
    bad_data.append(line[1])
if not ln:
    bad_data.append(line[2])
if not date:
    bad_data.append(line[3])
if not prog:
    bad_data.append(line[4])

return bad_data

Now you no longer need to check whether all the parts are True; just check each field and append the invalid ones to bad_data. If bad_data is empty ( [] ) when validate_data returns, that means the line is valid.
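
Putting that together, a complete validate_data could look roughly like the sketch below. It is one possible reading of the requirements in the question, not the only way to do it. The names STUDENT_NUMBER, DATE_SHAPE, PROGRAMS and is_valid_date are my own, and treating a line with the wrong number of columns as entirely bad is an assumption. Note that it calls isalpha() with parentheses; without them, line[1].isalpha is just a method object, which is always truthy. Also note that datetime.strptime rejects dates that do not exist (such as February 30), which is stricter than "day between 1 and 31".

```python
import re
from datetime import datetime

# Assumed helpers, named for this sketch only.
STUDENT_NUMBER = re.compile(r"0[89][0-9]{5}")          # 7 digits: 0, then 8 or 9
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")  # strict YYYY-MM-DD shape
PROGRAMS = {"INF", "TINF", "CMD", "AI"}


def is_valid_date(text):
    """True for a real calendar date in YYYY-MM-DD form with year 1960..2004."""
    if not DATE_SHAPE.fullmatch(text):
        return False
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:  # e.g. month 13 or day 32
        return False
    return 1960 <= parsed.year <= 2004


def validate_data(line):
    """Return the list of invalid fields in a CSV line; [] means the line is valid."""
    fields = line.split(",")
    if len(fields) != 5:
        # Assumption: a line with the wrong number of columns is bad as a whole.
        return fields

    number, first_name, last_name, birth_date, program = fields
    bad_data = []
    if not STUDENT_NUMBER.fullmatch(number):
        bad_data.append(number)
    if not first_name.isalpha():  # "" and names with digits, spaces or "_" fail
        bad_data.append(first_name)
    if not last_name.isalpha():
        bad_data.append(last_name)
    if not is_valid_date(birth_date):
        bad_data.append(birth_date)
    if program not in PROGRAMS:
        bad_data.append(program)
    return bad_data


print(validate_data("0873226,Junette,Gurry,1987-12-05,CMD"))
print(validate_data("0773226,Junette,Gur_ry,1995-12-05,"))
```

Because every check only appends to the local bad_data, there is no need for the separate nr, fn, ln, date and prog flags. A surname such as "De Lorenzo" is reported as invalid here because of the space, which follows the "only alphabet" rule literally.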

Next, declare valid_lines and corrupt_lines in main and update them in your line loop based on what validate_data returns:

def main(csv_file):
    valid_lines = []
    corrupt_lines = []
    
    ...
    
        for line in csv_file:
            line = line.strip()  # strip once and use the stripped line from here on
            bad_data = validate_data(line)
            if bad_data:  # a non-empty list is truthy, so this means bad_data != []
                # formatting the list itself gives the required ['...', '...'] output
                corrupt_lines.append(f"{line} => INVALID DATA: {bad_data}")
            else:
                valid_lines.append(line)
    
    ...
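
For completeness, here is a self-contained sketch of the sorting and printing part. The functions split_lines and print_report, and the stand-in validator check_program_only, are names invented for this example; in your program you would pass your real validate_data instead of the stand-in. An io.StringIO object plays the role of the opened CSV file so the sketch runs without students.csv.

```python
import io


def split_lines(rows, validate):
    """Sort rows into valid lines and formatted corrupt lines."""
    valid_lines = []
    corrupt_lines = []
    for row in rows:
        row = row.strip()
        if not row:  # ignore blank lines, e.g. a trailing newline at end of file
            continue
        bad_data = validate(row)
        if bad_data:
            # Formatting the list directly yields ['a', 'b'] with quotes.
            corrupt_lines.append(f"{row} => INVALID DATA: {bad_data}")
        else:
            valid_lines.append(row)
    return valid_lines, corrupt_lines


def print_report(valid_lines, corrupt_lines):
    """Print both groups under the headers required by the assignment."""
    print("### VALID LINES ###")
    print("\n".join(valid_lines))
    print("### CORRUPT LINES ###")
    print("\n".join(corrupt_lines))


def check_program_only(row):
    """Stand-in validator for this demo: checks only the study program."""
    program = row.split(",")[-1]
    return [] if program in {"INF", "TINF", "CMD", "AI"} else [program]


# Stands in for the opened CSV file.
csv_file = io.StringIO(
    "studentnumber,firstname,lastname,dateofbirth,studyprogram\n"
    "0873226,Junette,Gurry,1987-12-05,CMD\n"
    "0803417,Selie,Gunter,1974-01-05,DS\n"
)
next(csv_file)  # skip header line
valid, corrupt = split_lines(csv_file, check_program_only)
print_report(valid, corrupt)
```

Returning the two lists from a function, instead of appending to globals, keeps each run independent and makes the logic easy to check on a few sample lines.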
