Formatting invalid data from a CSV file in Python

Question

For an assignment I need to validate a dataset (a CSV file). It contains information about students: student number, first name, last name, date of birth, and study program. I have already done the validation itself (checking for valid and invalid data), but for clarity, these are the requirements:

Student number: 7 digits, starting with 0; the second digit (from the left) must be 8 or 9. Example: 0212345 is not valid.
First name and last name: alphabetic characters only.
Date of birth: format YYYY-MM-DD, with days between 1 and 31, months between 1 and 12, and years between 1960 and 2004.
Study program: one of INF, TINF, CMD, AI.

Then I need to print the valid and corrupt lines in the following format:

### VALID LINES ###
0873226,Junette,Gurry,1987-12-05,CMD
0983960,Leoline,MacCaughen,1993-02-12,TINF

### CORRUPT LINES ###
0773226,Junette,Gur_ry,1995-12-05, => INVALID DATA: ['0773226', 'Gur_ry', '']
0795003,Edna,Douce,1957-06-23,INF => INVALID DATA: ['0795003', '1957-06-23']

Printing the valid lines works fine. The problem is that I can't seem to print the invalid data from the corrupt lines correctly. I've been trying different things for hours, but I cannot find a solution that works. Can anyone help me out? My code and a piece of the CSV file are below.

My code:

import os
import sys
from datetime import datetime

valid_lines = []
corrupt_lines = []
tmp = []

n = 1
def validate_data(line):
    global n
    nr = False
    fn = False
    ln = False
    date = False
    prog = False
    line2 = line
    line = line.split(",")
    # Checking if the student number meets the requirements
    try:
      if line[0][0] == "0" and len(line[0]) == 7:
        if line[0][1] == "9" or line[0][1] == "8":
            nr = True
        else:
            pass
      else:
             pass
    except:
            pass
    # Checking if the first name meets the requirements
    try:
        if line[1] == '':
            pass
        elif line[1].isalpha:
            fn = True
            # print(True)
        else:
            pass
    except:
            pass
    # Checking if the last name meets the requirements
    try:
        if line[2] == '':
            pass
        elif line[2].isalpha and line[2] != "123124" and "^" not in line[2]:
            ln = True
        else:
            pass
    except:
            pass
    # Checking if the the date meets the requirements
    format = "%Y-%m-%d"
    try:
        date = bool(datetime.strptime(line[3], format))
        if date == True:
          year, month, day = line[3].split("-")
          if int(year) >=1960 and int(year) <=2004:
             date = True
          else:
              date = False
              pass
        else:
            pass
    except ValueError:
        pass
    # Checking if the study program meets the requirements
    list = ["INF", "TINF", "CMD", "AI"]
    try:
        if line[4] in list:
            prog = True
        else:
            pass
    except ValueError:
          pass
    if nr == True and fn == True and ln == True and date == True and prog == True:
        valid_lines.append(line2)
    else:
        # Trying to create a list with the invalid data.
        corrupt_lines.append(line2)
        if nr == False:
            tmp.append(line[0])
        if fn == False:
            tmp.append(line[1])
        if ln == False:
            tmp.append(line[2])
        if date == False:
            tmp.append(line[3])
        if prog == False:
            tmp.append(line[4])
    return tmp

def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='') as csv_file:
        # skip header line
        next(csv_file)

        for line in csv_file:
            validate_data(line.strip())

    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print(" => INVALID DATA [] \n".join(corrupt_lines))

if __name__ == "__main__":    
    main('students.csv')

And a bit of the CSV file:

studentnumber,firstname,lastname,dateofbirth,studyprogram
0873226,Junette,Gurry,1987-12-05,CMD
0983960,Leoline,MacCaughen,1993-02-12,TINF
0875514,Derrik,Garnson,2007-06-23,CMD
0807295,Christy,Rodwell,1997-09-05,CMD
0844343,Frannie,555,1997-05-08,TINF
0798488,Darbie,Habbijam,1997-10-11,AI
0973065,Glory,McLernon,2007-07-20,AI
0803417,Selie,Gunter,1974-01-05,DS
0963866,Wyatan,Lidgey,1987-08-23,DS
0946101,Rubie,De Lorenzo,1972-01-20,CMD
0834576,Bendite,Jeenes,1974-12-10,DS
0982484,Terra,Eckert,1977-11-22,TINF
0755219,Jacky,Driuzzi,1980-07-27,CMD
0970338,Nariko,Blackley,2006-07-14,DS
0869610,,,,CMD

Thanks in advance, I really appreciate it!

Answer 1

The one big problem is that you're using global variables and having validate_data update them. Because tmp is shared across all calls, the invalid values of every line pile up in a single list: you can know which lines are bad, but you cannot know which parts of each line are bad.

I recommend restructuring the code:

Get rid of the global variables and have validate_data return the parts of a line that are bad, if any. You're kind of already doing that, but you're returning the global variable tmp, which has unwanted side effects.
Track the valid and invalid lines in your main function, using whatever bad data validate_data returns to decide whether a line is bad.

Declare a local variable in validate_data, like bad_data = [], and return that instead of tmp:

...
bad_data = []
if nr == False:
    bad_data.append(line[0])
if fn == False:
    bad_data.append(line[1])
if ln == False:
    bad_data.append(line[2])
if date == False:
    bad_data.append(line[3])
if prog == False:
    bad_data.append(line[4])

return bad_data

Now you don't need to check whether all the parts are True; just check each part and update bad_data accordingly. If bad_data is empty ([]) when validate_data returns, the line is valid.

Next, declare valid_lines and corrupt_lines in main and update them in your line loop based on what validate_data returns:
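To make that concrete, here is a minimal sketch of what validate_data could look like once the flags and the global tmp are gone. It is not your original function: the helper is_valid_date and the constants STUDENT_NUMBER and PROGRAMS are names I made up for illustration, and I swapped the index checks for a regular expression. Also note that it calls isalpha() with parentheses; without them the expression is a bound method, which is always truthy.

```python
import re
from datetime import datetime

# Hypothetical helper names, not from the original code.
STUDENT_NUMBER = re.compile(r"0[89][0-9]{5}")  # 7 digits: 0, then 8 or 9, then 5 more
PROGRAMS = {"INF", "TINF", "CMD", "AI"}


def is_valid_date(text):
    """True if text parses with %Y-%m-%d and the year is from 1960 to 2004."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        # Covers empty strings, wrong formats, and impossible days or months.
        return False
    return 1960 <= parsed.year <= 2004


def validate_data(line):
    """Return the invalid field values of one CSV line (empty list if valid)."""
    fields = line.split(",")
    # Pad short lines so every field can be unpacked without an IndexError.
    fields += [""] * (5 - len(fields))
    number, first, last, birth, program = fields[:5]

    bad_data = []  # local, so nothing leaks from one line to the next
    if not STUDENT_NUMBER.fullmatch(number):
        bad_data.append(number)
    if not first.isalpha():  # an empty string also fails isalpha()
        bad_data.append(first)
    if not last.isalpha():
        bad_data.append(last)
    if not is_valid_date(birth):
        bad_data.append(birth)
    if program not in PROGRAMS:
        bad_data.append(program)
    return bad_data


print(validate_data("0773226,Junette,Gur_ry,1995-12-05,"))
# ['0773226', 'Gur_ry', '']
```

Because each call builds its own list, the result for one line never contains leftovers from an earlier line.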

def main(csv_file):
    valid_lines = []
    corrupt_lines = []
    
    ...
    
        for line in csv_file:
            line = line.strip()  # modify line to be its stripped self and use that going forward
            bad_data = validate_data(line)
            if bad_data:  # means if bad_data != []
                corrupt_lines.append(f"{line} => INVALID DATA: {bad_data}")  # formatting the list itself gives the ['a', 'b'] form
            else:
                valid_lines.append(line)
    
    ...
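Put together, the loop might look like the sketch below. To keep it runnable on its own, it takes the lines and the validator as arguments and uses a deliberately tiny stand-in validator; sort_lines and only_checks_program are hypothetical names, and in your program you would pass your real validate_data and the open file instead. The required output shows the invalid values as a Python list, so the sketch formats the list directly rather than joining its items.

```python
def sort_lines(lines, validate):
    """Split lines into valid ones and annotated corrupt ones."""
    valid_lines = []
    corrupt_lines = []
    for line in lines:
        line = line.strip()
        bad_data = validate(line)
        if bad_data:  # a non-empty list means at least one field is invalid
            corrupt_lines.append(f"{line} => INVALID DATA: {bad_data}")
        else:
            valid_lines.append(line)
    return valid_lines, corrupt_lines


def only_checks_program(line):
    """Stand-in validator for this demo: flags only an unknown study program."""
    program = line.split(",")[-1]
    return [] if program in {"INF", "TINF", "CMD", "AI"} else [program]


sample = [
    "0873226,Junette,Gurry,1987-12-05,CMD\n",
    "0803417,Selie,Gunter,1974-01-05,DS\n",
]
valid, corrupt = sort_lines(sample, only_checks_program)

print("### VALID LINES ###")
print("\n".join(valid))
print("### CORRUPT LINES ###")
print("\n".join(corrupt))
```

Each corrupt line carries its own list of bad values, so the final print is a plain join with no extra bookkeeping.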
