
pd.DataFrame with scalar values

I want to delete some rows from a CSV file by saving a new CSV after a validation process. I wrote the code below but it causes an error.

with open(path_to_read_csv_file, "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for line in csv_reader:
        # if validation(line[specific_column]):
            try:
                df = pd.DataFrame(line)
                df.to_csv(path_to_save_csv_file)

            except Exception as e:
                print('Something Happened!')
                print(e)
                continue

Error:

Something Happened!
If using all scalar values, you must pass an index

I've also tried adding an index with df = pd.DataFrame(line, index=[0]), but it only stores the first line, with an additional empty column at the beginning. How can I solve this?

Another version that reads the file as plain lines works, but then I cannot access a specific column's value on each line:

inFile = open(path_to_read_csv_file, 'r')
outFile = open(path_to_save_csv_file, 'w')

for line in inFile:
    try:
        print('Analysing:', line)

        # HERE, how can I get the specific column value? I used to use line[specific_column] in the last version
        if validation(line[specific_column]):
            outFile.write(line)
        else:
            continue

    except Exception as e:
        print('Something Happened!')
        print(e)
        continue

outFile.close()
inFile.close()

Basically, you cannot create a DataFrame from scalar values only. They have to be wrapped in a container, e.g. a list.
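To make the error concrete, here is a minimal sketch. The dictionary `row` and its header names are made up for illustration; it only assumes the row has the shape that csv.DictReader yields (string keys, string values).

```python
import pandas as pd

# Hypothetical row, shaped like what csv.DictReader yields.
row = {"Header1": "1", "Header2": "2", "Header3": "3"}

# All values are scalars, so pandas cannot tell how many rows to create.
try:
    pd.DataFrame(row)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Wrapping the dict in a list makes it one record, i.e. one row.
df_from_list = pd.DataFrame([row])

# Passing an explicit index also yields a single row.
df_with_index = pd.DataFrame(row, index=[0])

print(error_message)
print(df_from_list)
```

Either form gives a one-row frame whose columns are the dictionary keys.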

The pd.DataFrame constructor also needs to know how the data you provide should be indexed. This is described in the pandas documentation for the DataFrame constructor.

When no fieldnames argument is given, csv.DictReader takes its keys from the header row. In the words of the csv documentation:

the values in the first row of file f will be used as the fieldnames.

For more information, refer to the csv documentation.

Hence, each line yielded by csv_reader is a dictionary whose keys are the CSV headers and whose values are the fields of that particular row.

So for example, if my CSV is:

Header1,Header2,Header3
1,2,3
11,11,33

Then in the first iteration, the line object would be:

{'Header1': '1', 'Header2': '2', 'Header3': '3'}
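This can be checked without touching the file system. The sketch below feeds the same example data (written without spaces after the commas) to csv.DictReader through io.StringIO; the variable names are illustrative only.

```python
import csv
import io

# In-memory stand-in for the example CSV file.
csv_text = "Header1,Header2,Header3\n1,2,3\n11,11,33\n"

reader = csv.DictReader(io.StringIO(csv_text), delimiter=",")
rows = list(reader)  # each element is a dict keyed by the header names

print(reader.fieldnames)
print(rows[0])
```

Note that every value is a string, so a numeric validation has to convert it first.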

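Now, when you supply this to pd.DataFrame, each dictionary has to become one row, with the keys ['Header1', 'Header2', 'Header3'] as column names and the values ['1', '2', '3'] as the data. Wrapping the dictionary in a list does exactly that: pd.DataFrame([line]) gives a one-row frame. Note that pd.DataFrame(line.values(), line.keys()) is not equivalent, because the second positional argument is index, so the headers would become the row labels of a single column.

Two more things explain what you saw. Calling df.to_csv inside the loop overwrites the output file on every iteration, so only a single line ends up in it. And the extra empty column at the beginning is the DataFrame index, which index=False suppresses.

This is the change I have made: collect the rows that pass validation, then build the DataFrame and write it once.

import csv
import pandas as pd

valid_rows = []
with open(path_to_read_csv_file, "r", newline="") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for line in csv_reader:
        try:
            # validation ...
            valid_rows.append(line)  # keep the whole row dictionary

        except Exception as e:
            print('Something Happened!')
            print(e)
            continue

# A list of dictionaries becomes one row per dictionary.
df = pd.DataFrame(valid_rows)
df.to_csv(path_to_save_csv_file, index=False)  # index=False drops the extra first column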
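If pandas is not needed for anything else, the second version from the question (writing lines straight to an output file) can keep access to individual columns by pairing csv.DictReader with csv.DictWriter. This is only a sketch under stated assumptions: the `validation` rule, the `filter_rows` helper, and the column name are hypothetical stand-ins, and io.StringIO replaces the real files so the example is self-contained.

```python
import csv
import io


def validation(value):
    # Hypothetical rule: keep rows whose value is an integer greater than 5.
    return int(value) > 5


def filter_rows(in_file, out_file, column):
    """Copy rows whose `column` value passes validation; return how many were kept."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        try:
            if validation(row[column]):
                writer.writerow(row)
                kept += 1
        except (KeyError, ValueError) as exc:
            # Missing column or non-numeric value: report and move on.
            print("Skipping row:", row, exc)
    return kept


source = io.StringIO("Header1,Header2,Header3\n1,2,3\n11,11,33\n")
target = io.StringIO()
kept = filter_rows(source, target, "Header1")
print(kept)
print(target.getvalue())
```

With real files, the csv documentation recommends opening both with newline="" so that line endings are handled by the csv module.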

