
How can I read the maximum number of lines in a CSV file?

I have a Python script that reads a bunch of CSV files and creates a new CSV file containing the last line of each of the files read. The script is this:

    import pandas as pd
    import glob
    import os

    path = r'Directory of the files read\*common_file_name_part.csv'
    r_path = r'Directory where the resulting file is saved.'
    if os.path.exists(r_path + 'csv'):
        os.remove(r_path + 'csv')
    if os.path.exists(r_path + 'txt'):
        os.remove(r_path + 'txt')

    files = glob.glob(path)
    column_list = [str(i + 1) for i in range(44)]

    df = pd.DataFrame(columns=column_list)
    for name in files:
        df_n = pd.read_csv(name, names=column_list)
        # DataFrame.append was removed in pandas 2.0; concatenate the last row instead
        df = pd.concat([df, df_n.iloc[[-1]]], ignore_index=True)
        del df_n

    df.to_csv(r_path + 'csv', index=False, header=False)
    del df

The files all share a common name ending and have a unique name beginning. The resulting file has no extension so I can run some checks on it. My problem is that the files have a variable number of lines and columns, even within the same file, and I can't read them properly. If I don't specify the column names, the program treats the first line as the column names, which causes a lot of columns to be lost from some of the files. I've also tried reading the files without headers, by writing:

    df = pd.read_csv(r_path, header=None)

but it doesn't seem to work. I wanted to upload some files as an example, but I don't know how. If someone tells me how, I'll be happy to do it.
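A minimal in-memory sketch of the kind of failure described above (made-up data standing in for the real files): with the default C engine, pandas infers the field count from the first line, so a wider later row makes the read fail outright.

```python
import io
import pandas as pd

# Made-up stand-in for one ragged file: the first row is shorter
# than a later row.
raw = "1,2\n3,4,5,6\n"

# pandas infers the number of columns from the first line, so the
# longer second row raises a ParserError instead of widening the frame.
raised = False
try:
    pd.read_csv(io.StringIO(raw), header=None)
except pd.errors.ParserError:
    raised = True
print("ParserError raised:", raised)
```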

You can preprocess your files to pad the shorter rows up to the maximum number of columns. Ref: Python csv; get max length of all columns then lengthen all other columns to that length
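A sketch of that preprocessing step, using in-memory data to stand in for one of your files (adjust the reading and writing to your real paths): find the widest row, then pad every shorter row with empty strings.

```python
import csv
import io

# In-memory stand-in for one ragged CSV file.
raw = "a,b,c\n1,2\n3,4,5,6\n"
rows = list(csv.reader(io.StringIO(raw)))

# Find the widest row, then pad every shorter row with empty
# strings so all rows end up with the same number of columns.
max_len = max(len(row) for row in rows)
padded = [row + [""] * (max_len - len(row)) for row in rows]

out = io.StringIO()
csv.writer(out).writerows(padded)
print(out.getvalue())
```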

You can also use the sep argument, or, if pandas still fails to read your CSV correctly, read the file as fixed-width. See the answers to this SO question: Read CSV into a dataFrame with varying row lengths using Pandas
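For instance, one approach from that question is to pass an explicit list of column labels at least as wide as the widest possible row. A sketch with in-memory data and an assumed maximum of 6 columns:

```python
import io
import pandas as pd

# In-memory stand-in for a ragged file; here we assume no row is
# wider than 6 columns.
raw = "1,2\n3,4,5,6\n"

# Supplying `names` wider than any row stops pandas from inferring
# the column count from the first line; short rows are padded with NaN.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(6))
print(df.shape)  # → (2, 6)
```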

It looks like you actually have two problems:

  1. getting a complete list of all the columns in all of the files

  2. reading the last line from each file and merging into the correct columns

To solve this, the standard Python csv module makes more sense than pandas.

I will assume you have already identified the list of files you need and stored it in your files variable.

First, get all the headers:

import csv

# Use a set to eliminate duplicates
headers = set()

# Read the header from each file
for file in files:
    with open(file) as f:
        reader = csv.reader(f)

        # Read the first line as this will be the header
        header = next(reader)

        # Update the set with the list of headers
        headers.update(header)

print("Headers:", headers)

Now read the last lines and write them to the result file:

Use a DictReader and a DictWriter, which map each row to and from a dict keyed by the header.

with open(r_path, "w", newline="") as f_out:
    # extrasaction="ignore" silently drops any keys not in fieldnames;
    # columns missing from a row are filled with an empty string by default.
    # Sort the headers so the output column order is deterministic.
    writer = csv.DictWriter(f_out, fieldnames=sorted(headers), extrasaction="ignore")
    writer.writeheader()

    # Read the last line of each file
    for file in files:
        with open(file) as f_in:
            reader = csv.DictReader(f_in)

            # Read through all rows, keeping only the last one
            row = None
            for row in reader:
                pass

            # Write the last row into the result file
            # (skip files that contain no data rows)
            if row is not None:
                writer.writerow(row)
