I have a Python script that reads a number of CSV files and creates a new CSV file containing the last line of each file read. The script is this:
import pandas as pd
import glob
import os

path = r'Directory of the files read\*common_file_name_part.csv'
r_path = r'Directory where the resulting file is saved.'

if os.path.exists(r_path + 'csv'):
    os.remove(r_path + 'csv')
if os.path.exists(r_path + 'txt'):
    os.remove(r_path + 'txt')

files = glob.glob(path)
column_list = [str(i + 1) for i in range(44)]

df = pd.DataFrame(columns=column_list)
for name in files:
    df_n = pd.read_csv(name, names=column_list)
    df = df.append(df_n.iloc[-1], ignore_index=True)
    del df_n

df.to_csv(r_path + 'csv', index=False, header=False)
del df
The files all share a common name ending and a unique name beginning. The resulting file deliberately has no extension so I can do some checks on it. My problem is that the files have a variable number of lines and columns, even within the same file, and I can't read them properly. If I don't specify the column names, pandas treats the first line as the header, which causes a lot of columns to be lost from some of the files. I've also tried reading the files without headers, by writing:

df = pd.read_csv(r_path, header=None)

but it doesn't seem to work. I wanted to upload some files as an example, but I don't know how; if someone tells me, I'll be happy to do it.
You can preprocess your files to pad rows that have fewer than the maximum number of columns. Ref: Python csv; get max length of all columns then lengthen all other columns to that length
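A minimal sketch of that preprocessing, assuming the files are plain comma-separated text (`pad_csv` is a hypothetical helper name, not from the question):

```python
import csv

def pad_csv(in_path, out_path):
    # Read all rows, find the widest one, then pad shorter rows
    # with empty strings so every row has the same column count.
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    width = max(len(row) for row in rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row + [""] * (width - len(row)))
```

After padding, every file can be read with a fixed number of columns.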
You can also use the sep argument, or, if pandas still fails to read your CSV correctly, read the file as fixed width. See the answers to this SO question: Read CSV into a dataFrame with varying row lengths using Pandas
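For illustration, if you know an upper bound on the column count, passing a names list longer than any row stops pandas from promoting the first line to a header and pads short rows with NaN instead of dropping columns (a sketch using inline sample data, not your actual files):

```python
import io
import pandas as pd

# Ragged sample data: rows of 3, 2, and 4 fields.
data = "a,b,c\n1,2\n3,4,5,6\n"

# names longer than the widest row keeps every row as data
# and fills the missing trailing fields with NaN.
df = pd.read_csv(io.StringIO(data), names=range(6))
print(df.shape)  # (3, 6)
```

In the question's setup the same idea is the `names=column_list` call, with the list made at least as long as the widest row in any file.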
It looks like you actually have two problems:

1. getting a complete list of all the columns in all of the files
2. reading the last line from each file and merging it into the correct columns

To solve this, the standard Python csv module makes more sense than pandas. I will assume you have already identified the list of files you need and that it is in your files variable.
First, get all the headers:
import csv

# Use a set to eliminate duplicates
headers = set()

# Read the header from each file
for file in files:
    with open(file) as f:
        reader = csv.reader(f)
        # Read the first line, as this will be the header
        header = next(reader)
        # Update the set with the list of headers
        headers.update(header)

print("Headers:", headers)
Now read the last lines and write them to the result file. Use a csv.DictReader and a csv.DictWriter, which work with dicts mapped to the header:
with open(r_path, "w", newline="") as f_out:
    # Sort the headers so the column order is deterministic
    # (sets are unordered). The option extrasaction="ignore"
    # allows not all columns to be provided when calling writerow.
    writer = csv.DictWriter(f_out, fieldnames=sorted(headers), extrasaction="ignore")
    writer.writeheader()
    # Read the last line of each file
    for file in files:
        with open(file) as f_in:
            reader = csv.DictReader(f_in)
            # Read through all rows, keeping only the last one
            for row in reader:
                pass
            # Write the last row into the result file
            writer.writerow(row)