I'd like to grab only the data in the 4th column from all my CSV files and write it into a single file. Each file's 4th column has a unique header that combines the root folder name and the CSV name, like FolderA1.
FolderA/
  1.csv | INFO  INFO  INFO  FolderA1  INFO
          Apple Apple Apple Orange    Apple
  2.csv | INFO  INFO  INFO  FolderA2  INFO
          Apple Apple Apple Cracker   Apple
  3.csv | INFO  INFO  INFO  FOLDERA3  INFO
          Apple Apple Apple Orange    Apple
How could I get only the 4th-column data into a single .xlsx file, and have each folder's CSVs put on a new sheet, or otherwise separated from the previous folder's CSVs?
concentrated.xlsx | FOLDERA1 FOLDERA2 FOLDERA3 FOLDERB1 FOLDERB2 FOLDERB3
                    ORANGE   CRACKER  ORANGE   ORANGE   CRACKER  ORANGE
I would use the usecols parameter that pandas.read_csv comes with.
import pandas as pd

def read_4th(fn):
    # Read only the 4th column (0-indexed 3) of a whitespace-delimited file
    return pd.read_csv(fn, sep=r'\s+', usecols=[3])

files = ['./1.csv', './2.csv', './3.csv']
big_df = pd.concat([read_4th(fn) for fn in files], axis=1)
big_df.to_excel('./mybigdf.xlsx')
For multiple folders, use glob. Suppose you have two folders, 'FolderA' and 'FolderB', both located in './', and you want all CSV files in both:
from glob import glob
files = glob('./*/*.csv')
Then run the rest as shown above.
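The question also asks for each folder's data on its own sheet. A minimal sketch of that, grouping files by their parent folder and writing one sheet per folder with pandas' ExcelWriter (the paths, pattern, and function names here are illustrative assumptions):

```python
from glob import glob
from pathlib import Path
import pandas as pd

def fourth_columns_by_folder(root='.'):
    """Collect the 4th column of every CSV, grouped by parent folder.

    Returns a dict mapping folder name -> DataFrame whose columns are
    the 4th columns of that folder's CSV files. Assumes a layout like
    root/FolderA/1.csv with whitespace-delimited columns, as in the
    question; adjust the pattern and separator for your real data.
    """
    grouped = {}
    for path in sorted(glob(f'{root}/*/*.csv')):
        folder = Path(path).parent.name
        col = pd.read_csv(path, sep=r'\s+', usecols=[3])
        grouped.setdefault(folder, []).append(col)
    return {name: pd.concat(cols, axis=1) for name, cols in grouped.items()}

def write_workbook(sheets, out='concentrated.xlsx'):
    # One sheet per folder (requires an Excel engine such as openpyxl)
    with pd.ExcelWriter(out) as writer:
        for name, df in sheets.items():
            df.to_excel(writer, sheet_name=name, index=False)

# Usage sketch:
# write_workbook(fourth_columns_by_folder('.'))
```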
Other answers have suggested using pandas, and that will certainly work, but if you are looking for a solution using only the Python standard library, you might try the csv module together with iterators.
The caveat here is that, depending on the number of files you need to concatenate, you might run into memory constraints. But setting that aside, here is one approach.
import csv
from glob import glob
from itertools import zip_longest

# Use glob to get all CSV files. Adjust the pattern to your needs.
input_files = [open(file_path, newline='') for file_path in glob('*.csv')]

# Wrap every file in a csv.reader instance.
input_readers = [csv.reader(input_file) for input_file in input_files]

with open('output.csv', 'w', newline='') as output_file:
    output_writer = csv.writer(output_file)

    # zip_longest returns a tuple of the next value for all the
    # iterables passed as parameters -- here, the next row from
    # every reader, padding with None once a shorter file runs out.
    for rows in zip_longest(*input_readers):
        # Extract the fourth column from all the rows.
        # Note that this presumes that all files have a fourth column.
        # Some error checking/handling might be required if
        # you are not sure that's the case.
        fourth_columns = [row[3] for row in rows if row is not None]

        # Write to the output the row made of the fourth
        # columns from all the readers.
        output_writer.writerow(fourth_columns)

# Clean up the opened files.
for f in input_files:
    f.close()
Because csv.reader yields rows lazily, this processes the files one row at a time, minimizing the amount of data loaded in memory at once while maintaining a very Pythonic approach to the problem.
Using the glob module can make it easier to load multiple files with a known pattern, which seems to be your case. Feel free to replace it with some other form of file lookup, such as os.walk, if that's a better fit.
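For completeness, here is a small sketch of the os.walk alternative mentioned above; unlike a single-level glob pattern, it descends into subdirectories of any depth (the function name is an assumption for illustration):

```python
import os

def find_csvs(root='.'):
    # Recursively collect paths of all .csv files under root,
    # walking every subdirectory regardless of nesting depth.
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.csv'):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```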
Something like this should work:
import pandas as pd

input_file_paths = ['1.csv', '2.csv', '3.csv']
# sep=r'\s+' because the example files are whitespace-delimited
dfs = (pd.read_csv(fname, sep=r'\s+') for fname in input_file_paths)
master_df = pd.concat(
    (df[[c for c in df.columns if c.lower().startswith('folder')]]
     for df in dfs), axis=1)
master_df.to_excel('smth.xlsx')
The df[[c for c in df.columns if c.lower().startswith('folder')]] line is needed because your example formats the folder column headers inconsistently (FolderA1 vs. FOLDERA3), so a case-insensitive match selects them all.