
Grab one specific column from multiple csv files and merge into one

I'd like to grab only the data in the 4th column from all my CSV files and write it into a single file. Each 4th column has a unique header name made of the root folder name plus the CSV number, like FolderA1

FolderA /

1.csv |INFO  INFO  INFO  FolderA1  INFO
       Apple Apple Apple Orange    Apple

2.csv |INFO  INFO  INFO  FolderA2 INFO
       Apple Apple Apple Cracker  Apple

3.csv |INFO  INFO  INFO  FOLDERA3 INFO
       Apple Apple Apple Orange  Apple

How could I get only the 4th column's data filtered into a single .xlsx file, and have the next folder's CSVs put on a new sheet or otherwise separated from the previous folder's CSVs?

concentrated.xlsx | FOLDERA1 FOLDERA2 FOLDERA3   FOLDERB1 FOLDERB2 FOLDERB3
                    ORANGE   CRACKER   ORANGE    ORANGE   CRACKER  ORANGE

I would use the usecols parameter that pandas.read_csv comes with.

import pandas as pd

def read_4th(fn):
    # sep=r'\s+' splits on runs of whitespace
    # (the older delim_whitespace parameter is deprecated)
    return pd.read_csv(fn, sep=r'\s+', usecols=[3])

files = ['./1.csv', './2.csv', './3.csv']

big_df = pd.concat([read_4th(fn) for fn in files], axis=1)

big_df.to_excel('./mybigdf.xlsx')

For multiple folders, use glob.

Suppose you have two folders, 'FolderA' and 'FolderB', both located in './', and you want all the CSV files in both.

from glob import glob

files = glob('./*/*.csv')

then run the rest as specified above.
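The question also asks for each folder's data on its own sheet. A minimal sketch of one way to do that with pd.ExcelWriter, assuming the same whitespace-delimited layout as above (the sample tree built here is hypothetical, and the final writing step needs an Excel engine such as openpyxl installed):

```python
import tempfile
from pathlib import Path

import pandas as pd

def fourth_columns_by_folder(root):
    """Group CSVs by parent folder; keep only each file's 4th column."""
    groups = {}
    for csv_path in sorted(Path(root).glob('*/*.csv')):
        df = pd.read_csv(csv_path, sep=r'\s+', usecols=[3])
        groups.setdefault(csv_path.parent.name, []).append(df)
    # One wide frame per folder: the 4th columns side by side
    return {folder: pd.concat(dfs, axis=1) for folder, dfs in groups.items()}

# Build a small sample tree like the one in the question (hypothetical data)
root = Path(tempfile.mkdtemp())
(root / 'FolderA').mkdir()
(root / 'FolderA' / '1.csv').write_text(
    'INFO INFO INFO FolderA1 INFO\nApple Apple Apple Orange Apple\n')
(root / 'FolderA' / '2.csv').write_text(
    'INFO INFO INFO FolderA2 INFO\nApple Apple Apple Cracker Apple\n')

sheets = fourth_columns_by_folder(root)

# One sheet per folder in a single workbook
try:
    with pd.ExcelWriter(root / 'concentrated.xlsx') as writer:
        for folder, df in sheets.items():
            df.to_excel(writer, sheet_name=folder, index=False)
except ImportError:
    pass  # no Excel engine installed; the grouping above still works
```

Each folder name becomes a sheet name, so FolderB's data ends up on a separate sheet from FolderA's.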

Other answers have suggested using pandas as an option, and that will certainly work, but if you are looking for a solution using only the Python standard library, you might try the csv module and iterators.

The caveat here is that, depending on the number of files you need to concatenate, you might run into memory constraints. But setting that aside, here is one approach.

Basic Python Library

import csv
from glob import glob
from itertools import zip_longest

# Use glob to get all CSV files. Adjust the pattern according to your needs.
# A list (not a generator) so the handles can be closed at the end.
input_files = [open(file_path, newline='') for file_path in glob('*.csv')]

# Wrap all the CSV files in reader instances
input_readers = [csv.reader(input_file) for input_file in input_files]

with open('output.csv', 'w', newline='') as output_file:
    output_writer = csv.writer(output_file)

    # zip_longest returns a tuple of the next value
    # from each of the iterables passed as parameters --
    # in this case, the next row from each of the input_readers
    for rows in zip_longest(*input_readers):

        # Extract the fourth column from all the rows.
        # Note that this presumes that all files have a fourth column
        # and the same number of rows. Some error checking/handling
        # is required if you are not sure that's the case.
        fourth_columns = [row[3] for row in rows]

        # Write to the output the row made up of the
        # fourth columns from all the readers
        output_writer.writerow(fourth_columns)

# Clean up the opened files
for input_file in input_files:
    input_file.close()

Because the csv readers stream rows one at a time, you minimize the amount of data loaded into memory at once, while maintaining a very Pythonic approach to the problem.

Using the glob module can make it easier to load multiple files with a known pattern, which seems to be your case. Feel free to replace it with some other form of file lookup, such as os.walk, if that's a better fit.
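For reference, a minimal os.walk-based lookup that collects CSV paths recursively, roughly equivalent to glob('**/*.csv', recursive=True) (the demo tree at the bottom is hypothetical):

```python
import os
import tempfile

def find_csvs(root):
    """Recursively collect the paths of all .csv files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.csv'):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Hypothetical demo tree with one CSV in a subfolder
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, 'FolderA'))
open(os.path.join(demo, 'FolderA', '1.csv'), 'w').close()
found = find_csvs(demo)
```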

Something like this should work:

import pandas as pd

input_file_paths = ['1.csv', '2.csv', '3.csv']

dfs = (pd.read_csv(fname) for fname in input_file_paths)

master_df = pd.concat(
    (df[[c for c in df.columns if c.lower().startswith('folder')]]
        for df in dfs), axis=1)

master_df.to_excel('smth.xlsx')

The df[[c for c in df.columns if c.lower().startswith('folder')]] line is needed because your example formats the folder column's header inconsistently (FolderA1 vs. FOLDERA3), so the match is done case-insensitively.
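If matching headers by name feels fragile, selecting the column by position is an alternative. A small sketch using io.StringIO in place of a real file (column 4 is integer position 3):

```python
import io

import pandas as pd

# Stand-in for one of the CSV files (hypothetical comma-delimited data)
sample = io.StringIO('a,b,c,FolderA1,e\nApple,Apple,Apple,Orange,Apple\n')
df = pd.read_csv(sample)

# iloc selects by integer position, regardless of the header's spelling;
# the [3] (rather than 3) keeps the result a one-column DataFrame
fourth = df.iloc[:, [3]]
```

The resulting one-column frames can then be concatenated with pd.concat(..., axis=1) exactly as above.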
