
Best tool to combine CSV files with different structure?

I have multiple large CSV files whose columns differ slightly. To feed them to AWS QuickSight for data visualization, I want to unify their structure. I see two ways to do this:

  • Add the missing columns to each CSV file so all of them look the same
  • Combine all the CSV files into one large file

What is the best tool for doing this?

Is there a tool that can show the structural differences between two CSV files? If I knew which columns were missing, I could also add them manually.

With pandas I can combine the CSV files, but the only way I know requires me to name all the columns (code below), which is not practical.

import pandas as pd

df1 = pd.DataFrame({'column1': [1,2],
                    'column2': [3,4],
                    })

df2 = pd.DataFrame({'column1': [5,6],
                    'column3': [7,8],
                    })
pd.concat([df1,df2],ignore_index=True)

Result:

   column1  column2  column3
0        1      3.0      NaN
1        2      4.0      NaN
2        5      NaN      7.0
3        6      NaN      8.0

I cannot tell you what "the best" tool is; that is subjective and depends on many factors.

I can tell you that Miller should probably be on your short list of tools to consider for working with CSV data. Also see the Miller GitHub site. One last thing: the author is super-helpful.

I have it on good authority that the following will do the job:

mlr --csv reshape -r "^A" -o item,value then reshape -s item,value \
  then unsparsify --fill-with "" *.csv > result.csv

Some notes about the command:

  • reshape -r "^A" -o item,value transforms the input CSVs from wide to long, applying this to all fields whose name begins with "A";
  • reshape -s item,value transforms the previous output from long back to wide;
  • unsparsify --fill-with "" unifies the field names across all input records: a field name absent from a given record but present in others is filled in with the value "".
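
For readers without Miller installed, the fill-in behaviour described in the last bullet can be approximated with Python's standard csv module. This is only a sketch of the idea, not Miller's implementation: the function name unsparsify_csv and its parameters are made up for illustration, and it assumes every input file has a header row and that no data row is longer than its header.

```python
import csv


def unsparsify_csv(input_paths, output_path, fill_value=""):
    """Write all rows from input_paths to output_path under one unified header.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Pass 1: collect the union of column names, in first-seen order.
    fieldnames = []
    seen = set()
    for path in input_paths:
        with open(path, newline="") as handle:
            header = next(csv.reader(handle), [])
        for name in header:
            if name not in seen:
                seen.add(name)
                fieldnames.append(name)

    # Pass 2: stream the rows out one at a time, so large files are never
    # held in memory. restval fills the columns a given file does not have.
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval=fill_value)
        writer.writeheader()
        for path in input_paths:
            with open(path, newline="") as handle:
                for row in csv.DictReader(handle):
                    writer.writerow(row)
    return fieldnames
```

It could be called as unsparsify_csv(sorted(glob.glob("*.csv")), output_path), with output_path pointing somewhere the input pattern does not match, so a second run does not read its own output.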

As much as I enjoyed using Miller to solve this problem, I also enjoyed using pandas to combine the CSV files. Here is the Python code to do it:

import pandas as pd
import glob
import os

path = os.getcwd() # use your path
# collect the names of all CSV files in that directory
all_files = glob.glob(os.path.join(path , "*.csv"))

li = []

# read csv files and create a DataFrame list
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

# join all the DataFrames
result_frame = pd.concat(li, ignore_index=True)

# export the result to a csv file; index=False keeps the row index
# from being written as an extra unnamed column
result_frame.to_csv('endresult.csv', index=False)
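
The question also asks how to see which columns are missing from which file. Only the header line of each file is needed for that, so a standard-library sketch is enough. The helper names read_header and missing_columns are hypothetical, and the code assumes the first line of every file is its header.

```python
import csv


def read_header(path):
    """Return the column names from the first line of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


def missing_columns(paths):
    """Map each path to the sorted list of columns it lacks.

    A column counts as missing when at least one other file has it.
    """
    headers = {path: read_header(path) for path in paths}
    all_columns = set().union(*headers.values())
    return {
        path: sorted(all_columns - set(columns))
        for path, columns in headers.items()
    }
```

Printing the result of missing_columns(all_files) before concatenating shows which columns pd.concat will fill with NaN for each file.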
