
Pandas: how to merge horizontally multiple CSV (key,value) files and name `value` columns in the resulting DF using filenames

I have 16 different CSV files in one directory and I am trying to load them into one pandas DataFrame. Each file has a datetime column and a float64 column. None of the CSV files have column headers.

location = os.path.join(base_dir, "DirectoryName")
symbols = os.listdir(location)
df = pd.DataFrame(index=dates)
for symbol in symbols:
    location = os.path.join(base_dir, "DirectoryName", symbol)
    df_temp = pd.read_csv(location, index_col=0, parse_dates=True, dayfirst=True, na_values=['nan'])
    df_temp.dropna()
    df_temp.index = df_temp.index.normalize()
    df_temp = normalize_data(df_temp)
    df = df.join(df_temp)

The problem I have now is that the final DataFrame df has datetime as its index, but row values from the files appear as its column names, and the first row is filled with NaN.

Here is a snapshot; note the row values for 2015-04-02.

I could remove the first row of df, but that won't help much with other operations, as some data would be missing. I couldn't rename the column headers because they are different for each file and I only know how to rename them statically.

I've downloaded just the following files:

['hash_rate.csv',
 'difficulty.csv',
 'cost_per_tx.csv',
 'block_size.csv',
 'avg_block_size.csv']

That's why you will see just a corresponding part of your data in the resulting DF.

Please find comments in the code.

Code:

import os
import glob
import pandas as pd

def read_files(filelist):
    # `dfs` - will contain a list of DFs
    # that will be concatenated later on
    dfs = []
    for fn in filelist:
        # parse column name from filename
        col = os.path.splitext(os.path.split(fn)[-1])[0]
        # read the individual CSV into a temporary DF, naming the value
        # column after the file, and append that DF to the `dfs` list
        dfs.append(pd.read_csv(
                        fn,
                        parse_dates=[0],
                        header=None,
                        index_col='date',
                        names=['date', col]
                   )
        )
    # return concatenated horizontally (axis=1) DF
    return pd.concat(dfs, axis=1)

def main():
    data_files_mask = r'D:\temp\.data\36827502\*.csv'
    df = read_files(glob.glob(data_files_mask))
    print(df)

if __name__ == '__main__':
    main()

Output:

                     block_size     hash_rate  avg_block_size  cost_per_tx  \
date
2015-01-05 18:15:05     34469.0  3.479099e+08        0.375637     8.185000
2015-01-06 18:15:05     36219.0  3.323940e+08        0.477130     6.598278
2015-01-07 18:15:05     38212.0  3.560892e+08        0.624724     6.232809
2015-01-08 18:15:05     40943.0  4.261981e+08        0.754424     7.113695
2015-01-09 18:15:05     43021.0  4.099610e+08        0.515467     6.199964
2015-01-10 18:15:05     45487.0  4.655484e+08        0.451940     6.821970
2015-01-11 18:15:05     47963.0  4.920513e+08        0.535354     7.958116
2015-01-12 18:15:05     50594.0  6.940933e+08        0.536199     9.415383
2015-02-04 18:15:05     32832.0  3.413843e+08        0.421406     8.054181
2015-02-05 18:15:05     34523.0  3.479099e+08        0.373642     8.958115

                       difficulty
date
2015-01-05 18:15:05  4.761056e+10
2015-01-06 18:15:05  4.880749e+10
2015-01-07 18:15:05  4.940201e+10
2015-01-08 18:15:05  5.227830e+10
2015-01-09 18:15:05  5.425663e+10
2015-01-10 18:15:05  6.081322e+10
2015-01-11 18:15:05  6.225398e+10
2015-01-12 18:15:05  7.272278e+10
2015-02-04 18:15:05  4.671755e+10
2015-02-05 18:15:05  4.761056e+10
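
The same approach can be tried without touching the disk. Below is a minimal sketch, assuming two small made-up CSV snippets held in memory (the paths and values are hypothetical, not taken from the downloaded files). It derives each column name with pathlib.Path.stem, which should give the same result as the os.path.splitext call above, and it shows that pd.concat(axis=1) aligns the frames on the date index, leaving NaN where a file has no row for a given date.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical in-memory stand-ins for headerless (date, value) CSV files.
SAMPLE_FILES = {
    "data/hash_rate.csv": "2015-01-05,1.5\n2015-01-06,2.5\n",
    "data/difficulty.csv": "2015-01-06,10.0\n2015-01-07,20.0\n",
}


def read_one(name, text):
    """Read one headerless CSV and name its value column after the file."""
    col = Path(name).stem  # "data/hash_rate.csv" -> "hash_rate"
    return pd.read_csv(
        io.StringIO(text),
        header=None,
        names=["date", col],
        parse_dates=["date"],
        index_col="date",
    )


def merge_all(files):
    """Concatenate the per-file frames side by side, aligned on the date index."""
    return pd.concat([read_one(n, t) for n, t in files.items()], axis=1)


merged = merge_all(SAMPLE_FILES)
print(merged)
```

If only the dates present in every file are wanted, passing join='inner' to pd.concat should restrict the result to the shared index values.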

Consider explicitly naming the columns with read_csv's names argument, using the file name itself (symbol in the loop) with the .csv extension stripped:

for symbol in symbols:
    ...
    df_temp = pd.read_csv(location, 
                          index_col=0, 
                          parse_dates=True, 
                          dayfirst=True, 
                          na_values=['nan'],
                          header=None,
                          names=['date', symbol.replace('.csv', '')])
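
Why header=None matters can be seen on a tiny example: by default read_csv treats the first line of the file as the header, so the first data row ends up as column names. Here is a minimal sketch, assuming a made-up two-row file held in memory (the values and the hash_rate column name are hypothetical):

```python
import io

import pandas as pd

# Hypothetical headerless CSV content: (date, value) pairs.
RAW = "2015-04-02,1.25\n2015-04-03,2.50\n"

# Default behaviour: the first line is consumed as the header row,
# so the first value becomes a column name and one data row is lost.
wrong = pd.read_csv(io.StringIO(RAW), index_col=0, parse_dates=True)

# With header=None and explicit names, every line is kept as data.
right = pd.read_csv(
    io.StringIO(RAW),
    index_col=0,
    parse_dates=True,
    header=None,
    names=["date", "hash_rate"],
)

print(list(wrong.columns), len(wrong))
print(list(right.columns), len(right))
```

Because each file's first row is different, the default behaviour gives every file a different, data-dependent column name, which matches the symptom described in the question.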
