I have 16 different CSV files in one directory and I am trying to load them into one pandas DataFrame. Each file has a datetime column and a float64 column. None of the CSV files have column headers.
location = os.path.join(base_dir, "DirectoryName")
symbols = os.listdir(location)
df = pd.DataFrame(index=dates)
for symbol in symbols:
    location = os.path.join(base_dir, "DirectoryName", symbol)
    df_temp = pd.read_csv(location, index_col=0, parse_dates=True, dayfirst=True, na_values=['nan'])
    df_temp.dropna()
    df_temp.index = df_temp.index.normalize()
    df_temp = normalize_data(df_temp)
    df = df.join(df_temp)
The problem is that the final DataFrame df has datetime as its index, but the values from the first row of each file have become the column names, and the first row is filled with NaN. I could remove the first row of df, but that won't help much with other operations because some data would be missing. I couldn't rename the column headers because they are different for each file, and I only know how to rename them statically.
I've downloaded just the following files:
['hash_rate.csv',
'difficulty.csv',
'cost_per_tx.csv',
'block_size.csv',
'avg_block_size.csv']
That is why you will see only the corresponding part of your data in the resulting DF.
Please find comments in the code.
Code:
import os
import glob
import pandas as pd


def read_files(filelist):
    # `dfs` will contain a list of DFs
    # that will be concatenated later on
    dfs = []
    for fn in filelist:
        # parse column name from filename
        col = os.path.splitext(os.path.split(fn)[-1])[0]
        # read individual CSV into a temp DF
        # and add this temporary DF to the `dfs` list
        dfs.append(pd.read_csv(
            fn,
            parse_dates=[0],
            header=None,
            index_col='date',
            names=['date', col]
        ))
    # return DFs concatenated horizontally (axis=1)
    return pd.concat(dfs, axis=1)


def main():
    data_files_mask = r'D:\temp\.data\36827502\*.csv'
    df = read_files(glob.glob(data_files_mask))
    print(df)


if __name__ == '__main__':
    main()
Output:
                     block_size     hash_rate  avg_block_size  cost_per_tx  \
date
2015-01-05 18:15:05     34469.0  3.479099e+08        0.375637     8.185000
2015-01-06 18:15:05     36219.0  3.323940e+08        0.477130     6.598278
2015-01-07 18:15:05     38212.0  3.560892e+08        0.624724     6.232809
2015-01-08 18:15:05     40943.0  4.261981e+08        0.754424     7.113695
2015-01-09 18:15:05     43021.0  4.099610e+08        0.515467     6.199964
2015-01-10 18:15:05     45487.0  4.655484e+08        0.451940     6.821970
2015-01-11 18:15:05     47963.0  4.920513e+08        0.535354     7.958116
2015-01-12 18:15:05     50594.0  6.940933e+08        0.536199     9.415383
2015-02-04 18:15:05     32832.0  3.413843e+08        0.421406     8.054181
2015-02-05 18:15:05     34523.0  3.479099e+08        0.373642     8.958115

                       difficulty
date
2015-01-05 18:15:05  4.761056e+10
2015-01-06 18:15:05  4.880749e+10
2015-01-07 18:15:05  4.940201e+10
2015-01-08 18:15:05  5.227830e+10
2015-01-09 18:15:05  5.425663e+10
2015-01-10 18:15:05  6.081322e+10
2015-01-11 18:15:05  6.225398e+10
2015-01-12 18:15:05  7.272278e+10
2015-02-04 18:15:05  4.671755e+10
2015-02-05 18:15:05  4.761056e+10
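As a small illustration of why pd.concat with axis=1 fits this case: it aligns the individual frames on their date index, so a date that is missing from one file shows up as NaN in that column instead of shifting rows. The sketch below uses made-up in-memory data; the helper read_one, the column names and the values are hypothetical and not taken from the real files.

```python
import io
import pandas as pd

# Hypothetical headerless CSV contents, standing in for two of the files
csv_a = "2015-01-05,1.0\n2015-01-06,2.0\n"
csv_b = "2015-01-06,20.0\n2015-01-07,30.0\n"


def read_one(text, col):
    # header=None: the first line is data, not column names
    return pd.read_csv(io.StringIO(text), header=None,
                       names=['date', col], parse_dates=[0],
                       index_col='date')


# Frames are aligned on the union of their date indexes
df = pd.concat([read_one(csv_a, 'hash_rate'),
                read_one(csv_b, 'difficulty')], axis=1)
print(df)
```

Dates present in only one input keep their value in that column and get NaN in the other, which is usually easier to handle afterwards (for example with dropna or fillna) than rows that were silently lost.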
Consider explicitly defining the column names with read_csv's names argument, using the file name itself (symbol in the loop, with the .csv extension removed):
for symbol in symbols:
    ...
    df_temp = pd.read_csv(location,
                          index_col=0,
                          parse_dates=True,
                          dayfirst=True,
                          na_values=['nan'],
                          header=None,
                          names=['date', symbol.replace('.csv', '')])
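To see why header=None matters, here is a minimal sketch using made-up in-memory data (the file name hash_rate.csv and the values are only for illustration). By default read_csv treats the first line as the header, which is what turns the first data row into column names. os.path.splitext is used here as one possible alternative to str.replace for dropping the extension.

```python
import io
import os
import pandas as pd

# Hypothetical headerless file content and file name
text = "05/01/2015,1.5\n06/01/2015,2.5\n"
symbol = "hash_rate.csv"

# Default behaviour: the first line is consumed as the header,
# so its value becomes the column name and one data row is lost
bad = pd.read_csv(io.StringIO(text), index_col=0)
print(list(bad.columns))
print(len(bad))

# With header=None and names, every line is kept as data
col = os.path.splitext(symbol)[0]
good = pd.read_csv(io.StringIO(text), header=None, names=['date', col],
                   index_col=0, parse_dates=True, dayfirst=True)
print(good)
```

Because each column is now named after its file, the frames no longer clash on column names when joined, and no renaming step is needed afterwards.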