I'm trying to merge multiple OHLC CSV files into one file. I downloaded some data from the internet, but the indexes partly overlap. The data might look like:
<file1>
o,h,l,c,v,t
...
364.4,364.4,364.4,364.4,155.0,2019-01-01 10:59:59
364.4,364.59,364.4,364.59,371.0,2019-01-01 11:00:00
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01
<file2>
o,h,l,c,v,t
364.4,364.59,364.4,364.59,1371.0,2019-01-01 11:00:00
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:02
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:03
...
<file3>
o,h,l,c,v,t
364.4,364.4,364.4,364.4,155.0,2019-01-01 12:00:00
364.4,364.59,364.4,364.59,1371.0,2019-01-01 12:00:01
...
What I need...
<file_merged>
o,h,l,c,v,t
...
364.4,364.4,364.4,364.4,155.0,2019-01-01 10:59:59
364.4,364.59,364.4,364.59,371.0,2019-01-01 11:00:00
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:02
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:03
...
364.4,364.4,364.4,364.4,155.0,2019-01-01 12:00:00
364.4,364.59,364.4,364.59,1371.0,2019-01-01 12:00:01
...
I have found a neat way to load the data into one object, but I'm missing the last part, where I need to merge the unordered list into one ordered dataframe with duplicates removed and rows sorted. I think I could use pd.merge(left, right, on='t') in an ugly loop, but I'm wondering if there is a more elegant way? This is how I load the data so far (without the ugly loop):
import os
import pandas as pd

datadir = r'/home/test'
files = [i for i in os.listdir(datadir)
         if os.path.isfile(os.path.join(datadir, i)) and i.startswith('AAPL_')]
datalist = [pd.read_csv(os.path.join(datadir, filename), index_col='t') for filename in files]
Thanks in advance
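(For comparison, the "ugly loop" I had in mind could be written as a reduce over pd.merge with an outer join on every column, so duplicate rows collapse. This is just a sketch; the tiny df1/df2 frames stand in for the loaded files:)

```python
import functools
import pandas as pd

# Stand-in frames with one overlapping row (t=2); in practice these
# would come from the datalist loaded above, with 't' as a column.
df1 = pd.DataFrame({'t': [1, 2], 'c': [10.0, 11.0]})
df2 = pd.DataFrame({'t': [2, 3], 'c': [11.0, 12.0]})

# Outer-merge on all shared columns: identical rows collapse into one.
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on=list(left.columns), how='outer'),
    [df1, df2],
).sort_values('t')
```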
Use pathlib.Path.glob to get the paths of all the files in datadir that start with AAPL_, and read each one with pd.read_csv along with the optional parameters parse_dates and index_col. Finally, use pd.concat to concatenate all the dataframes and DataFrame.sort_index to sort the result on its index.
from pathlib import Path
dfs = [pd.read_csv(p, parse_dates=['t'])
for p in Path(datadir).glob(r'AAPL_*')]
df = pd.concat(dfs).drop_duplicates('t').set_index('t').sort_index()
Result:
print(df)
o h l c v
t
2019-01-01 10:59:59 364.40 364.40 364.40 364.40 155.0
2019-01-01 11:00:00 364.40 364.59 364.40 364.59 1371.0
2019-01-01 11:00:01 364.59 364.59 364.59 364.59 305.0
2019-01-01 11:00:02 364.59 364.59 364.59 364.59 305.0
2019-01-01 11:00:03 364.59 364.59 364.59 364.59 305.0
2019-01-01 12:00:00 364.40 364.40 364.40 364.40 155.0
2019-01-01 12:00:01 364.40 364.59 364.40 364.59 1371.0
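Note that drop_duplicates('t') keeps the first occurrence of each duplicated timestamp, so which row survives depends on the order in which glob yields the files. If you'd rather dedup on the index and write the merged result back out to a single CSV (the question's original goal), a sketch along these lines works; the inline StringIO data and the merged.csv name are just for illustration:

```python
import pandas as pd
from io import StringIO

# Two overlapping sample "files", inlined so the sketch is self-contained.
file1 = StringIO(
    "o,h,l,c,v,t\n"
    "364.4,364.59,364.4,364.59,371.0,2019-01-01 11:00:00\n"
)
file2 = StringIO(
    "o,h,l,c,v,t\n"
    "364.4,364.59,364.4,364.59,1371.0,2019-01-01 11:00:00\n"
    "364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01\n"
)

dfs = [pd.read_csv(f, parse_dates=['t'], index_col='t') for f in (file1, file2)]
# concat preserves file order, so keep='last' keeps the later file's row
# for any duplicated timestamp; then sort by the datetime index.
df = pd.concat(dfs)
df = df[~df.index.duplicated(keep='last')].sort_index()
df.to_csv('merged.csv')
```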