
Merging multiple OHLC CSV files with overlaps into one sorted CSV file with pandas

I'm trying to merge multiple OHLC CSV files into one file. I downloaded some data from the internet, but the indexes of the files partly overlap. The data might look like:

<file1>
o,h,l,c,v,t
...
364.4,364.4,364.4,364.4,155.0,2019-01-01 10:59:59
364.4,364.59,364.4,364.59,371.0,2019-01-01 11:00:00
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01

<file2>
o,h,l,c,v,t
364.4,364.59,364.4,364.59,1371.0,2019-01-01 11:00:00
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:02
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:03
...

<file3>
o,h,l,c,v,t
364.4,364.4,364.4,364.4,155.0,2019-01-01 12:00:00
364.4,364.59,364.4,364.59,1371.0,2019-01-01 12:00:01
...

What I need...

<file_merged>
o,h,l,c,v,t
...
364.4,364.4,364.4,364.4,155.0,2019-01-01 10:59:59
364.4,364.59,364.4,364.59,371.0,2019-01-01 11:00:00
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:02
364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:03
...
364.4,364.4,364.4,364.4,155.0,2019-01-01 12:00:00
364.4,364.59,364.4,364.59,1371.0,2019-01-01 12:00:01
...

I have found a neat way to load the data into one object, but I'm missing the last part: merging the unordered list into one ordered dataframe, with duplicates removed and the data sorted. I think I could use pd.merge(left, right, on='t') in an ugly loop, but I'm wondering if there is a more elegant way. This is how I have loaded the data so far (without the ugly loop):

import os
import pandas as pd

datadir = r'/home/test'
files = [i for i in os.listdir(datadir) if os.path.isfile(os.path.join(datadir, i)) and i.startswith('AAPL_')]
datalist = [pd.read_csv(os.path.join(datadir, filename), index_col='t') for filename in files]

Thanks in advance

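Use pathlib.Path.glob to collect the paths of all files in datadir whose names start with AAPL_, and read each one with pd.read_csv, passing the optional parameter parse_dates so that t is parsed as a datetime. Then use pd.concat to combine the dataframes, DataFrame.drop_duplicates to remove the repeated timestamps, DataFrame.set_index to make t the index, and DataFrame.sort_index to sort the dataframe on that index. Note that drop_duplicates keeps the first row it sees for each timestamp, so when overlapping rows differ (as the volume at 11:00:00 does in the sample data), the row that survives depends on the order in which the files are read.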

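from pathlib import Path

import pandas as pd

# Read every AAPL_* file in datadir, parsing the 't' column as datetimes
dfs = [pd.read_csv(p, parse_dates=['t'])
       for p in Path(datadir).glob(r'AAPL_*')]
# Combine, drop repeated timestamps (first occurrence wins), index by 't', sort
df = pd.concat(dfs).drop_duplicates('t').set_index('t').sort_index()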

Result:

print(df)
                          o       h       l       c       v
t                                                          
2019-01-01 10:59:59  364.40  364.40  364.40  364.40   155.0
2019-01-01 11:00:00  364.40  364.59  364.40  364.59  1371.0
2019-01-01 11:00:01  364.59  364.59  364.59  364.59   305.0
2019-01-01 11:00:02  364.59  364.59  364.59  364.59   305.0
2019-01-01 11:00:03  364.59  364.59  364.59  364.59   305.0
2019-01-01 12:00:00  364.40  364.40  364.40  364.40   155.0
2019-01-01 12:00:01  364.40  364.59  364.40  364.59  1371.0
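
Since the sample files disagree on the volume at 11:00:00 (371.0 in file1, 1371.0 in file2), it may help to make the choice of surviving row explicit and to write the result back out as CSV. The following is only a minimal sketch under stated assumptions: the helper name merge_ohlc and the in-memory strings FILE1 and FILE2 are hypothetical stand-ins for the real files, and the final step assumes the merged file should keep the original o,h,l,c,v,t column order.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for two overlapping CSV files.
FILE1 = (
    "o,h,l,c,v,t\n"
    "364.4,364.4,364.4,364.4,155.0,2019-01-01 10:59:59\n"
    "364.4,364.59,364.4,364.59,371.0,2019-01-01 11:00:00\n"
    "364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01\n"
)
FILE2 = (
    "o,h,l,c,v,t\n"
    "364.4,364.59,364.4,364.59,1371.0,2019-01-01 11:00:00\n"
    "364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:01\n"
    "364.59,364.59,364.59,364.59,305.0,2019-01-01 11:00:02\n"
)


def merge_ohlc(sources, keep="first"):
    """Concatenate sources in the given order, drop repeated timestamps, sort by 't'.

    `sources` may be paths or file-like objects; `keep` is passed on to
    DataFrame.drop_duplicates ('first' or 'last').
    """
    frames = [pd.read_csv(src, parse_dates=["t"]) for src in sources]
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.drop_duplicates(subset="t", keep=keep)
    return merged.set_index("t").sort_index()


# With keep="first", the row from the earliest source in the list survives.
merged = merge_ohlc([io.StringIO(FILE1), io.StringIO(FILE2)], keep="first")

# Write the result in the original column order (an assumption about the
# desired layout); a file path could be passed instead of the buffer.
out = io.StringIO()
merged.reset_index()[["o", "h", "l", "c", "v", "t"]].to_csv(out, index=False)
```

For files on disk, passing sorted(Path(datadir).glob('AAPL_*')) as sources would fix the read order, so the same row is kept on every run; whether the first or the last file should win for an overlapping timestamp is a decision about the data that the question leaves open.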


 