How to merge multiple CSV files side by side in time order? (Python)

I have downloaded 120 CSV files (10 years of data, month by month).

I'm using the code below, which merges all of these into one document in time order, e.g. from 1/1/09 to 1/1/19.

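from glob import glob

# Sorted file names give chronological order only if the names sort by date.
files = sorted(glob('*.csv'))
with open('cat.csv', 'w') as fi_out:
    for i, fname_in in enumerate(files):
        with open(fname_in, 'r') as fi_in:
            for i_line, line in enumerate(fi_in):
                # Keep the header line from the first file only.
                if i_line > 0 or i == 0:
                    fi_out.write(line)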

This all works fine; however, I have now also downloaded the same type of data for a different product. I also want to put this new data in time order, but have it side by side with the old set of data.

I receive an error (see the traceback under EDIT1).

Any help would be appreciated.

EDIT1:

Traceback (most recent call last):
  File "/Users/myname/Desktop/collate/asdas.py", line 4, in <module>
    result = pd.merge(data1[['REGION', 'TOTALDEMAND', 'RRP']], data2, on='SETTLEMENTDATE')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 61, in merge
    validate=validate)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 551, in __init__
    self.join_names) = self._get_merge_keys()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 871, in _get_merge_keys
    lk, stacklevel=stacklevel))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'SETTLEMENTDATE'

EDIT2:

import pandas as pd
df1 = pd.read_csv("product1.csv") 
df2 = pd.read_csv("product2.csv") 
combine = pd.merge(df1, df2, on='DATE', how='outer')
combine.columns = ['product1_price', 'REGION1', 'DATE', 'product2_price', 'REGION2']
combine[['DATE','product1_price','product2_price']]
combine.to_csv("combine.csv",index=False)

Error:

Traceback (most recent call last):
  File "/Users/george/Desktop/collate/asdas.py", line 5, in <module>
    combine.columns = ['VICRRP', 'REGION1', 'SETTLEMENTDATE', 'QLD1RRP', 'REGION2']
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 4389, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 646, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 3323, in set_axis
    'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 9 elements, new values have 5 elements

Load your data into dataframes

import pandas as pd
data1 = pd.read_csv("filename1.csv") 
data2 = pd.read_csv("filename2.csv") 

Merge the two dataframes on SETTLEMENTDATE

result = pd.merge(data1, data2, on='SETTLEMENTDATE')

This assumes a 1-to-1 relationship between the SETTLEMENTDATE values in the two dataframes. If a date appears more than once in either file, the merged result will contain duplicated rows.
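One way to check that 1-to-1 assumption before relying on the merge is sketched below. The frames and values are made up for illustration. `validate='one_to_one'` is an option of `pd.merge` that raises a `MergeError` when a key repeats on either side, instead of silently duplicating rows.

```python
import pandas as pd

# Made-up sample data: the second frame repeats one SETTLEMENTDATE value.
data1 = pd.DataFrame({
    'SETTLEMENTDATE': ['2009/01/01 00:30:00', '2009/01/01 01:00:00'],
    'RRP': [20.5, 21.0],
})
data2 = pd.DataFrame({
    'SETTLEMENTDATE': ['2009/01/01 00:30:00', '2009/01/01 00:30:00'],
    'RRP': [18.0, 18.5],
})

# Count repeated keys on each side before merging.
dupes_left = int(data1['SETTLEMENTDATE'].duplicated().sum())
dupes_right = int(data2['SETTLEMENTDATE'].duplicated().sum())

# With validate='one_to_one', pandas raises rather than duplicating rows.
try:
    pd.merge(data1, data2, on='SETTLEMENTDATE', validate='one_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# Dropping the repeats (keeping the first occurrence) restores a 1-to-1 merge.
data2_unique = data2.drop_duplicates(subset='SETTLEMENTDATE', keep='first')
result = pd.merge(data1, data2_unique, on='SETTLEMENTDATE',
                  suffixes=('_1', '_2'), validate='one_to_one')
```

Whether keeping the first occurrence is the right way to resolve repeats depends on the data; it is only one possible choice.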

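EDIT: To remove the column "PERIOD TYPE", select only the columns you want from data1, keeping the merge key SETTLEMENTDATE in the selection:

result = pd.merge(data1[['REGION', 'TOTALDEMAND', 'RRP', 'SETTLEMENTDATE']], data2, on='SETTLEMENTDATE')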
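The KeyError: 'SETTLEMENTDATE' in the question looks like what happens when the key is left out of the column selection, as in the call shown in that traceback. A small sketch that guards against this is below. `select_with_key` is a hypothetical helper name, and the rows are made up, using the column names from the question.

```python
import pandas as pd

def select_with_key(df, columns, key):
    """Return df restricted to `columns`, always keeping the merge key."""
    wanted = [key] + [c for c in columns if c != key]
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError('columns not found: {}'.format(missing))
    return df[wanted]

# Made-up rows; 'PERIOD TYPE' is the column we want to drop.
data1 = pd.DataFrame({
    'REGION': ['VIC1', 'VIC1'],
    'SETTLEMENTDATE': ['2009/01/01 00:30:00', '2009/01/01 01:00:00'],
    'TOTALDEMAND': [5000.0, 4900.0],
    'RRP': [20.5, 21.0],
    'PERIOD TYPE': ['TRADE', 'TRADE'],
})
data2 = pd.DataFrame({
    'SETTLEMENTDATE': ['2009/01/01 00:30:00', '2009/01/01 01:00:00'],
    'RRP': [18.0, 18.5],
})

# The key is added to the selection even though it is not listed explicitly.
left = select_with_key(data1, ['REGION', 'TOTALDEMAND', 'RRP'], 'SETTLEMENTDATE')
result = pd.merge(left, data2, on='SETTLEMENTDATE', suffixes=('_1', '_2'))
```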

Another option: use how='outer' when some dates may appear in only one of the two CSV files. An outer merge keeps all the dates from both files and fills the missing values with NaN.
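To see which dates an inner merge would drop, `pd.merge` also accepts `indicator=True`, which adds a `_merge` column. A short sketch with made-up values:

```python
import pandas as pd

# Made-up data: one shared date and one date unique to each file.
data1 = pd.DataFrame({'SETTLEMENTDATE': ['2013-06-01', '2013-08-01'], 'RRP': [1, 8]})
data2 = pd.DataFrame({'SETTLEMENTDATE': ['2013-06-01', '2014-08-01'], 'RRP': [2, 4]})

# Inner merge: only dates present in both frames survive.
inner = pd.merge(data1, data2, on='SETTLEMENTDATE', how='inner',
                 suffixes=('_1', '_2'))

# Outer merge: every date survives; _merge records where each one was found
# ('both', 'left_only' or 'right_only').
outer = pd.merge(data1, data2, on='SETTLEMENTDATE', how='outer',
                 suffixes=('_1', '_2'), indicator=True)

# Rows whose date exists in only one of the two frames.
only_one_side = outer[outer['_merge'] != 'both']
```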

Full mockup below:

import pandas as pd 
df1 = pd.DataFrame({
    'SETDATE':['01-06-2013','01-08-2013'],
    'Region':['VIC1','VIC1'],
    'RRP':[1,8]})
df2 = pd.DataFrame({
    'SETDATE':['01-06-2013','01-08-2014'],
    'Region':['QLD1','QLD1'],
    'RRP':[2,4]})

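# Suffixes label the overlapping columns by source, so renaming by name
# does not depend on the column order of the merged frame.
combine = pd.merge(df1, df2, on='SETDATE', how='outer', suffixes=('_VIC', '_QLD'))
combine = combine.rename(columns={'RRP_VIC': 'VICRRP', 'Region_VIC': 'Reg1',
                                  'RRP_QLD': 'QLD1RRP', 'Region_QLD': 'Reg2'})
combine[['SETDATE','VICRRP','QLD1RRP']]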

Results below:

      SETDATE  VICRRP  QLD1RRP
0  01-06-2013     1.0      2.0
1  01-08-2013     8.0      NaN
2  01-08-2014     NaN      4.0
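If the merged rows also need to come out in time order, one option is to parse the date column and sort on it, because date strings sorted as text are not always chronological. A minimal sketch, assuming the dates are day-month-year strings (the format is an assumption, and the sample values are made up):

```python
import pandas as pd

raw_dates = ['01-02-2014', '01-12-2013', '01-06-2013']
combine = pd.DataFrame({
    'SETDATE': raw_dates,
    'VICRRP': [4.0, 8.0, 1.0],
    'QLD1RRP': [float('nan'), 3.0, 2.0],
})

# Sorting the raw strings compares characters, not dates.
text_order = sorted(raw_dates)

# Assumption: day-month-year; change the format string if the files differ.
combine['SETDATE'] = pd.to_datetime(combine['SETDATE'], format='%d-%m-%Y')
combine = combine.sort_values('SETDATE').reset_index(drop=True)

time_order = list(combine['SETDATE'].dt.strftime('%d-%m-%Y'))
```

Here `text_order` puts the 2014 date first, while `time_order` puts it last.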

All code below is for Python 3.

Python has a standard library module called csv. Its readers are lazy by default: they read a row from the file only when you ask for the next one, so memory use should stay low even for large files.

The code will look something like this; pardon me if there are issues in it.

import csv
vicfilename = 'filename1.csv'
qldfilename = 'filename2.csv'
mergedfilename = 'newfile.csv'

with open(mergedfilename, 'w', newline='') as mergedfile:
    fieldnames = ['SETTLEMENTDATE', 'VIC DEMAND', 'VIC RRP', 'QLD DEMAND', 'QLD RRP']
    writer = csv.DictWriter(mergedfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(vicfilename, 'r', newline='') as vicfile:
        vicreader = csv.DictReader(vicfile)
        with open(qldfilename, 'r', newline='') as qldfile:
            qldreader = csv.DictReader(qldfile)

            for vicrow in vicreader:
                for qldrow in qldreader:
                    if vicrow['SETTLEMENTDATE'] == qldrow['SETTLEMENTDATE']:
                        writer.writerow({'SETTLEMENTDATE': vicrow['SETTLEMENTDATE'],
                                         'VIC DEMAND': vicrow['TOTALDEMAND'],
                                         'VIC RRP': vicrow['RRP'],
                                         'QLD DEMAND': qldrow['TOTALDEMAND'],
                                         'QLD RRP': qldrow['RRP']})
                        break
                qldfile.seek(0)
                qldreader = csv.DictReader(qldfile)
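One possible improvement, sketched under assumptions: the loop above rescans the QLD file for every VIC row, which gets slow for large files. If the QLD file fits in memory and each date appears at most once in it, the rows can be indexed by date once and looked up directly. `merge_on_date` is a hypothetical name, and the demo values are made up.

```python
import csv
import io

def merge_on_date(vicfile, qldfile, mergedfile, key='SETTLEMENTDATE'):
    """Write one row per date that appears in both inputs (open file objects)."""
    # Index the QLD rows by date once. Assumes the file fits in memory
    # and that each date appears at most once in it.
    qld_by_date = {row[key]: row for row in csv.DictReader(qldfile)}

    fieldnames = [key, 'VIC DEMAND', 'VIC RRP', 'QLD DEMAND', 'QLD RRP']
    writer = csv.DictWriter(mergedfile, fieldnames=fieldnames)
    writer.writeheader()

    # Stream the VIC rows; each lookup is a dict access, not a file rescan.
    for vicrow in csv.DictReader(vicfile):
        qldrow = qld_by_date.get(vicrow[key])
        if qldrow is None:
            continue  # date missing from the QLD file, so the row is skipped
        writer.writerow({
            key: vicrow[key],
            'VIC DEMAND': vicrow['TOTALDEMAND'],
            'VIC RRP': vicrow['RRP'],
            'QLD DEMAND': qldrow['TOTALDEMAND'],
            'QLD RRP': qldrow['RRP'],
        })

# Small in-memory demo with made-up values.
vic = io.StringIO('SETTLEMENTDATE,TOTALDEMAND,RRP\n'
                  '2009/01/01,5000,20.5\n'
                  '2009/01/02,4900,21.0\n')
qld = io.StringIO('SETTLEMENTDATE,TOTALDEMAND,RRP\n'
                  '2009/01/01,6000,18.0\n')
merged = io.StringIO()
merge_on_date(vic, qld, merged)
```

With real files, open all three with newline='' as in the code above and pass the file objects to the function.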

Code improvements are welcome!
