Merging multiple CSV files based on specific column - Python
I'm trying to combine about 101 CSV files in Pandas. Each file has two time columns and a 'value' column. I'd like to keep the two time columns, since they are the same across all the CSV files, and merge the 'value' column from each of the 101 CSVs into a new CSV file.
Using pd.merge, I can combine two files as shown here:
import pandas as pd

data1 = {'time': ['00:00','01:00','02:00'],
         'local_time': ['09:30','10:30','11:30'],
         'value': ['265.591','330.766','360.962']}
data2 = {'time': ['00:00','01:00','02:00'],
         'local_time': ['09:30','10:30','11:30'],
         'value': ['521.217','588.034','588.034']}
df_1 = pd.DataFrame(data1)
df_2 = pd.DataFrame(data2)
# one suffix per file, applied to the overlapping 'value' column
locs = ['_A11','_B10']
df_test = pd.merge(df_1, df_2, on=['time','local_time'], how='inner', suffixes=locs)
print(df_test)
This yields:
time local_time value_A11 value_B10
0 00:00 09:30 265.591 521.217
1 01:00 10:30 330.766 588.034
2 02:00 11:30 360.962 588.034
However, I'm not quite sure how to combine the next 99 CSV files, or whether this is even the best way to approach the task.
I'm aiming to get something like:
time local_time value_A11 value_B10 value_B11 ...
0 00:00 09:30 265.591 521.217 123 ...
1 01:00 10:30 330.766 588.034 456 ...
2 02:00 11:30 360.962 588.034 789 ...
Any help would be very much appreciated!
EDIT 1:
Colin's example worked; however, I've been loading the dataframes into a list like this:
import glob
import os
import pandas as pd
# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)
# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]
# test df
df_list[0]
How could I amend the for loop so that it works through this list? Thanks again!
EDIT 2: Errors from the answer to edit 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-144-772c1d15f228> in <module>
14 # loop through each dataframe and merge it with existing one
15 for i, df in enumerate(df_list[1:]):
---> 16 df_output = pd.merge(df_list[0], df, on=['time','local_time'], how='inner', suffixes = (['_' + str(i), '_' + str(i+1)]))
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
79 copy=copy,
80 indicator=indicator,
---> 81 validate=validate,
82 )
83 return op.get_result()
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
628 # validate the merge keys dtypes. We may need to coerce
629 # to avoid incompat dtypes
--> 630 self._maybe_coerce_merge_keys()
631
632 # If argument passed to validate,
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
1136 inferred_right in string_types and inferred_left not in string_types
1137 ):
-> 1138 raise ValueError(msg)
1139
1140 # datetimelikes must match exactly
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
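The message says the merge keys have different dtypes in the two frames: object (strings) in one and float64 in the other. One way to find which files are affected is to compare the key dtypes across the list. The following is only a sketch: df_a, df_b and the single-key list are hypothetical in-memory stand-ins for the real CSV data, and the float64 column is simulated with NaN values.

```python
import pandas as pd

# Hypothetical frames: the second has a float64 'time' column,
# which can happen when the column is read in as empty or numeric.
df_a = pd.DataFrame({"time": ["00:00", "01:00"], "value": [1.0, 2.0]})
df_b = pd.DataFrame({"time": [float("nan"), float("nan")], "value": [3.0, 4.0]})
df_list = [df_a, df_b]

keys = ["time"]
# dtypes of the key columns in the first frame serve as the reference
reference = df_list[0][keys].dtypes.tolist()

# positions of frames whose key dtypes differ from the reference
mismatched = [
    i for i, df in enumerate(df_list)
    if df[keys].dtypes.tolist() != reference
]
print(mismatched)
```

Any position reported here points at a file whose key columns need converting (or whose header row needs checking) before the merge.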
EDIT 3
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-cce982321079> in <module>
11
12 # change datatype to datetime for first df
---> 13 df['local_time'] = pd.to_datetime(df_list[0]['local_time'])
14 df['time'] = pd.to_datetime(df_list[0]['time'])
15
NameError: name 'df' is not defined
This seems like a good approach. I would just set up the merge and suffixes a little differently so you can loop through each dataframe, like below. Each new value column will be merged into df_output.
EDIT: Updated code to work with OP's edit
EDIT 2: Fixed datatype for OP's error
import pandas as pd
import glob

# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)

# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]

# start from the first df and change its key columns to datetime
df_output = df_list[0].copy()
df_output['local_time'] = pd.to_datetime(df_output['local_time'])
df_output['time'] = pd.to_datetime(df_output['time'])
df_output = df_output.rename(columns={'value': 'value_0'})

# loop through each remaining dataframe and merge it into df_output
for i, df in enumerate(df_list[1:], start=1):
    # change datatype to datetime
    df['local_time'] = pd.to_datetime(df['local_time'])
    df['time'] = pd.to_datetime(df['time'])
    # rename before merging so every value column gets a unique name
    df = df.rename(columns={'value': 'value_' + str(i)})
    df_output = pd.merge(df_output, df, on=['time','local_time'], how='inner')

#print(df_output)
'''
time local_time value_0 value_1 value_2 value_3
0 00:00 09:30 738.591 265.591 521.217 856.217
1 01:00 10:30 217.766 330.766 588.034 346.034
2 02:00 11:30 295.962 360.962 588.034 645.034
'''
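An alternative to merging in a loop, since the two time columns are the same in every file, is to index each frame by those columns and concatenate the 'value' columns side by side. This is only a sketch under stated assumptions: the frames dict is an in-memory stand-in for the CSV files (values taken from the question), and the labels A11, B10 and B11 are assumed to come from the file names in practice.

```python
import pandas as pd

# Hypothetical stand-ins for the CSV files; in practice each frame would
# come from pd.read_csv(...) and each label from the file name.
times = ["00:00", "01:00", "02:00"]
local_times = ["09:30", "10:30", "11:30"]
frames = {
    "A11": pd.DataFrame({"time": times, "local_time": local_times,
                         "value": [265.591, 330.766, 360.962]}),
    "B10": pd.DataFrame({"time": times, "local_time": local_times,
                         "value": [521.217, 588.034, 588.034]}),
    "B11": pd.DataFrame({"time": times, "local_time": local_times,
                         "value": [123.0, 456.0, 789.0]}),
}

keys = ["time", "local_time"]

# Index each frame by the shared time columns and keep only 'value',
# renamed after its label so every output column is unique.
columns = [
    df.set_index(keys)["value"].rename(f"value_{label}")
    for label, df in frames.items()
]

# axis=1 aligns rows on the index; join="inner" keeps only the
# timestamps that are present in every frame.
combined = pd.concat(columns, axis=1, join="inner").reset_index()
print(combined)
```

Because the column names are set before concatenating, there is no need to manage merge suffixes, and the result can be written out with combined.to_csv(...).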