
Merging multiple CSV files based on specific column - Python

I'm trying to combine about 101 CSV files in Pandas. Each file has 2 time columns and a 'value' column. I'd like to keep the 2 time columns, as they are the same across the CSV files, and then merge the 'value' column from each of the 101 CSVs into a new CSV file.

Using pd.merge I can combine 2 files with the code below:

import pandas as pd

data1 = {'time': ['00:00','01:00','02:00'],
        'local_time': ['09:30','10:30','11:30'],
        'value': ['265.591','330.766','360.962']}

data2 = {'time': ['00:00','01:00','02:00'],
        'local_time': ['09:30','10:30','11:30'],
        'value': ['521.217','588.034','588.034']}

df_1 = pd.DataFrame(data1)
df_2 = pd.DataFrame(data2)

# suffixes appended to the overlapping 'value' column of each frame
locs = ['_A11','_B10']

df_test = pd.merge(df_1, df_2, on=['time','local_time'], how='inner', suffixes=locs)

print(df_test)

This yields:

    time local_time value_A11 value_B10
0  00:00      09:30   265.591   521.217
1  01:00      10:30   330.766   588.034
2  02:00      11:30   360.962   588.034

However, I'm not quite sure how to combine the next 99 CSV files, or if this is even the best way to approach this task.

I'm aiming to get something like:

    time local_time value_A11 value_B10 value_B11 ...
0  00:00      09:30   265.591   521.217       123 ...
1  01:00      10:30   330.766   588.034       456 ...
2  02:00      11:30   360.962   588.034       789 ...

Any help would be very much appreciated!

EDIT 1:

Colin's example worked, however I've been loading the dataframes into a list like this:

import glob
import os
import pandas as pd

# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)

# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]

#test df
df_list[0]

I was wondering how I could amend the for loop so that it can take this list as input? Thanks again!

EDIT 2: Errors from the answer to edit 1

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-144-772c1d15f228> in <module>
     14 # loop through each dataframe and merge it with existing one
     15 for i, df in enumerate(df_list[1:]):
---> 16   df_output = pd.merge(df_list[0], df, on=['time','local_time'], how='inner', suffixes = (['_' + str(i), '_' + str(i+1)]))

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     79         copy=copy,
     80         indicator=indicator,
---> 81         validate=validate,
     82     )
     83     return op.get_result()

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    628         # validate the merge keys dtypes. We may need to coerce
    629         # to avoid incompat dtypes
--> 630         self._maybe_coerce_merge_keys()
    631 
    632         # If argument passed to validate,

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
   1136                     inferred_right in string_types and inferred_left not in string_types
   1137                 ):
-> 1138                     raise ValueError(msg)
   1139 
   1140             # datetimelikes must match exactly

ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
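
The ValueError above means that a key column has a different dtype in the two frames being merged (here, text in one and float64 in the other). A minimal sketch for finding which frame is the odd one out, assuming the frames are held in a list such as df_list and the keys are 'time' and 'local_time'. The helper name report_key_dtypes and the two sample frames are hypothetical and only there for illustration:

```python
import pandas as pd

def report_key_dtypes(frames, keys=('time', 'local_time')):
    """Return one row per frame listing the dtype of each merge key."""
    rows = []
    for idx, frame in enumerate(frames):
        row = {'frame': idx}
        for key in keys:
            # a missing key column is reported instead of raising KeyError
            row[key] = str(frame[key].dtype) if key in frame.columns else 'missing'
        rows.append(row)
    return pd.DataFrame(rows)

# hypothetical sample: the second frame has an empty (all-NaN) 'time' column
good = pd.DataFrame({'time': ['00:00'], 'local_time': ['09:30'], 'value': [1.0]})
bad = pd.DataFrame({'time': [float('nan')], 'local_time': ['09:30'], 'value': [2.0]})

dtype_report = report_key_dtypes([good, bad])
print(dtype_report)
```

Running report_key_dtypes(df_list) on the real list should show which files were parsed with a numeric key column, which often points at a different header row or an empty column in that file.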

EDIT 3

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-cce982321079> in <module>
     11 
     12 # change datatype to datetime for first df
---> 13 df['local_time'] = pd.to_datetime(df_list[0]['local_time'])
     14 df['time'] = pd.to_datetime(df_list[0]['time'])
     15 

NameError: name 'df' is not defined

This seems like a good approach. I would just set up the merge and suffixes a little differently so you can loop through each dataframe, like below. Each new value column will be merged into df_output.

EDIT: Updated code to work with OP's edit

EDIT 2: Fixed datatype for OP's error

import pandas as pd
import glob

# create and sort list of file names in folder
fl = glob.glob('*.csv')
sorted_fl = sorted(fl)

# open csv files from list and store in df
df_list = [pd.read_csv(f, header=3) for f in sorted_fl]

# start from the first df, with its value column renamed to value_0
df_output = df_list[0].rename(columns={'value': 'value_0'})

# change datatype to datetime for first df
df_output['local_time'] = pd.to_datetime(df_output['local_time'])
df_output['time'] = pd.to_datetime(df_output['time'])

# loop through each remaining dataframe and merge it into df_output
for i, df in enumerate(df_list[1:], start=1):

  # change datatype to datetime so the merge keys have matching dtypes
  df['local_time'] = pd.to_datetime(df['local_time'])
  df['time'] = pd.to_datetime(df['time'])

  # give the value column a unique name before merging
  df = df.rename(columns={'value': 'value_' + str(i)})

  df_output = pd.merge(df_output, df, on=['time','local_time'], how='inner')

#print(df_output)
# illustrative shape of the result (the time columns display as
# full timestamps after pd.to_datetime)
'''
    time local_time  value_0  value_1  value_2  value_3
0  00:00      09:30  738.591  265.591  521.217  856.217
1  01:00      10:30  217.766  330.766  588.034  346.034
2  02:00      11:30  295.962  360.962  588.034  645.034
'''
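
As an alternative to merging in a loop, the value columns can also be lined up with pd.concat, which is what the error message in the question hints at. The following is only a sketch under a few assumptions: every frame has the 'time', 'local_time' and 'value' columns, each (time, local_time) pair occurs once per frame, and the key columns have the same dtype in every frame. The function name combine_value_columns and the labels 'A11' and 'B10' are made up for the example; the sample data is taken from the question.

```python
import pandas as pd

def combine_value_columns(frames, keys=('time', 'local_time')):
    """Combine the 'value' column of each frame side by side.

    frames maps a label (e.g. a location code) to a DataFrame that has
    the key columns plus a 'value' column.
    """
    keys = list(keys)
    columns = []
    for label, frame in frames.items():
        # index by the shared time columns and name the series after the label
        series = frame.set_index(keys)['value'].rename('value_' + label)
        columns.append(series)
    # join='inner' keeps only rows whose keys appear in every frame
    combined = pd.concat(columns, axis=1, join='inner')
    return combined.reset_index()

# sample data from the question
df_1 = pd.DataFrame({'time': ['00:00', '01:00', '02:00'],
                     'local_time': ['09:30', '10:30', '11:30'],
                     'value': ['265.591', '330.766', '360.962']})
df_2 = pd.DataFrame({'time': ['00:00', '01:00', '02:00'],
                     'local_time': ['09:30', '10:30', '11:30'],
                     'value': ['521.217', '588.034', '588.034']})

result = combine_value_columns({'A11': df_1, 'B10': df_2})
print(result)
```

With the real files, the dictionary could be built from sorted_fl, for example by using each file name without its extension as the label, if the file names carry the location codes. That naming scheme is an assumption, since the question does not show the file names.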
