
Python merge two csv files on multiple columns and nearest datetime

I have two csv files which I would like to merge.

File1:

rel_id, acc_id, value, timestamp
1, 2, True, 2016-01-04 19:20:22
2, 3, True, 2016-01-04 18:35:56
1, 2, True, 2016-01-04 20:43:12
1, 5, False, 2016-01-04 18:15:20
2, 3, True, 2016-01-04 20:43:11

File2:

rel_id, acc_id, value, timestamp
1, 2, 250, 2016-01-04 20:43:13
1, 5, 610, 2016-01-04 18:15:23
2, 3, 400, 2016-01-04 18:35:58
2, 3, 300, 2016-01-04 20:43:13
1, 2, 500, 2016-01-04 19:20:23

I would like to merge the two files based on rel_id, acc_id and timestamp.

Merged (file1 and file2):

rel_id, acc_id, value_file1, timestamp, value_file2
1, 2, True, 2016-01-04 19:20:22, 500
2, 3, True, 2016-01-04 18:35:56, 400
1, 2, True, 2016-01-04 20:43:12, 250
1, 5, False, 2016-01-04 18:15:20, 610
2, 3, True, 2016-01-04 20:43:11, 300

However, the timestamps in file2 are slightly later than the corresponding ones in file1.

Searching on Stack Overflow led me to this post: pandas merge dataframes by closest time

But I have no idea how to approach matching on rel_id, acc_id and the nearest timestamp.

import pandas as pd


file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')


file1.columns = ['rel_id', 'acc_id', 'value', 'timestamp']
file2.columns = ['rel_id', 'acc_id', 'value', 'timestamp']


file1['timestamp'] = pd.to_datetime(file1['timestamp'])
file2['timestamp'] = pd.to_datetime(file2['timestamp'])


file1_dt = pd.Series(file1["timestamp"].values, file1["timestamp"])
file1_dt.reindex(file2["timestamp"], method="nearest")
file2["nearest"] = file1_dt.reindex(file2["timestamp"], method="nearest").values

print(file2)

I tried the above code, based on the other post, but it doesn't match on rel_id and acc_id yet. On top of that, the code already raises an error:

ValueError: index must be monotonic increasing or decreasing

Any help is highly appreciated. Thanks.

You're trying to reindex based on unsorted indices. Assuming your CSV has no header:

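import pandas as pd

column_names = ['rel_id', 'acc_id', 'value', 'timestamp']
# Read each file with timestamp as a sorted DatetimeIndex.
# parse_dates takes a list of column names, not a bare string.
# skipinitialspace handles the space after each comma in the sample data.
file1 = pd.read_csv('file1.csv',
                    index_col=['timestamp'],
                    parse_dates=['timestamp'],
                    header=None,
                    names=column_names,
                    skipinitialspace=True).sort_index()
file2 = pd.read_csv('file2.csv',
                    index_col=['timestamp'],
                    parse_dates=['timestamp'],
                    header=None,
                    names=column_names,
                    skipinitialspace=True).sort_index()
# Replace file1's timestamps with the (sorted) timestamps of file2.
file1.set_index(file1.reindex(file2.index, method='nearest').index, inplace=True)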



                     rel_id  acc_id  value
timestamp
2016-01-04 18:15:23       1       5  False
2016-01-04 18:35:58       2       3   True
2016-01-04 19:20:23       1       2   True
2016-01-04 20:43:13       2       3   True
2016-01-04 20:43:13       1       2   True
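To see why the sort_index() calls matter, here is a small sketch on toy data (the series and its values are made up for illustration, only the timestamps are borrowed from the question). It suggests that reindex with method="nearest" fails on an out-of-order index and works once the index is sorted:

```python
import pandas as pd

# Toy series whose datetime index is deliberately out of order.
idx = pd.to_datetime(["2016-01-04 19:20:22",
                      "2016-01-04 18:35:56",
                      "2016-01-04 20:43:12"])
unsorted = pd.Series(["a", "b", "c"], index=idx)

# Timestamps we want to look up by nearest neighbour.
target = pd.to_datetime(["2016-01-04 18:35:58", "2016-01-04 20:43:13"])

# On the unsorted index the nearest-lookup is rejected.
try:
    unsorted.reindex(target, method="nearest")
    raised = False
except ValueError:
    raised = True

# After sorting, each target picks up the value at the closest timestamp.
nearest = unsorted.sort_index().reindex(target, method="nearest")
print(raised)
print(nearest)
```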

And merge file1 and file2: 并合并file1和file2:

file1.reset_index().merge(file2.reset_index(), on=['acc_id', 'rel_id', 'timestamp']).set_index('timestamp')
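As an alternative sketch that is not part of the answer above: assuming a pandas version that provides pd.merge_asof with the direction parameter, the match on rel_id and acc_id plus the nearest timestamp can be expressed in one call. The data is inlined from the question so the example runs on its own; unlike the answer above, it assumes the files do have a header row, as shown in the question. The load helper is a hypothetical name used only here.

```python
import io
import pandas as pd

FILE1 = """rel_id, acc_id, value, timestamp
1, 2, True, 2016-01-04 19:20:22
2, 3, True, 2016-01-04 18:35:56
1, 2, True, 2016-01-04 20:43:12
1, 5, False, 2016-01-04 18:15:20
2, 3, True, 2016-01-04 20:43:11
"""

FILE2 = """rel_id, acc_id, value, timestamp
1, 2, 250, 2016-01-04 20:43:13
1, 5, 610, 2016-01-04 18:15:23
2, 3, 400, 2016-01-04 18:35:58
2, 3, 300, 2016-01-04 20:43:13
1, 2, 500, 2016-01-04 19:20:23
"""

def load(text):
    # Hypothetical helper: parse the inline CSV text, skipping the space
    # after each comma and converting the timestamp column to datetimes.
    return pd.read_csv(io.StringIO(text),
                       skipinitialspace=True,
                       parse_dates=["timestamp"])

# merge_asof requires both frames to be sorted by the "on" key.
file1 = load(FILE1).sort_values("timestamp")
file2 = load(FILE2).sort_values("timestamp")

# Exact match on rel_id and acc_id, nearest match on timestamp.
merged = pd.merge_asof(file1, file2,
                       on="timestamp",
                       by=["rel_id", "acc_id"],
                       direction="nearest",
                       suffixes=("_file1", "_file2"))
print(merged)
```

The result keeps the timestamps of file1 and is ordered by timestamp rather than by the original row order. If matches that are too far apart should be rejected, merge_asof also accepts a tolerance argument (a pd.Timedelta); whether that is needed depends on the data.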
