
Merge two pandas data frames, one has infrequent dates and should be merged by the most recent date

a and b are pandas data frames, and a updates less frequently than b. For example:

import numpy as np
import pandas as pd

a = pd.DataFrame({'id': np.array([1, 3, 4, 9]*2),
                  'date': np.repeat(['2021-01-03', '2021-02-06'], 4),
                  'score': np.linspace(0, 1, 8)})
a['date'] = pd.to_datetime(a['date'])

b = pd.DataFrame({'id': np.array([1, 3, 4, 9]*5),
                  'date': np.repeat(['2021-01-03', '2021-01-15', '2021-01-23', '2021-02-08', '2021-02-17'], 4),
                  'value': np.linspace(0, 1, 20)})
b['date'] = pd.to_datetime(b['date'])

I want to merge the two frames by matching the ids, and by matching each date in b with the most recent date in a that is not later than it. In this example I want the following pairings of the dates for the merge:

b          -> a
2021-01-03 -> 2021-01-03
2021-01-15 -> 2021-01-03
2021-01-23 -> 2021-01-03
2021-02-08 -> 2021-02-06
2021-02-17 -> 2021-02-06

I can do this with a for-loop over each of the dates in a: select the data in b that lies between each pair of adjacent dates in a, add the score from a as a new column, and then concatenate these frames together. Is there a faster way to do this?
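For reference, the loop described above might look roughly like the sketch below. The function name merge_by_loop and the choice of a left merge on id inside each date window are assumptions made for illustration, not the asker's actual code. Note that this version silently drops any rows of b dated before the first date in a.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'id': np.array([1, 3, 4, 9] * 2),
                  'date': pd.to_datetime(np.repeat(['2021-01-03', '2021-02-06'], 4)),
                  'score': np.linspace(0, 1, 8)})
b = pd.DataFrame({'id': np.array([1, 3, 4, 9] * 5),
                  'date': pd.to_datetime(np.repeat(['2021-01-03', '2021-01-15', '2021-01-23',
                                                    '2021-02-08', '2021-02-17'], 4)),
                  'value': np.linspace(0, 1, 20)})


def merge_by_loop(a, b):
    """Hypothetical baseline: one pass over b per distinct date in a."""
    a_dates = a['date'].drop_duplicates().sort_values().tolist()
    pieces = []
    # Pair each date in a with the next one; the last window is open-ended.
    for start, end in zip(a_dates, a_dates[1:] + [None]):
        in_window = b['date'] >= start
        if end is not None:
            in_window &= b['date'] < end
        # Scores recorded in a on the window's start date, one row per id.
        scores = a.loc[a['date'] == start, ['id', 'score']]
        pieces.append(b[in_window].merge(scores, on='id', how='left'))
    return pd.concat(pieces, ignore_index=True)


loop_df = merge_by_loop(a, b)
print(loop_df.head())
```

The cost grows with the number of distinct dates in a, since every iteration scans all of b.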

Use merge_asof with the on and by parameters:

df = pd.merge_asof(b, a, on='date', by='id')
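One detail worth keeping in mind: merge_asof expects both frames to be sorted by the on key and raises an error otherwise, and its default direction is 'backward', which is what picks the most recent earlier-or-equal date. The sketch below makes both explicit; shuffling b first is only an assumption added here to show why the sort matters.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'id': np.array([1, 3, 4, 9] * 2),
                  'date': pd.to_datetime(np.repeat(['2021-01-03', '2021-02-06'], 4)),
                  'score': np.linspace(0, 1, 8)})
b = pd.DataFrame({'id': np.array([1, 3, 4, 9] * 5),
                  'date': pd.to_datetime(np.repeat(['2021-01-03', '2021-01-15', '2021-01-23',
                                                    '2021-02-08', '2021-02-17'], 4)),
                  'value': np.linspace(0, 1, 20)})

# Simulate input that is not in date order (illustrative assumption).
b_shuffled = b.sample(frac=1, random_state=0)

# Sort both sides on the key, then match each row of b to the latest row of a
# with the same id whose date is less than or equal to the date in b.
df = pd.merge_asof(b_shuffled.sort_values('date'),
                   a.sort_values('date'),
                   on='date', by='id', direction='backward')
print(df)
```

Rows of b with no earlier-or-equal date in a for the same id would get NaN in score rather than being dropped.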

To test this, the date column in b is renamed to date1, so that both the date from b and the matched date from a appear in the output:

a = pd.DataFrame({'id': np.array([1, 3, 4, 9]*2),
                  'date': np.repeat(['2021-01-03', '2021-02-06'], 4),
                  'score': np.linspace(0, 1, 8)})
a['date'] = pd.to_datetime(a['date'])

b = pd.DataFrame({'id': np.array([1, 3, 4, 9]*5),
                  'date1': np.repeat(['2021-01-03', '2021-01-15', '2021-01-23', '2021-02-08', '2021-02-17'], 4),
                  'value': np.linspace(0, 1, 20)})
b['date1'] = pd.to_datetime(b['date1'])

df = pd.merge_asof(b, a, left_on='date1', right_on='date', by='id')
print(df)
    id      date1     value       date     score
0    1 2021-01-03  0.000000 2021-01-03  0.000000
1    3 2021-01-03  0.052632 2021-01-03  0.142857
2    4 2021-01-03  0.105263 2021-01-03  0.285714
3    9 2021-01-03  0.157895 2021-01-03  0.428571
4    1 2021-01-15  0.210526 2021-01-03  0.000000
5    3 2021-01-15  0.263158 2021-01-03  0.142857
6    4 2021-01-15  0.315789 2021-01-03  0.285714
7    9 2021-01-15  0.368421 2021-01-03  0.428571
8    1 2021-01-23  0.421053 2021-01-03  0.000000
9    3 2021-01-23  0.473684 2021-01-03  0.142857
10   4 2021-01-23  0.526316 2021-01-03  0.285714
11   9 2021-01-23  0.578947 2021-01-03  0.428571
12   1 2021-02-08  0.631579 2021-02-06  0.571429
13   3 2021-02-08  0.684211 2021-02-06  0.714286
14   4 2021-02-08  0.736842 2021-02-06  0.857143
15   9 2021-02-08  0.789474 2021-02-06  1.000000
16   1 2021-02-17  0.842105 2021-02-06  0.571429
17   3 2021-02-17  0.894737 2021-02-06  0.714286
18   4 2021-02-17  0.947368 2021-02-06  0.857143
19   9 2021-02-17  1.000000 2021-02-06  1.000000
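As a quick sanity check, the distinct (date1, date) pairs in the merged frame can be compared with the pairing table from the question. This is only a sketch; the variable name pairs is introduced here for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'id': np.array([1, 3, 4, 9] * 2),
                  'date': pd.to_datetime(np.repeat(['2021-01-03', '2021-02-06'], 4)),
                  'score': np.linspace(0, 1, 8)})
b = pd.DataFrame({'id': np.array([1, 3, 4, 9] * 5),
                  'date1': pd.to_datetime(np.repeat(['2021-01-03', '2021-01-15', '2021-01-23',
                                                     '2021-02-08', '2021-02-17'], 4)),
                  'value': np.linspace(0, 1, 20)})

df = pd.merge_asof(b, a, left_on='date1', right_on='date', by='id')

# One row per distinct pairing of a date in b with its matched date in a.
pairs = df[['date1', 'date']].drop_duplicates().reset_index(drop=True)
print(pairs)
```

If the merge behaves as intended, pairs should contain the five b -> a pairings listed in the question.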
