
How to fill a date column in one dataframe with the nearest dates from another dataframe

I have a dataframe visit =

visit_occurrence_id  visit_start_date  person_id
    1                2016-06-01        1
    2                2019-05-01        2
    3                2016-01-22        1
    4                2017-02-14        2
    5                2018-05-11        3

and another dataframe measurement =

measurement_date    person_id   visit_occurrence_id
2017-09-04          1           NaN
2018-04-24          2           NaN
2018-05-22          2           NaN
2019-02-02          1           NaN
2019-01-28          3           NaN
2019-05-07          1           NaN
2018-12-11          3           NaN
2017-04-28          3           NaN

I want to fill visit_occurrence_id in the measurement table with the visit_occurrence_id from the visit table, matching on person_id and taking the visit whose visit_start_date is nearest to the measurement_date.

I have written some code, but it takes a lot of time.

measurement has 7*10^5 rows.

Note: visit_start_date and measurement_date are of object dtype.
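Because both date columns are stored as objects, date arithmetic has to go through Python-level parsing unless the columns are converted first. A minimal sketch, assuming the dates are 'YYYY-MM-DD' strings as in the constructors given later in the question:

```python
import pandas as pd

# Small frames shaped like the ones in the question (dates stored as strings).
visit = pd.DataFrame({
    'visit_occurrence_id': [1, 2, 3, 4, 5],
    'visit_start_date': ['2016-06-01', '2019-05-01', '2016-01-22',
                         '2017-02-14', '2018-05-11'],
    'person_id': [1, 2, 1, 2, 3],
})
measurement = pd.DataFrame({
    'measurement_date': ['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02',
                         '2019-01-28', '2019-05-07', '2018-12-11', '2017-04-28'],
    'person_id': [1, 2, 2, 1, 3, 1, 3, 3],
})

# Convert the object columns to a real datetime dtype once, up front.
visit['visit_start_date'] = pd.to_datetime(
    visit['visit_start_date'], format='%Y-%m-%d')
measurement['measurement_date'] = pd.to_datetime(
    measurement['measurement_date'], format='%Y-%m-%d')

# Subtraction is now vectorized over the whole column.
first_visit = visit['visit_start_date'].iloc[0]
gap = (measurement['measurement_date'] - first_visit).abs()
print(gap.head())
```

After the conversion, differences between the two columns are timedeltas that can be compared and minimized without calling strptime on every row.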

My code:

import datetime as dt
import pandas as pd

unique_person_list = measurement['person_id'].unique().tolist()

def nearest_date(row,date_list):
    date_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in date_list]
    row = min(date_list, key=lambda x: abs(x - row))
    return row

modified_measurement = pd.DataFrame(columns = measurement.columns)

for person in unique_person_list:
    near_visit_dates =  visit[visit['person_id']==person]['visit_start_date'].tolist()
    if near_visit_dates:
        near_visit_dates = list(filter(None, near_visit_dates))
        near_visit_dates = [i.strftime('%Y-%m-%d') for i in near_visit_dates]
        store_dates = measurement.loc[measurement['person_id']== person]['measurement_date']
        store_dates= store_dates.apply(nearest_date, args=(near_visit_dates,))
        modified_measurement = modified_measurement.append(store_dates)

My code's execution time is quite high. Can you help me either reduce the time complexity or find another solution?

Edit: adding dataframe constructors.

import numpy as np
import pandas as pd

measurement = {'measurement_date':['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02', 
                                   '2019-01-28', '2019-05-07', '2018-12-11','2017-04-28'],
        'person_id':[1, 2, 2, 1, 3, 1, 3, 3],'visit_occurrence_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}

visit = {'visit_occurrence_id':[1, 2, 3, 4, 5], 
         'visit_start_date':['2016-06-01', '2019-05-01', '2016-01-22', '2017-02-14', '2018-05-11'],
         'person_id':[1, 2, 1, 2, 3]}

# Create DataFrame
measurement = pd.DataFrame(measurement)
visit = pd.DataFrame(visit)

You can do the following:

import datetime

# Pair every measurement with all visits of the same person
df=pd.merge(measurement[["person_id", "measurement_date"]], visit, on="person_id", how="inner")
# Absolute gap between the visit date and the measurement date
df["dt_diff"]=df[["visit_start_date", "measurement_date"]].apply(lambda x: abs(datetime.datetime.strptime(x["visit_start_date"], '%Y-%m-%d').date() - datetime.datetime.strptime(x["measurement_date"], '%Y-%m-%d').date()), axis=1)
# Keep only the pair(s) with the smallest gap for each measurement
df=pd.merge(df, df.groupby(["person_id", "measurement_date"])["dt_diff"].min(), on=["person_id", "dt_diff", "measurement_date"], how="inner")
# Attach the matched visit id to the original measurement rows
res=pd.merge(measurement, df, on=["measurement_date", "person_id"], suffixes=["", "_2"])[["measurement_date", "person_id", "visit_occurrence_id_2"]]

Output:

  measurement_date  person_id  visit_occurrence_id_2
0       2017-09-04          1                      1
1       2018-04-24          2                      2
2       2018-05-22          2                      2
3       2019-02-02          1                      1
4       2019-01-28          3                      5
5       2019-05-07          1                      1
6       2018-12-11          3                      5
7       2017-04-28          3                      5

Here's what I've come up with:

# The dates are stored as strings (object dtype); convert them so they can be subtracted
measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])
# Get all visit start dates for each measurement of the same person
df = measurement.drop('visit_occurrence_id', axis=1).merge(visit, on='person_id')
df['date_difference'] = abs(df.measurement_date - df.visit_start_date)
# Find the smallest date difference for each person_id - measurement_date pair
df['smallest_difference'] = df.groupby(['person_id', 'measurement_date'])['date_difference'].transform('min')
df = df[df.date_difference == df.smallest_difference]
df = df[['measurement_date', 'person_id', 'visit_occurrence_id']]
# Fill in visit_occurrence_id from original dataframe
measurement.drop("visit_occurrence_id", axis=1).merge(
    df, on=["measurement_date", "person_id"]
)

This produces:

|    | measurement_date   |   person_id |   visit_occurrence_id |
|---:|:-------------------|------------:|----------------------:|
|  0 | 2017-09-04         |           1 |                     1 |
|  1 | 2018-04-24         |           2 |                     2 |
|  2 | 2018-05-22         |           2 |                     2 |
|  3 | 2019-02-02         |           1 |                     1 |
|  4 | 2019-01-28         |           3 |                     5 |
|  5 | 2019-05-07         |           1 |                     1 |
|  6 | 2018-12-11         |           3 |                     5 |
|  7 | 2017-04-28         |           3 |                     5 |
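One caveat with filtering on "difference equals the group minimum": if two visits happen to be equally far from a measurement, both rows survive and the measurement appears twice. A possible variation, sketched here under the assumption that any one of the tied visits is acceptable, is to pick a single row per measurement with idxmin. The `row` column, `days_apart` column and `result` name are illustrative and not part of the original answer.

```python
import pandas as pd

visit = pd.DataFrame({
    'visit_occurrence_id': [1, 2, 3, 4, 5],
    'visit_start_date': pd.to_datetime(['2016-06-01', '2019-05-01', '2016-01-22',
                                        '2017-02-14', '2018-05-11']),
    'person_id': [1, 2, 1, 2, 3],
})
measurement = pd.DataFrame({
    'measurement_date': pd.to_datetime(['2017-09-04', '2018-04-24', '2018-05-22',
                                        '2019-02-02', '2019-01-28', '2019-05-07',
                                        '2018-12-11', '2017-04-28']),
    'person_id': [1, 2, 2, 1, 3, 1, 3, 3],
})

# Remember each measurement's original row label in a "row" column,
# then pair it with every visit of the same person.
pairs = (measurement.rename_axis('row').reset_index()
         .merge(visit, on='person_id'))

# Gap in whole days between the measurement and each candidate visit.
pairs['days_apart'] = (pairs['measurement_date']
                       - pairs['visit_start_date']).abs().dt.days

# idxmin gives the label of the first smallest gap per measurement,
# so exactly one visit is kept even when several are tied.
closest = pairs.loc[pairs.groupby('row')['days_apart'].idxmin()]

result = (closest.set_index('row')
          [['measurement_date', 'person_id', 'visit_occurrence_id']]
          .sort_index())
print(result)
```

Note that the inner merge still drops measurements whose person has no visit at all; a left merge would be needed to keep them with a missing visit_occurrence_id.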

I believe there's probably a cleaner way of writing this using sklearn: https://scikit-learn.org/stable/modules/neighbors.html
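Staying within pandas, another option that neither answer uses is pd.merge_asof with direction='nearest' and by='person_id'. It avoids building every measurement-visit pair, which may matter at 7*10^5 rows. The following is a sketch on the question's sample data, not a benchmarked solution. The `left`, `right`, `matched` and `result` names are illustrative.

```python
import pandas as pd

visit = pd.DataFrame({
    'visit_occurrence_id': [1, 2, 3, 4, 5],
    'visit_start_date': pd.to_datetime(['2016-06-01', '2019-05-01', '2016-01-22',
                                        '2017-02-14', '2018-05-11']),
    'person_id': [1, 2, 1, 2, 3],
})
measurement = pd.DataFrame({
    'measurement_date': pd.to_datetime(['2017-09-04', '2018-04-24', '2018-05-22',
                                        '2019-02-02', '2019-01-28', '2019-05-07',
                                        '2018-12-11', '2017-04-28']),
    'person_id': [1, 2, 2, 1, 3, 1, 3, 3],
})

# merge_asof requires both frames to be sorted by their date key.
# Keep the original row labels in a "row" column so the order can be restored.
left = (measurement.rename_axis('row').reset_index()
        .sort_values('measurement_date'))
right = visit.sort_values('visit_start_date')

# For each measurement, take the same person's visit with the nearest date.
matched = pd.merge_asof(
    left, right,
    left_on='measurement_date', right_on='visit_start_date',
    by='person_id', direction='nearest',
)

# Restore the original measurement order.
result = matched.set_index('row').sort_index()
print(result[['measurement_date', 'person_id', 'visit_occurrence_id']])
```

Measurements for a person with no visits are kept with a missing visit_occurrence_id, because merge_asof behaves like a left join.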


