
Update a dataframe based on the values of another

I have a dataframe consisting of IDs and dates. An ID may have several dates; the rows are sorted by ID and, within each ID, by date.

AccidentDates

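My second dataframe contains an ID, a start date, a finish date, a boolean column Accident (indicating whether an accident occurred) and a Time (time-to-event) column. The last two columns are initially set to 0. Again, the rows are sorted by ID and, within each ID, by time interval.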

PatientLog

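I want to update those two columns of the second dataframe based on the accidents recorded in the first dataframe. If an ID exists in both dataframes (it doesn't have to), check whether any accident was recorded within any of that ID's time intervals in the second dataframe.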

If so, find the interval in which it occurred, set the Accident column to 1 and set Time = df1.Date - df2.Start. If not, set Accident = 0 and Time = df2.Finish - df2.Start for that entry of the patient.
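To make the rule concrete, here is a minimal sketch of it for a single interval in plain Python. The function name `time_to_event` and the sample dates are hypothetical, chosen only for illustration; it assumes the interval bounds are inclusive and that only the earliest accident inside the interval counts.

```python
from datetime import date, timedelta


def time_to_event(start, finish, accident_dates):
    """Return (accident_flag, time) for one interval of one patient.

    Hypothetical helper: bounds are treated as inclusive, and only the
    earliest accident inside the interval is used.
    """
    inside = [d for d in accident_dates if start <= d <= finish]
    if inside:
        # An accident happened in this interval: time runs from Start to it
        return 1, min(inside) - start
    # No accident: time is the full length of the interval
    return 0, finish - start


# Two accidents fall inside January; only the first one is used
flag, time = time_to_event(
    date(2020, 1, 1), date(2020, 1, 31),
    [date(2020, 1, 10), date(2020, 1, 20)],
)
```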

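I managed to do this with lists and loops. However, I was wondering whether there is a smarter way, since the amount of data is huge and the whole process takes a long time to complete. Thanks in advance!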

from datetime import datetime

import pandas as pd

# Temporary lists
df1list = []
df2list = []

# Change format from dataframe to list
for row in df1.itertuples(index=True, name='Pandas'):

    # Get Patient ID and the date of the recorded accident
    df1list.append([getattr(row, "Patient"), getattr(row, "regdatum")])


# Change format from dataframe to list
for row in df2.itertuples(index=True, name='Pandas'):

    # Get Patient ID, info, occurrence of accident and time to event
    df2list.append([getattr(row, "Patient"), getattr(row, "Start"), getattr(row, "Finish"), getattr(row, "Gender"),
                   getattr(row, "Age"), getattr(row, "Accident"), getattr(row, "Time")])


#For each interval of each patient
for i in range(0, len(df2list)):

    #For each recorded accident of each patient
    for j in range(0, len(df1list)):

        #If there's a match in both lists
        if df2list[i][0] == df1list[j][0]:

            #If the recorded date is in between the time interval
            if datetime.strptime(df2list[i][1], '%Y-%m-%d') <= df1list[j][1] <= datetime.strptime(df2list[i][2], '%Y-%m-%d'):

                #Change the accident column to 1 and calculate the time to event
                #The extra if keeps only the first accident recorded within the interval (if there are multiple, later ones are ignored)
                #Each row is [Patient, Start, Finish, Gender, Age, Accident, Time], so Accident is index 5 and Time is index 6
                if df2list[i][5] == 0:
                    df2list[i][5] = 1
                    df2list[i][6] = df1list[j][1] - datetime.strptime(df2list[i][1], '%Y-%m-%d')

#Back to dfs
labels = ['Patient', 'Start', 'Finish', 'Gender', 'Age', 'Accident', 'Time']
df = pd.DataFrame.from_records(df2list, columns=labels)

Here is how I would do it.

import numpy as np
import pandas as pd

# Define a pair of functions that return the unique start and end dates for a given patient
def start_dates(patient):
    try:
        return df2.loc[df2['Patient'] == patient]['Start'].unique()
    except:
        return np.datetime64("NaT")

def finish_dates(patient):
    try:
        return df2.loc[df2['Patient'] == patient]['Finish'].unique()
    except:
        return np.datetime64("NaT")

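# Add 'Start' and 'Finish' columns to df1 and fill them with the bounds of the interval around each accident
# ('Accident Date' is the accident date column, called 'regdatum' in the question)
# Note: max()/min() raise ValueError if a patient has no start date before, or finish date after, the accident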
df1['Start'] = list(zip(df1['Patient'], df1['Accident Date']))
df1['Start'] = df1['Start'].apply(lambda x: max([d for d in start_dates(x[0]) if d <= np.datetime64(x[1])]))
df1['Finish'] = list(zip(df1['Patient'], df1['Accident Date']))
df1['Finish'] = df1['Finish'].apply(lambda x: min([d for d in finish_dates(x[0]) if d >= np.datetime64(x[1])]))

# Merge the two DataFrames
df2 = df2.merge(df1, how='outer')

# Fill the 'Accident' column: 1 where the merge found an accident date, 0 otherwise
# (selecting the column by name avoids depending on its position after the merge)
df2['Accident'] = df2['Accident Date'].notna().astype(int)

# Fill NaT fields in 'Accident Date' with 'Finish'
df2 = df2.fillna({'Accident Date': df2['Finish']})

# Fill 'Time' appropriately
df2['Time'] = df2['Accident Date'] - df2['Start']

# Drop the 'Accident Date' column
df2 = df2.drop(columns=['Accident Date'])

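This works on some dummy data I created, and I think it should work on yours as well. I doubt it is the most efficient way to do this (I am far from a pandas expert), but I think it is generally better than using loops.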
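For comparison, below is a hedged sketch of a more vectorized variant that avoids both the Python loops and the per-row `apply`. It is not the method above but a swapped-in merge-and-filter approach: every interval is paired with every accident of the same patient, pairs outside the interval are dropped, and the earliest remaining accident per interval is kept. The function name `label_accidents`, the helper column `row_id` and the toy data are assumptions made for illustration; the sketch also assumes `Start`, `Finish` and `regdatum` are already datetime columns.

```python
import pandas as pd


def label_accidents(accidents, log):
    """Return a copy of `log` with 'Accident' and 'Time' filled in.

    accidents: columns 'Patient' and 'regdatum' (datetime64)
    log:       columns 'Patient', 'Start', 'Finish' (datetime64), one row per interval
    """
    out = log.reset_index(drop=True).copy()
    out["row_id"] = out.index  # hypothetical helper column identifying each interval

    # Pair every interval with every accident of the same patient
    pairs = out[["row_id", "Patient", "Start", "Finish"]].merge(
        accidents[["Patient", "regdatum"]], on="Patient", how="inner"
    )
    # Keep only accidents that fall inside the interval (bounds inclusive)
    inside = pairs[pairs["regdatum"].between(pairs["Start"], pairs["Finish"])]
    # Earliest accident per interval
    first = inside.groupby("row_id")["regdatum"].min()

    first_date = out["row_id"].map(first)  # NaT where the interval has no accident
    out["Accident"] = first_date.notna().astype(int)
    # Time to the first accident, or the full interval length if there was none
    out["Time"] = first_date.fillna(out["Finish"]) - out["Start"]
    return out.drop(columns="row_id")


# Toy data (made up): patient 1 has two accidents in January, patient 2 has none,
# and patient 3 appears only in the accident table
accidents = pd.DataFrame({
    "Patient": [1, 1, 3],
    "regdatum": pd.to_datetime(["2020-01-10", "2020-01-20", "2020-05-05"]),
})
log = pd.DataFrame({
    "Patient": [1, 1, 2],
    "Start": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01"]),
    "Finish": pd.to_datetime(["2020-01-31", "2020-02-28", "2020-03-01"]),
})
result = label_accidents(accidents, log)
```

The intermediate `pairs` frame has one row per (interval, accident) combination for each patient, so it can get large when patients have many intervals and many accidents; on very large data it may be worth processing patients in chunks.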
