
Speed up operations over Python Pandas dataframes

I would like to speed up a loop over a Python Pandas DataFrame. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: the data is private, but I will see if I can fabricate something and add it in an edit if it helps.

The code has three pandas dataframes: drugUseDF; tempDF, which holds the data; and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieving the prodcode from each row and then using it to retrieve the corresponding value from the use1 column in drugUseDF. I've added comments to help navigate.

This is the structure of the dataframes:

tempDF

   patid   eventdate consid prodcode issueseq
0  20001  21/04/2005   2728       85        0
1  25001  21/10/2000   3939       40        0
2  25001  21/02/2001   3950       37        0

drugUseDF

   index prodcode  ...      use1 use2
0    171      479  ...  diabetes  NaN
1    172     9105  ...  diabetes  NaN
2    173     5174  ...  diabetes  NaN

tempDrugUse

  use1
0  NaN
1  NaN
2  NaN

This is the code:

from pandas import DataFrame

dfList = []

# If the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:

    # Predefine the dataframe where we will store the results, with the same length as the main data dataframe.
    # (np.str was only an alias for the built-in str and has been removed from recent NumPy releases.)
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=str, columns=["use1"])

    # Go through each row of the main data dataframe.
    for ind in range(len(tempDF)):

        # Retrieve the prodcode from the *ind* row of the main data dataframe.
        prodcodeStr = tempDF.iloc[ind]["prodcode"]

        # Get the corresponding value from the use1 column matching the prodcode column.
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]

        # Update the storing dataframe (a single .iloc call avoids chained assignment).
        tempDrugUse.iloc[ind, 0] = useStr

    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)

The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.

When trying to combine data from two dataframes, you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows; use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
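A minimal sketch of that advice, using made-up rows shaped like the question's dataframes (all values here are assumptions, not the real data). As far as I know, a left merge keeps the rows of the left dataframe in their original order, and sort_values can make the chronological order explicit if you want to be sure:

```python
import pandas as pd

# Made-up rows shaped like the question's tempDF (illustrative values only).
tempDF = pd.DataFrame({
    "patid": [20001, 25001, 25001],
    "eventdate": ["21/04/2005", "21/10/2000", "21/02/2001"],
    "prodcode": [85, 40, 37],
})

# Made-up lookup table shaped like drugUseDF, with one row per prodcode.
drugUseDF = pd.DataFrame({
    "prodcode": [37, 40, 85],
    "use1": ["gout", "hypertonia", "diabetes"],
})

# Left merge: every tempDF row is kept and gets its matching use1 value.
merged = tempDF.merge(drugUseDF[["prodcode", "use1"]], on="prodcode", how="left")

# Optional explicit ordering: parse the day/month/year strings, then sort
# by patient and date.
merged["eventdate"] = pd.to_datetime(merged["eventdate"], format="%d/%m/%Y")
ordered = merged.sort_values(["patid", "eventdate"]).reset_index(drop=True)
```

If drugUseDF could contain the same prodcode more than once, the merge would duplicate the matching tempDF rows, so it may be worth dropping duplicates from the lookup table first.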

If I understand you correctly, you want to match rows from both tables on prodcode. You can do this via pd.merge (please note that the example in the code below differs from your data):

import pandas as pd

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101, 102, 103]})
drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
# A left join keeps every row of tempDF and adds the matching use1 value.
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
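If only the use1 column is needed, as with tempDrugUse in the question, a Series.map lookup is one possible alternative to the merge. The sketch below reuses the illustrative values from the example above and adds a hypothetical unmatched prodcode (999) to show what happens when there is no match. It also swaps the column check for a plain membership test, and assumes that taking the first match per prodcode is acceptable, as the original loop does with .values[0]:

```python
import pandas as pd

tempDF = pd.DataFrame({"patid": [20001, 25001, 25001],
                       "prodcode": [103, 101, 999]})  # 999 is a made-up code with no match
drugUseDF = pd.DataFrame({"prodcode": [101, 102, 103],
                          "use1": ["diabetes", "hypertonia", "gout"]})

dfList = []

# Membership test on the column index instead of sum(columns.isin([...])) == 1.
if "use1" in drugUseDF.columns:
    # Build a prodcode -> use1 lookup Series, keeping the first row per prodcode.
    lookup = drugUseDF.drop_duplicates("prodcode").set_index("prodcode")["use1"]

    # map() looks up each row's prodcode; the row order and index of tempDF are kept.
    tempDrugUse = tempDF["prodcode"].map(lookup).to_frame("use1")
    dfList.append(tempDrugUse)
```

Unlike the loop, which raises an IndexError when a prodcode is missing from drugUseDF, this leaves NaN in the unmatched rows.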
