Speed up operations over Python Pandas dataframes
I would like to speed up a loop over a Python Pandas dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: the data is private, but I will see if I can fabricate something and add it in an edit if it helps.
The code has three pandas dataframes: drugUseDF, tempDF, which holds the data, and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieving the prodcode from each row and then using it to retrieve the corresponding value from the use1 column of drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:

tempDF

       patid   eventdate  consid  prodcode  issueseq
    0  20001  21/04/2005    2728        85         0
    1  25001  21/10/2000    3939        40         0
    2  25001  21/02/2001    3950        37         0

drugUseDF

       index  prodcode  ...      use1  use2
    0    171       479  ...  diabetes   NaN
    1    172      9105  ...  diabetes   NaN
    2    173      5174  ...  diabetes   NaN

tempDrugUse

      use1
    0  NaN
    1  NaN
    2  NaN

This is the code:
    from pandas import DataFrame

    dfList = []
    # Check whether the drug dataframe contains the use1 column. Can this be improved?
    if sum(drugUseDF.columns.isin(["use1"])) == 1:
        # Predefine the dataframe where we will store the results, with the same length as the main data dataframe.
        # (np.str was only an alias of the built-in str and has been removed from recent NumPy releases.)
        tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=str, columns=["use1"])
        # Go through each row of the main data dataframe.
        for ind in range(len(tempDF)):
            # Retrieve the prodcode from row ind of the main data dataframe.
            prodcodeStr = tempDF.iloc[ind]["prodcode"]
            # Get the value from the use1 column of the row whose prodcode matches.
            useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
            # Update the storing dataframe (a single .loc call avoids chained assignment).
            tempDrugUse.loc[ind, "use1"] = useStr
        print("[DEBUG] End of loop for use1")
        dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.
When trying to combine data in two dataframes, you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows; use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
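A minimal sketch of that approach, using made-up rows shaped like the tables in the question (the values are illustrative, not the real data). It does a left merge on prodcode, then sorts by patient and parsed event date. The validate argument is an added assumption that each prodcode appears only once in drugUseDF; if that does not hold, the merge raises an error instead of silently duplicating rows.

```python
import pandas as pd

# Made-up rows shaped like the question's tempDF (not the real data).
tempDF = pd.DataFrame({
    "patid": [20001, 25001, 25001],
    "eventdate": ["21/04/2005", "21/02/2001", "21/10/2000"],
    "prodcode": [85, 37, 40],
})

# Made-up lookup table; assumes one row per prodcode.
drugUseDF = pd.DataFrame({
    "prodcode": [40, 85, 37],
    "use1": ["diabetes", "hypertonia", "gout"],
})

# Left merge: keep every row of tempDF and attach the matching use1 value.
merged = tempDF.merge(
    drugUseDF[["prodcode", "use1"]],
    on="prodcode",
    how="left",
    validate="many_to_one",  # assumption: prodcode is unique in drugUseDF
)

# Parse the day-first dates so the sort is chronological, not alphabetical.
merged["eventdate"] = pd.to_datetime(merged["eventdate"], format="%d/%m/%Y")
merged = merged.sort_values(["patid", "eventdate"]).reset_index(drop=True)
```

Sorting on the parsed dates matters here: sorting the raw dd/mm/yyyy strings would order rows by day of the month first.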
If I understand you correctly, you want to map the prodcode from both tables. You can do this via pd.merge (please note that the example in the code below differs from your data):
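    import pandas as pd

    # Example data (differs from the data in the question).
    tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                           'prodcode': [101, 102, 103]})
    drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                              'use1': ['diabetes', 'hypertonia', 'gout']})
    # Left join: every row of tempDF is kept and gets its matching use1.
    merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')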
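On the ordering concern from the question: with how='left', the pandas documentation describes the merge as preserving the key order of the left frame, so the rows of tempDF should come back in their original order. The sketch below checks that on illustrative data, including a prodcode (999, made up) that has no match. It also shows Series.map as an alternative to the merge for filling a one-column tempDrugUse; that is a different technique, and it assumes prodcode is unique in drugUseDF.

```python
import pandas as pd

# Illustrative data only; 999 is a made-up code with no match in drugUseDF.
tempDF = pd.DataFrame({"patid": [20001, 25001, 25001, 25001],
                       "prodcode": [103, 101, 999, 101]})
drugUseDF = pd.DataFrame({"prodcode": [101, 102, 103],
                          "use1": ["diabetes", "hypertonia", "gout"]})

# Left merge: one output row per tempDF row, in tempDF's order.
merged_df = pd.merge(tempDF, drugUseDF, on="prodcode", how="left")
order_kept = merged_df["prodcode"].tolist() == tempDF["prodcode"].tolist()

# Alternative: build a prodcode -> use1 lookup and map it over the column.
# Assumes prodcode is unique in drugUseDF; duplicates make map raise an error.
lookup = drugUseDF.set_index("prodcode")["use1"]
tempDrugUse = tempDF["prodcode"].map(lookup).to_frame("use1")
```

Unmatched codes end up as NaN in both versions, whereas the original loop would fail on them when indexing .values[0].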