How to speed up a nested loop over a pandas DataFrame in Python?
I have a pandas.DataFrame containing information on different securities. It has the columns "date", "security_id", "country", "factor_name" and "factor_value", where "factor_name" indicates whether "factor_value" is debt ("TD") or equity ("TE"). I have been asked to calculate the debt-to-equity ratio for each security in each country on each date. The only approach I can think of is a nested loop over the unique values of each column, but it takes forever to run. Is there any way to speed up my code?
    def get_DEratio(date, sec, country):
        # Filter the whole DataFrame for one (date, security, country) combination
        TE_lst = data[(data["date"] == date) & (data["security_id"] == sec)
                      & (data["country"] == country) & (data["factor_name"] == "TE")]["factor_value"].tolist()
        TD_lst = data[(data["date"] == date) & (data["security_id"] == sec)
                      & (data["country"] == country) & (data["factor_name"] == "TD")]["factor_value"].tolist()
        # Missing debt or equity: report 0
        if not TD_lst or not TE_lst:
            return 0
        TD, TE = TD_lst[0], TE_lst[0]
        if TD == 0 or TE == 0:
            return 0
        return TD / TE

    dates = data["date"].unique()
    securities = data["security_id"].unique()
    countries = data["country"].unique()
    for date in dates:
        for sec in securities:
            for country in countries:
                ratio = get_DEratio(date, sec, country)
Assume that your source DataFrame contains:

             date security_id country factor_name  factor_value
    0  2020-06-01          S1      C1          TE          10.0
    1  2020-06-01          S1      C1          TD          20.0
    2  2020-06-01          S2      C1          TE          12.0
    3  2020-06-01          S2      C1          TD          20.0
    4  2020-06-01          S1      C2          TE          12.0
    5  2020-06-01          S1      C2          TD          20.0
    6  2020-06-01          S2      C2          TE          14.0
    7  2020-06-01          S2      C2          TD          20.0
    8  2020-06-01          S3      C2          TE          14.0
    9  2020-06-01          S4      C2          TD          20.0
First, compute an auxiliary DataFrame with one column per factor:

    wrk = df.set_index(['date', 'security_id', 'country', 'factor_name'])\
        .factor_value.unstack()
The result is:

    factor_name                       TD    TE
    date       security_id country
    2020-06-01 S1          C1       20.0  10.0
                           C2       20.0  12.0
               S2          C1       20.0  12.0
                           C2       20.0  14.0
               S3          C2        NaN  14.0
               S4          C2       20.0   NaN
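Note that `unstack` raises a ValueError when the index contains duplicate entries, that is, when the same (date, security_id, country, factor_name) combination appears more than once. A minimal sketch of one alternative for that case, assuming that keeping the first value per combination is acceptable (this mirrors the `TD_lst[0]` in the question). The sample rows below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample with a duplicated (date, security, country, factor) key:
# TD appears twice for S1 / C1.
df = pd.DataFrame({
    "date": ["2020-06-01"] * 3,
    "security_id": ["S1", "S1", "S1"],
    "country": ["C1", "C1", "C1"],
    "factor_name": ["TE", "TD", "TD"],
    "factor_value": [10.0, 20.0, 25.0],
})

# pivot_table aggregates duplicates instead of raising;
# aggfunc="first" keeps the first value seen for each key.
wrk = df.pivot_table(
    index=["date", "security_id", "country"],
    columns="factor_name",
    values="factor_value",
    aggfunc="first",
)
```

The resulting `wrk` has the same layout as the one built with `unstack`, so the division step works unchanged.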
Then, to get the final result, run:

    result = wrk.TD.div(wrk.TE).fillna(0)
and you will get:

    date        security_id  country
    2020-06-01  S1           C1         2.000000
                             C2         1.666667
                S2           C1         1.666667
                             C2         1.428571
                S3           C2         0.000000
                S4           C2         0.000000
    dtype: float64
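One difference from the function in the question: `get_DEratio` returns 0 when TE is 0, whereas dividing by a zero TE in pandas gives `inf`, which `fillna(0)` leaves untouched. A hedged sketch of one way to reproduce the original behaviour, assuming 0 is the desired value in that case. The sample rows are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: S2 has zero equity (TE == 0).
df = pd.DataFrame({
    "date": ["2020-06-01"] * 4,
    "security_id": ["S1", "S1", "S2", "S2"],
    "country": ["C1"] * 4,
    "factor_name": ["TE", "TD", "TE", "TD"],
    "factor_value": [10.0, 20.0, 0.0, 20.0],
})

wrk = df.set_index(["date", "security_id", "country", "factor_name"])\
    .factor_value.unstack()

# x / 0 yields inf (and 0 / 0 yields NaN); turn inf into NaN first,
# then map every NaN (including missing TD or TE) to 0.
result = wrk.TD.div(wrk.TE).replace([np.inf, -np.inf], np.nan).fillna(0)
```

A zero TD needs no special handling, since 0 divided by a non-zero TE is already 0.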