Is there an efficient way to bypass a nested for loop?
I've got a nested for loop, and I'm wondering if there's a more efficient way to do this, code-wise.
My data looks similar to the following:
ID | DEAD | 2009-10 | ... | 2016-10
-----------------------------------------
1 | 2018-11 | 5.4 | ... | 6.5
2 | 2014-01 | 0.5 | ... | 5.2
...
N | 2008-11 | 8.6 | ... | 1.3
The goal is to replace the values with np.NaN as soon as a product expires (when column 'DEAD' < date); otherwise the values should remain the same.
ID | DEAD | 2009-10 | ... | 2016-10
-----------------------------------------
1 | 2018-11 | 5.4 | ... | 6.5
2 | 2014-01 | 0.5 | ... | NaN
...
N | 2008-11 | NaN | ... | NaN
My initial idea was to apply a nested for loop to check whether the condition 'DEAD' < date is reached. The method works for smaller N, but since my data includes over 20,000 rows and 400 columns, it requires too much time.
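import numpy as np
import pandas as pd

# Assumes 'ID' is the index, so column 0 is 'DEAD' and the date columns start at position 1.
time = df.columns[1:]  # take the date headers as an index
time = pd.DataFrame(time)
time.columns = ['Dummy']
time['Dummy'] = pd.to_datetime(time.Dummy)  # convert the column labels to datetime
df['DEAD'] = pd.to_datetime(df.DEAD)  # convert column 'DEAD' to datetime
lists = []
for i in range(len(time)):  # one pass per date column (397 in my data)
    row = []
    for j in range(len(df)):  # one pass per product (20,000 in my data)
        if time.iloc[i, 0] <= df.iloc[j, 0]:  # product has not expired at this date
            newlist = df.iloc[j, i + 1]
        else:
            newlist = np.nan  # the np.NaN alias was removed in NumPy 2.0
        row.append(newlist)
    lists.append(row)
lists = pd.DataFrame(lists)
lists = lists.transpose()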
Appreciate any suggestions!
You can try to iterate through each column instead:
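# Assumes 'ID' is the index and 'DEAD' is already datetime, so every other column label is a date.
for column_name in df.drop('DEAD', axis=1):
    column_date = pd.to_datetime(column_name)
    # Assign the result back; mask(..., inplace=True) on a selected column may not update df.
    df[column_name] = df[column_name].mask(df['DEAD'] < column_date)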
The mask method is also useful here: it replaces a value with NaN wherever the condition is True.
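For anyone who wants to try the column-wise approach on something small first, here is a minimal sketch. The sample frame is hypothetical (three products, two date columns, shaped like the table in the question), and it assumes 'ID' is the index and that the column labels are in '%Y-%m' format.

```python
import pandas as pd

# Hypothetical sample data shaped like the table in the question.
df = pd.DataFrame(
    {
        "DEAD": pd.to_datetime(["2018-11", "2014-01", "2008-11"], format="%Y-%m"),
        "2009-10": [5.4, 0.5, 8.6],
        "2016-10": [6.5, 5.2, 1.3],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

for column_name in df.columns.drop("DEAD"):
    column_date = pd.to_datetime(column_name, format="%Y-%m")
    # mask() returns a copy with NaN wherever the condition is True,
    # i.e. wherever the product expired before this column's date.
    df[column_name] = df[column_name].mask(df["DEAD"] < column_date)

print(df)
```

This still loops, but only once per date column (about 400 iterations) rather than once per cell, and each iteration is a vectorized comparison over all rows.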
If your columns are ordered (for example, in ascending order by date), then you could avoid some of the looping and checking. For each row, find the index i of the first column whose date is later than 'DEAD', then update all subsequent columns (index >= i) to the NaN value. The update itself is still being done cell-by-cell, which might not perform particularly well.
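A rough sketch of the ordered-columns idea, in case it helps. It swaps in np.searchsorted to find the cut-off position for every row at once, which is my own choice and not something the answer above prescribes. The sample data is hypothetical, and the sketch assumes 'ID' is the index and the date columns are sorted ascending.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; date columns ascend from left to right.
df = pd.DataFrame(
    {
        "DEAD": pd.to_datetime(["2018-11", "2014-01", "2008-11"], format="%Y-%m"),
        "2009-10": [5.4, 0.5, 8.6],
        "2013-06": [6.0, 2.1, 4.4],
        "2016-10": [6.5, 5.2, 1.3],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

date_cols = list(df.columns.drop("DEAD"))
col_dates = pd.to_datetime(date_cols, format="%Y-%m").to_numpy()
dead = df["DEAD"].to_numpy()

# For each row, the position of the first column whose date is later than DEAD.
# side="right" keeps a column whose date equals DEAD.
first_expired = np.searchsorted(col_dates, dead, side="right")

# Work on a writable copy of the values, then blank out the tail of each row.
values = df[date_cols].to_numpy(dtype=float, copy=True)
for row, i in enumerate(first_expired):
    values[row, i:] = np.nan  # one slice assignment per row, not per cell

result = df.copy()
result[date_cols] = values
print(result)
```

The remaining loop runs once per row and does no date comparisons, since the binary search has already located each cut-off.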
You might get better performance if you create a second dataframe with the same dimensions that could be used like a bitmask, containing 0 and 1 values indicating whether the value in the underlying dataframe should be retained or removed.
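One way the bitmask idea might look in pandas is sketched below. It uses True/False instead of 0/1 and builds the mask with NumPy broadcasting, both of which are my assumptions rather than part of the suggestion above. The sample data is hypothetical and 'ID' is assumed to be the index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the table in the question.
df = pd.DataFrame(
    {
        "DEAD": pd.to_datetime(["2018-11", "2014-01", "2008-11"], format="%Y-%m"),
        "2009-10": [5.4, 0.5, 8.6],
        "2016-10": [6.5, 5.2, 1.3],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

date_cols = list(df.columns.drop("DEAD"))
col_dates = pd.to_datetime(date_cols, format="%Y-%m").to_numpy()
dead = df["DEAD"].to_numpy()

# keep.loc[r, c] is True while the product in row r has not expired at column c's date.
# Broadcasting compares every column date against every DEAD date in one step.
keep = pd.DataFrame(
    col_dates[None, :] <= dead[:, None],
    index=df.index,
    columns=date_cols,
)

result = df.copy()
# where() keeps values where the mask is True and writes NaN elsewhere.
result[date_cols] = df[date_cols].where(keep)
print(result)
```

For 20,000 rows and 400 columns the mask has 8 million boolean entries, which should fit comfortably in memory, and no Python-level loop is needed.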
If this data is stored in a database, you should use SQL directly; it will be faster.