How can I make my loop much faster/better for this particular problem?
Python beginner here.
Here is my problem: I have a csv file with roughly 3200 rows and 660 columns. The rows are filled with either 0s, 1s, or 50s.
I need to update the newly created column 'answer' by these requirements:
So, for example, the row [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1] should have a new value at the end of it as '3' because we found three 1s before finding a 50.
Here's my code:
df_numRows = len(df.values)
df_numCols = len(df.columns)
for row in range(df_numRows):
    df_sum = 0
    for col in range(df_numCols):
        if '50' not in df.values[row]:
            df.at[row, 'answer'] = '0'
        elif df.values[row][col] == '0':
            continue
        elif df.values[row][col] == '1':
            df_sum += 1
            df.at[row, 'answer'] = df_sum
        elif df.values[row][col] == '50':
            break
I wrote this nested for loop to iterate through my Pandas dataframe but it seems to take a VERY long time to run.
I ran this piece of code on the same dataset but with only 100 rows x 660 columns and it took about 1.5 minutes. However, when I tried to run it on the entire dataset, it ran for about 2.5 hours and I shut it down because I thought it had taken too long.
How can I make my code more efficient/faster/better? I would love any help at all from you guys, and I apologize in advance if this is an easy question but I am just getting started in Python!
Thanks guys!
Just use cumprod. Comparing against 50 with ne gives a Boolean DataFrame, and once the first 50 is found the cumulative product turns every value from that point on into 0. We then use this mask to filter the original df and take the sum.
df=pd.DataFrame({'A':[1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1] })
df.mul(df.ne(50).cumprod()).sum()
Out[35]:
A 3
dtype: int64
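The example above works down a single column. For the layout in the question (one record per row), a minimal sketch of the same idea applied along axis=1 could look like the following. The rule that rows without any 50 get 0 is taken from the question's code, and the names mask, has_50 and answer are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],     # no 50
    [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1],    # one 50
    [1, 50, 0, 0, 1, 1, 50, 0, 0, 0, 1],   # two 50s
])

# 1 until the first 50 in each row, 0 from that point on
mask = df.ne(50).astype(int).cumprod(axis=1)

# Sum of the values that come before the first 50 in each row
answer = df.mul(mask).sum(axis=1)

# Assumption from the question's code: rows without any 50 get 0
has_50 = df.eq(50).any(axis=1)
answer = answer.where(has_50, 0)

print(answer.tolist())  # [0, 3, 1]
```

If the values were read from the csv as strings, they would need converting to integers first (for example with df.astype(int)) before this comparison works.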
df = pd.DataFrame([
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],       # No 50s
    [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1],      # One 50
    [1, 50, 0, 0, 1, 50, 50, 0, 0, 0, 1],    # Three 50s but 2 are consecutive
    [1, 50, 0, 0, 1, 1, 50, 0, 0, 0, 1],     # Two 50s
])
df
   0   1  2  3  4   5   6  7  8  9  10
0  1   0  0  0  1   1   0  0  0  0   1
1  1   0  0  0  1   1  50  0  0  0   1
2  1  50  0  0  1  50  50  0  0  0   1
3  1  50  0  0  1   1  50  0  0  0   1
logical_and with its accumulate method

np.logical_and will take the and operator and apply it to a group of booleans. The accumulate part says to keep applying it and, as we go, keep track of the running and of all prior booleans. By specifying axis=1 I'm saying to do this for each row. This returns an array of booleans where each row is True until we hit the first value of 50. I then check whether a row has no 50 at all with all(1), and negate that so those rows are zeroed out. The final multiplication gives the sum of all values prior to the first 50 for each row.
d = np.logical_and.accumulate(df.ne(50), axis=1)
df.mul(d).mul(~d.all(1), 0).sum(1)
0 0
1 3
2 1
3 1
dtype: int64
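To see what accumulate is doing, here is a small sketch on a single made-up row (the array values and variable names are only for illustration):

```python
import numpy as np

row = np.array([1, 0, 1, 50, 1, 50, 0])

# True wherever the value is not 50
not_fifty = row != 50
print(not_fifty.tolist())  # [True, True, True, False, True, False, True]

# Running "and": stays True only until the first False (the first 50)
running = np.logical_and.accumulate(not_fifty)
print(running.tolist())    # [True, True, True, False, False, False, False]

# Multiplying by the mask keeps only the values before the first 50
count = int((row * running).sum())
print(count)               # 2
```

The 1 that appears after the first 50 is ignored because the running and can never become True again once it has seen a False.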
Combine to get new column
d = np.logical_and.accumulate(df.ne(50), axis=1)
df.assign(answer=df.mul(d).mul(~d.all(1), 0).sum(1))
   0   1  2  3  4   5   6  7  8  9  10  answer
0  1   0  0  0  1   1   0  0  0  0   1       0
1  1   0  0  0  1   1  50  0  0  0   1       3
2  1  50  0  0  1  50  50  0  0  0   1       1
3  1  50  0  0  1   1  50  0  0  0   1       1
If you want to go full-blown NumPy
v = df.values
a = np.logical_and.accumulate(v != 50, axis=1)
df.assign(answer=(v * (a & ~a.all(1, keepdims=True))).sum(1))
   0   1  2  3  4   5   6  7  8  9  10  answer
0  1   0  0  0  1   1   0  0  0  0   1       0
1  1   0  0  0  1   1  50  0  0  0   1       3
2  1  50  0  0  1  50  50  0  0  0   1       1
3  1  50  0  0  1   1  50  0  0  0   1       1
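If you want some reassurance that the vectorized version matches the rule in the question, one option is to wrap it in a function and compare it with a plain Python loop on random data. This is only a sketch: ones_before_first_50 and reference are hypothetical helper names, and the random frame is made up for the check.

```python
import numpy as np
import pandas as pd

def ones_before_first_50(frame):
    """Hypothetical helper: per-row sum of values before the first 50, 0 if no 50."""
    v = frame.to_numpy()
    before = np.logical_and.accumulate(v != 50, axis=1)  # True until the first 50
    has_50 = ~before.all(axis=1, keepdims=True)          # False for rows with no 50
    return (v * (before & has_50)).sum(axis=1)

def reference(row):
    """Plain loop over one row, used only to cross-check the result."""
    total = 0
    for value in row:
        if value == 50:
            return total
        total += value
    return 0  # no 50 found in this row

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice([0, 1, 50], size=(200, 30)))

fast = ones_before_first_50(df)
slow = np.array([reference(r) for r in df.to_numpy()])
print((fast == slow).all())
```

The same function can then be used as df.assign(answer=ones_before_first_50(df)) on the full 3200 x 660 frame.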
Please try this logic and let me know if this helps.
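import numpy as np

df_numRows = len(df.values)
for row in range(df_numRows):
    rowValues = df.iloc[row].values
    try:
        # Position of the first 50 in this row (IndexError if there is none)
        indexOf50 = np.argwhere(rowValues == 50)[0][0]
        # Everything before the first 50 is 0 or 1, so the sum counts the 1s
        colArrayTill50 = rowValues[:indexOf50]
        numberOfOne = colArrayTill50.sum()
    except IndexError:
        # No 50 in this row
        numberOfOne = 0
    print(numberOfOne)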
This solves it, though it is a bit robust:
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.choice([0, 1, 50], (3200,660)))
data = df.values
idxs = [np.where(d == 50) for d in data]
sums = [sum(d[:i[0][0]]) if i[0].size else 0 for d, i in zip(data, idxs)]
data = np.column_stack((data, sums))
df = df.assign(answer=sums)
df.head()
# 0 1 2 3 4 5 6 7 8 9 ... 651 652 653 654 655 \
#0 1 0 0 1 1 0 0 1 0 1 ... 50 50 1 1 0
#1 1 0 50 1 50 50 0 1 1 50 ... 1 0 1 0 0
#2 50 0 1 0 1 50 1 50 0 50 ... 0 50 1 50 50
#3 0 1 0 50 1 0 0 50 1 0 ... 1 1 0 1 1
#4 1 50 1 1 1 1 0 50 50 1 ... 0 1 0 1 0
#
# 656 657 658 659 answer
#0 0 0 1 0 5
#1 1 50 0 50 1
#2 50 1 1 50 0
#3 0 50 1 50 1
#4 0 50 0 50 1