How can I make my loop much faster/better for this particular problem?

Python beginner here.

Here is my problem: I have a CSV file with roughly 3200 rows and 660 columns. The rows are filled with either 0s, 1s, or 50s.

I need to update the newly created column 'answer' according to these requirements:

  1. It should be the sum of 1s in that row that occur before a '50' occurs.
  2. If there is no '50' in that row, just set the last column to zero.

So, for example, the row [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1] should get a new value of '3' at the end, because we found three 1s before finding a 50.

Here's my code:

df_numRows = len(df.values)
df_numCols = len(df.columns)

for row in range(df_numRows):
    df_sum = 0
    for col in range(df_numCols):
        if '50' not in df.values[row]:
            df.at[row, 'answer'] = '0'
        elif df.values[row][col] == '0':
            continue
        elif df.values[row][col] == '1':
            df_sum += 1
            df.at[row, 'answer'] = df_sum
        elif df.values[row][col] == '50':
            break

I wrote this nested for loop to iterate through my Pandas dataframe, but it seems to take a VERY long time to run.

I ran this code on a subset of the same dataset with only 100 rows x 660 columns and it took about 1.5 minutes. When I tried to run it on the entire dataset, it ran for about 2.5 hours before I shut it down because it was taking too long.

How can I make my code more efficient/faster/better? I would love any help at all from you guys, and I apologize in advance if this is an easy question, but I am just getting started in Python!

Thanks guys!

Take the cumprod of the Boolean mask df.ne(50): once a 50 is found, every value from there on becomes 0. Then multiply the original df by this mask and take the sum.

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1]})
df.mul(df.ne(50).cumprod()).sum()
Out[35]: 
A    3
dtype: int64
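
The snippet above works down a single column and does not zero out a column that contains no 50. A minimal sketch of the same cumprod idea applied per row, assuming the data has already been converted to integers (the names mask and has_50 are illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame([
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],    # no 50 in this row
    [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1],   # one 50
])

# 1 for every position before the first 50 in a row, 0 from the first 50 onwards
mask = df.ne(50).astype(int).cumprod(axis=1)

# True for rows that contain at least one 50
has_50 = df.eq(50).any(axis=1)

# Sum the values before the first 50; rows without a 50 get 0
answer = df.mul(mask).sum(axis=1).where(has_50, 0)
print(answer.tolist())  # [0, 3]
```

The where step is what implements the "no 50 means zero" rule; without it, the first row would report the sum of the whole row.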

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],    # No 50s
    [1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1],   # One 50
    [1, 50, 0, 0, 1, 50, 50, 0, 0, 0, 1], # Three 50s but 2 are consecutive
    [1, 50, 0, 0, 1, 1, 50, 0, 0, 0, 1],  # Two 50s
])

df

   0   1   2   3   4   5   6   7   8   9   10
0   1   0   0   0   1   1   0   0   0   0   1
1   1   0   0   0   1   1  50   0   0   0   1
2   1  50   0   0   1  50  50   0   0   0   1
3   1  50   0   0   1   1  50   0   0   0   1

Use logical_and with its accumulate method

np.logical_and applies the and operator to a group of booleans. The accumulate part keeps applying it along the way, recording at each position the and of all booleans seen so far. Specifying axis=1 does this for each row. The result is an array of booleans in which each row is True until we hit a value of 50 and False from there on. I then use all(1) to find the rows that contain no 50 at all. Multiplying by both masks and summing gives, for each row, the sum of all values before the first 50, and 0 for rows without a 50.

d = np.logical_and.accumulate(df.ne(50), axis=1)

df.mul(d).mul(~d.all(1), 0).sum(1)

0    0
1    3
2    1
3    1
dtype: int64
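
To see what the accumulate step produces on its own, here is a small sketch on a single row taken from the question (the name before_50 is illustrative):

```python
import numpy as np

row = np.array([1, 0, 0, 0, 1, 1, 50, 0, 0, 0, 1])

# True until the first 50 is reached, False from that position onwards
before_50 = np.logical_and.accumulate(row != 50)
print(before_50.astype(int))   # [1 1 1 1 1 1 0 0 0 0 0]

# Summing only the positions before the first 50 gives the answer for this row
print(row[before_50].sum())    # 3
```

Note that the trailing 1 after the 50 is excluded, because once the running and has turned False it stays False.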

Combine to get new column

d = np.logical_and.accumulate(df.ne(50), axis=1)

df.assign(answer=df.mul(d).mul(~d.all(1), 0).sum(1))

   0   1  2  3  4   5   6  7  8  9  10  answer
0  1   0  0  0  1   1   0  0  0  0   1       0
1  1   0  0  0  1   1  50  0  0  0   1       3
2  1  50  0  0  1  50  50  0  0  0   1       1
3  1  50  0  0  1   1  50  0  0  0   1       1

If you want to go full-blown NumPy

v = df.values
a = np.logical_and.accumulate(v != 50, axis=1)
df.assign(answer=(v * (a & ~a.all(1, keepdims=True))).sum(1))

   0   1  2  3  4   5   6  7  8  9  10  answer
0  1   0  0  0  1   1   0  0  0  0   1       0
1  1   0  0  0  1   1  50  0  0  0   1       3
2  1  50  0  0  1  50  50  0  0  0   1       1
3  1  50  0  0  1   1  50  0  0  0   1       1
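
One way to gain confidence in the vectorized version is to compare it against a plain per-row loop on random data. The following is only a sketch: the helper names answer_vectorized and answer_loop, the seed, and the sampling probabilities are assumptions chosen so that some rows contain no 50.

```python
import numpy as np
import pandas as pd

def answer_vectorized(df):
    # Same expression as the NumPy version above, wrapped in a function
    v = df.to_numpy()
    a = np.logical_and.accumulate(v != 50, axis=1)
    return (v * (a & ~a.all(1, keepdims=True))).sum(1)

def answer_loop(df):
    # Straightforward reference: walk each row until the first 50
    out = []
    for row in df.to_numpy():
        total, found = 0, False
        for x in row:
            if x == 50:
                found = True
                break
            total += x
        out.append(total if found else 0)
    return np.array(out)

rng = np.random.default_rng(0)
# 50 is made rare so that a good share of rows contain no 50 at all
data = rng.choice([0, 1, 50], size=(200, 30), p=[0.6, 0.38, 0.02])
df = pd.DataFrame(data)

assert (answer_vectorized(df) == answer_loop(df)).all()
```

If the assertion passes, both implementations agree on every row of the sample, including the rows without a 50.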

Please try this logic and let me know if this helps.

import numpy as np

for row in range(len(df)):
    values = df.iloc[row].to_numpy()
    # Positions of every 50 in this row (empty if there is none)
    positionsOf50 = np.flatnonzero(values == 50)

    if positionsOf50.size:
        # Sum the values that come before the first 50
        numberOfOne = values[:positionsOf50[0]].sum()
    else:
        # No 50 in this row
        numberOfOne = 0

    print(numberOfOne)

This solves it, though it is a bit brute-force:

import pandas as pd
import numpy as np

np.random.seed(1)

df = pd.DataFrame(np.random.choice([0, 1, 50], (3200,660)))

data = df.values
idxs = [np.where(d == 50) for d in data]
sums = [sum(d[:i[0][0]]) if i[0].size else 0 for d, i in zip(data, idxs)]

data = np.column_stack((data, sums))

df = df.assign(answer=sums)

df.head()

#    0   1   2   3   4   5  6   7   8   9   ...    651  652  653  654  655  \
#0   1   0   0   1   1   0  0   1   0   1   ...     50   50    1    1    0   
#1   1   0  50   1  50  50  0   1   1  50   ...      1    0    1    0    0   
#2  50   0   1   0   1  50  1  50   0  50   ...      0   50    1   50   50   
#3   0   1   0  50   1   0  0  50   1   0   ...      1    1    0    1    1   
#4   1  50   1   1   1   1  0  50  50   1   ...      0    1    0    1    0   
#
#   656  657  658  659  answer  
#0    0    0    1    0       5  
#1    1   50    0   50       1  
#2   50    1    1   50       0  
#3    0   50    1   50       1  
#4    0   50    0   50       1  
