
Without iterating row by row through a dataframe, which takes ages, how can I check that a number of rows all meet a condition?

I want to do the following, but I realise that this kind of iterative method is very slow with large DataFrames. What other solutions are there to this problem?

for i in range(len(df)):
    for n in range(1001):
        if df["Close"][(i+n)] > df["MA"][i+n]:
            df["Strategy 1"][i] = "Buy"

What I would expect the code above to do is:

Sub in n from 0 to 1,000 into line 3, with an i of 0, and then if the condition in line 3 held for each n in the range of 0 to 1,000 then it would go on and carry out the operation in line 4.

After this it would take i of 1 and then sub in n from 0 to 1,000 into line 3, and if the condition held for all n in that range then it would carry out line 4.

After this it would take i of 2 and then sub in n from 0 to 1,000 into line 3, and if the condition held for all n in that range then it would carry out line 4.

After this it would take i of 3 and then sub in n from 0 to 1,000 into line 3, and if the condition held for all n in that range then it would carry out line 4.

...

After this it would take the last i, len(df) - 1, and then sub in n from 0 to 1,000 into line 3, and if the condition held for all n in that range then it would carry out line 4.

Regardless of whether the code presented above does what I'd expect or not, is there a much faster way to compute this for very large, multi-gigabyte DataFrames?

Using the .apply function would be faster. For a general example...

import pandas as pd

# only required to create the test dataframe in this example
import numpy as np

# create a dataframe for testing using the numpy import above
df = pd.DataFrame(np.random.randint(100,size=(10, )),columns=['A'])

# create a new column based on column 'A' but moving the column 'across and up'
df['NextRow'] = df['A'].shift(-1)

# create a function to do something, anything, and return that thing
def doMyThingINeedToDo(num, numNext):
#     'num' is going to be the value of whatever is in column 'A' per row 
#     as the .apply function runs below and 'numNext' is plus one.
    if num >= 50 and numNext >= 75:
        return 'Yes'
    else:
        return '...No...'

# create a new column called 'NewColumnName' based on the existing column 'A' and apply the
# function above, whatever it does, to the frame per row.
df['NewColumnName'] = df.apply(lambda row : doMyThingINeedToDo(row['A'], row['NextRow']), axis = 1)

# output the frame and notice the new column
print(df)

Outputs:

    A  NextRow NewColumnName
0  67     84.0           Yes
1  84     33.0      ...No...
2  33     59.0      ...No...
3  59     85.0           Yes
4  85     39.0      ...No...
5  39     81.0      ...No...
6  81     76.0           Yes
7  76     83.0           Yes
8  83     60.0      ...No...
9  60      NaN      ...No...

The main point is that you can separate what exactly you want to do per row and contain it in a function (that can be tweaked and updated as required) and just call that function for all rows on a frame when required.
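For a condition as simple as the one in this example, the same column can also be built without calling a Python function once per row. The following is a minimal sketch, assuming the same `A` / `NextRow` layout as above; the seeded generator and the name `do_my_thing` are illustrative and not part of the original answer. It builds the column both ways and checks that they agree:

```python
import numpy as np
import pandas as pd

# Illustrative test frame (seeded so the run is repeatable)
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(0, 100, size=10)})
df["NextRow"] = df["A"].shift(-1)

def do_my_thing(num, num_next):
    # Same rule as the per-row function in the answer above
    return "Yes" if num >= 50 and num_next >= 75 else "...No..."

# Row-by-row version using .apply
by_apply = df.apply(lambda row: do_my_thing(row["A"], row["NextRow"]), axis=1)

# Column-wise version: both comparisons run on whole columns at once.
# A NaN in 'NextRow' compares as False, so the last row becomes '...No...'.
by_where = np.where((df["A"] >= 50) & (df["NextRow"] >= 75), "Yes", "...No...")

df["NewColumnName"] = by_where
print(df)
```

The per-row function remains the more flexible option when the logic cannot be written as column comparisons.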

You can accomplish what you are attempting with only your close data, calculating the MA and the 1000 conditions on the fly via vectorization. Maybe try this:

import numpy as np

ma_window = 1000 
n = 1000 

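# the inner comparison is True where close is above its moving average;
# the outer rolling mean equals 1 only when the last n comparisons were all True
df['Strategy 1'] = np.where(
    (df['close'] > df['close'].rolling(window=ma_window).mean()).rolling(window=n).mean() == 1,
    'buy', '')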

Play around with this and see if it might work for you.
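Note that a plain rolling window looks backwards from each row, while the loop in the question looks forwards (rows i, i+1, ..., i+n). One way to get the forward-looking variant is to reverse the condition, take a rolling minimum, and reverse back. This is only a sketch under that reading of the question; the helper name `forward_all_true` and the tiny sample frame are made up for illustration:

```python
import numpy as np
import pandas as pd

def forward_all_true(cond: pd.Series, n: int) -> pd.Series:
    """True at row i only if cond holds at rows i, i+1, ..., i+n-1."""
    # Reversing turns a trailing window into a forward-looking one.
    reversed_cond = cond.astype(int)[::-1]
    # The rolling minimum is 1 only if every value in the window is 1;
    # incomplete windows give NaN, which compares as False below.
    window_min = reversed_cond.rolling(window=n).min()[::-1]
    return window_min == 1

# Tiny illustrative frame; with real data n would be 1001 to cover
# n = 0..1000 as in the question's range(1001).
df = pd.DataFrame({
    "Close": [2.0, 2.0, 2.0, 1.0, 2.0, 2.0],
    "MA":    [1.5, 1.5, 1.5, 1.5, 1.5, 1.5],
})
mask = forward_all_true(df["Close"] > df["MA"], n=3)
df["Strategy 1"] = np.where(mask, "Buy", "")
print(df)
```

Rows near the end of the frame, where fewer than n rows remain, are left unmarked rather than raising an indexing error as the original loop would.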


First, let me state how I understand your rule. As near as I can tell, you are trying to get a value of "Buy" in the "Strategy 1" column of the df only if there are 1000 consecutive cases where MA was greater than the Close preceding that time. I think you can get that done simply by using a rolling sum on the comparison:

import pandas as pd
import numpy as np

# build some repeatable sample data
np.random.seed(1)
df = pd.DataFrame({'close': np.cumsum(np.random.randn(10000))})
df['MA'] = df['close'].rolling(1000).mean()

# Apply strategy
npoints = 1000

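# object dtype so the column can hold the "Buy" strings assigned below
df['Strategy 1'] = pd.Series(np.nan, index=df.index, dtype=object)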
buypoints = (df['MA'] > df['close']).rolling(npoints).sum() == npoints
df.loc[buypoints, "Strategy 1"] = "Buy"

# just for visualisation show where the Buys would be
df['Buypoints'] = buypoints*10
df.plot()

This comes out like this (with the same seed it should look the same on your machine too):

[Figure: sample curve showing the buy points]
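To gain some confidence that the rolling sum matches the intended rule, it can be compared against an explicit loop on a small frame. The following is only a sanity-check sketch: the sizes (200 rows, a 10-point MA, 5 consecutive points) and the names `small`, `fast` and `slow` are arbitrary choices for illustration, not values from the answer:

```python
import numpy as np
import pandas as pd

# Small, repeatable sample so the slow loop finishes instantly
rng = np.random.default_rng(0)
small = pd.DataFrame({"close": np.cumsum(rng.standard_normal(200))})
small["MA"] = small["close"].rolling(10).mean()
npoints = 5

# Same comparison direction as the answer above: MA greater than close
below = small["MA"] > small["close"]

# Vectorised: the rolling sum equals npoints only if all npoints rows are True
fast = (below.astype(int).rolling(npoints).sum() == npoints).to_numpy()

# Reference: check every trailing window explicitly
slow = np.zeros(len(small), dtype=bool)
for i in range(npoints - 1, len(small)):
    slow[i] = below.iloc[i - npoints + 1 : i + 1].all()

print("rolling sum matches the explicit loop:", bool((fast == slow).all()))
```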

Iteration is a last resort with Pandas.

The solution you are looking for comes from numpy:

import numpy as np
df["Strategy 1"] = np.where(df["Close"] > df["MA"], "Buy", df["Strategy 1"])
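As a small illustration of how this behaves, the sketch below runs the same `np.where` call on a three-row frame. The sample values are invented, and the "Strategy 1" column is created first because the call uses it as the fallback value:

```python
import numpy as np
import pandas as pd

# Invented sample values, just to show the mechanics
df = pd.DataFrame({"Close": [10.0, 9.0, 12.0], "MA": [9.5, 9.5, 11.0]})

# The column must already exist, since it supplies the "else" values
df["Strategy 1"] = ""

# Rows where Close > MA become "Buy"; all other rows keep their old value
df["Strategy 1"] = np.where(df["Close"] > df["MA"], "Buy", df["Strategy 1"])

print(df["Strategy 1"].tolist())  # ['Buy', '', 'Buy']
```

Used this way, each row is judged on its own comparison only. Requiring the condition to hold over a run of rows needs a rolling step on the comparison before the `np.where` call.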

