
Using numpy broadcasting / vectorization to build a new array from other arrays

I am working on a stock ranking factor for a Quantopian model. They recommend avoiding the use of loops in custom factors. However, I am not exactly sure how I would avoid the loops in this case.

def GainPctInd(offset=0, nbars=2):  
    class GainPctIndFact(CustomFactor):  
        window_length = nbars + offset  
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]  
        def compute(self, today, assets, out, close, industries):
            # Compute the gain percents for all stocks
            asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100  

            # For each industry, build a list of the per-stock gains over the given window  
            gains_by_industry = {}  
            for i in range(industries.shape[1]):  # one iteration per asset (column)
                industry = industries[0,i]  
                if industry in gains_by_industry:  
                    gains_by_industry[industry].append(asset_gainpct[i])  
                else:  
                    gains_by_industry[industry] = [asset_gainpct[i]]

            # Loop through each stock's industry and compute a mean value for that  
            # industry (caching it for reuse) and return that industry mean for  
            # that stock  
            mean_cache = {}  
            for i in range(industries.shape[1]):  # one iteration per asset (column)
                industry = industries[0,i]  
                if not industry in mean_cache:  
                    mean_cache[industry] = np.mean(gains_by_industry[industry])  
                out[i] = mean_cache[industry]  
    return GainPctIndFact()

When the compute function is called, assets is a 1-d array of the asset names, close is a multi-dimensional numpy array holding window_length close prices for each asset listed in assets (using the same index numbers), and industries is the list of industry codes associated with each asset in a 1-d array. I know numpy vectorizes the computation of the gainpct in this line:

asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100

The result is that asset_gainpct is a 1-d array of the computed gains for every stock. The part I am unclear about is how I would use numpy to finish the calculations without manually looping through the arrays.
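To make the shapes concrete, here is a minimal sketch of that vectorized line on made-up data. The prices and the 3x4 shape are assumptions for illustration only; in the real factor, close has window_length rows and one column per asset.

```python
import numpy as np

# Hypothetical prices: 3 bars (rows, oldest first) for 4 assets (columns).
close = np.array([
    [10.0, 20.0, 50.0, 8.0],
    [11.0, 19.0, 50.0, 8.4],
    [12.0, 18.0, 55.0, 8.0],
])

# close[-1] and close[0] are both 1-d rows of length 4, so the arithmetic
# runs element-wise across all assets at once, with no Python loop.
asset_gainpct = (close[-1] - close[0]) / close[0] * 100

print(asset_gainpct.shape)  # one gain per asset
```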

Basically, what I need to do is aggregate the gains of all the stocks by the industry they are in, compute the average of those values per industry, and then de-aggregate the averages back out to the full list of assets.
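The three steps above (aggregate, average, de-aggregate) can be sketched without a Python loop using np.unique with return_inverse and np.bincount. This is one possible approach rather than the method the post settles on; the function name industry_mean and the sample data are hypothetical, and the sketch assumes 1-d inputs with no NaN values.

```python
import numpy as np

def industry_mean(gains, codes):
    """Return, for every asset, the mean gain of that asset's industry.

    Assumes `gains` and `codes` are 1-d arrays of equal length, no NaNs.
    """
    # inverse[i] is the position of codes[i] within unique_codes,
    # i.e. a dense group id in the range 0..n_groups-1.
    unique_codes, inverse = np.unique(codes, return_inverse=True)
    # Aggregate: per-industry sum of gains and per-industry member count.
    sums = np.bincount(inverse, weights=gains, minlength=unique_codes.size)
    counts = np.bincount(inverse, minlength=unique_codes.size)
    # Average, then de-aggregate by indexing the means with the group ids.
    return (sums / counts)[inverse]

# Made-up example: two industries (10 and 20) across five assets.
gains = np.array([1.0, 2.0, 3.0, 4.0, 6.0])
codes = np.array([10, 20, 10, 20, 20])
result = industry_mean(gains, codes)
```

Indexing the per-industry means with inverse is what maps each industry value back onto the full list of assets.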

Right now, I am looping through all the industries and pushing the gain percentages into an industry-indexed dictionary storing a list of the gains per industry. Then I am calculating the mean for those lists and performing a reverse-industry lookup to map the industry gains to each asset based on their industry.

It seems to me that this should be possible using some highly optimized traversals of the arrays in numpy, but I can't seem to figure it out. I had never used numpy before today, and I'm fairly new to Python, so that probably doesn't help.


UPDATE:

I modified my industry code loop to try to handle the computation with a masked array, using the industry array to mask the asset_gainpct array like this:

    # For each industry, build a list of the per-stock gains over the given window 
    gains_by_industry = {}
    for industry in industries.T:
        masked = ma.masked_where(industries != industry[0], asset_gainpct)
        np.nanmean(masked, out=out)

It gave me the following error:

IndexError: Inconsistant shape between the condition and the input (got (20, 8412) and (8412,))

Also, as a side note, industries is coming in as a 20x8412 array because the window_length is set to 20. The extra values are the industry codes for the stocks on the previous days; they don't typically change, so they can be ignored. I am now iterating over industries.T (the transpose of industries), which means industry is a 20-element array with the same industry code in each element. Hence, I only need element 0.

The error above is coming from the ma.masked_where() call. The industries array is 20x8412, so I presume asset_gainpct is the one listed as (8412,). How do I make these compatible for this call to work?
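One possible way to make the shapes agree, sketched with small made-up arrays (the 3x4 industries array and the values below are illustrative assumptions, not the real 20x8412 data): select a single row of industries so that the condition is 1-d, just like asset_gainpct.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical stand-ins: 3 bars x 4 assets of industry codes,
# and one gain per asset.
industries = np.tile(np.array([10, 20, 10, 30]), (3, 1))
asset_gainpct = np.array([1.0, 2.0, 3.0, 4.0])

# The 2-d condition has shape (3, 4) while the data has shape (4,),
# so masked_where rejects the pair with an IndexError.
raised = False
try:
    ma.masked_where(industries != 10, asset_gainpct)
except IndexError:
    raised = True

# Using one row gives a (4,) condition that matches the data.
latest_codes = industries[-1]
masked = ma.masked_where(latest_codes != 10, asset_gainpct)

# The mean of a masked array ignores the masked-out entries.
mean_for_10 = masked.mean()
```

Note that masked_where hides the entries where the condition is true, which is why the comparison uses != to keep only the industry of interest.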


UPDATE 2:

I have modified the code again, fixing several other issues I ran into. It now looks like this:

    # For each industry, build a list of the per-stock gains over the given window 
    unique_ind = np.unique(industries[0,])
    for industry in unique_ind:
        masked = ma.masked_where(industries[0,] != industry, asset_gainpct)
        mean = np.full_like(masked, np.nanmean(masked), dtype=np.float64, subok=False)
        np.copyto(out, mean, where=masked)

Basically, the new premise here is that I have to build a mean-value filled array of the same size as the number of stocks in my input data and then copy the values into my destination variable (out) while applying my previous mask, so that only the unmasked indexes are filled with the mean value. In addition, I realized that I was iterating over industries more than once in my previous version, so I fixed that, too. However, the copyto() call is yielding this error:

TypeError: Cannot cast array data from dtype('float64') to dtype('bool') according to the rule 'safe'

Obviously, I am doing something wrong, but looking through the docs, I don't see what it is. It looks like it should be copying from mean (which is np.float64 dtype) to out (which I have not previously defined), and it should be using masked as the boolean array for selecting which indexes get copied. Does anyone have any ideas on what the issue is?
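A likely reading of this error, offered as a hedged guess: the where argument of np.copyto expects a boolean array, while masked is a float64 masked array of gains, so numpy refuses to cast it to bool under the 'safe' rule. A minimal sketch with made-up data that passes the boolean condition itself instead:

```python
import numpy as np

# Hypothetical stand-ins for one row of industry codes and the gains.
codes = np.array([10, 20, 10, 30])
asset_gainpct = np.array([1.0, 2.0, 3.0, 4.0])

# In the real factor `out` is supplied by the caller; here it is created
# locally so that the sketch is self-contained.
out = np.full(codes.shape, np.nan)

for industry in np.unique(codes):
    in_industry = codes == industry          # boolean array, shape (4,)
    industry_mean = np.nanmean(asset_gainpct[in_industry])
    # The scalar mean is broadcast, and `where` restricts the write to
    # the positions that belong to this industry.
    np.copyto(out, industry_mean, where=in_industry)
```

Because the scalar mean is broadcast by copyto, the intermediate full_like array is not needed in this variant.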


UPDATE 3:

First, thanks for all the feedback from everyone who contributed.

After much additional digging into this code, I have come up with the following:

def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            num_bars, num_assets = close.shape
            newest_bar_idx = (num_bars - 1) - offset
            oldest_bar_idx = newest_bar_idx - (nbars - 1)

            # Compute the gain percents for all stocks
            asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx]) / close[oldest_bar_idx]) * 100

            # For each industry, compute the mean gain and assign it to that industry's stocks
            unique_ind = np.unique(industries[0,])
            for industry in unique_ind:
                ind_view = asset_gainpct[industries[0,] == industry]
                ind_mean = np.nanmean(ind_view)
                out[industries[0,] == industry] = ind_mean
    return GainPctIndFact()

For some reason, the calculations based on the masked views were not yielding correct results. Further, getting those results into the out variable was not working. Somewhere along the line, I stumbled on a post about how numpy (by default) creates views of arrays instead of copies when you do a slice, and that you can do a sparse slice based on a Boolean condition. When running a calculation on such a view, it looks like a full array as far as the calculation is concerned, but all the values are still actually in the base array. It's sort of like having an array of pointers, where the calculations happen on the data the pointers point to. Similarly, you can assign a value to all nodes in your sparse view and have it update the data for all of them. This actually simplified the logic considerably.
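One nuance on the paragraph above, sketched with a made-up array: a basic slice such as a[1:4] is a view, but indexing with a Boolean array on the right-hand side returns a copy. Assigning through a Boolean index, as in out[mask] = value, still writes into the original array, which is why the loop in the code above works.

```python
import numpy as np

a = np.arange(6, dtype=np.float64)   # [0, 1, 2, 3, 4, 5]
mask = a % 2 == 0                    # True at positions 0, 2, 4

basic = a[1:4]       # basic slice: a view onto the same memory
selected = a[mask]   # Boolean indexing: a new array (a copy)

basic_shares = np.shares_memory(a, basic)
selected_shares = np.shares_memory(a, selected)

# Writing into the copy leaves the base array untouched ...
selected[:] = -1.0
untouched = a.copy()

# ... while assigning through the Boolean index updates `a` in place.
a[mask] = 100.0
```

Here basic_shares is True and selected_shares is False, so the reads in the loop work on temporary copies while the writes go straight into out.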

I would still be interested in any ideas anyone has on how to remove the final loop over the industries and vectorize that process. I am wondering if maybe a map / reduce approach might work, but I am still not familiar enough with numpy to figure out how to do it any more efficiently than this for loop. On the bright side, the remaining loop only has about 140 iterations to go through, versus the two prior loops, which would go through about 8000 each. In addition to that, I am now avoiding the construction of the gains_by_industry and mean_cache dicts and all the data copying that went with them. So it is not just faster, it is also far more memory efficient.
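In the spirit of the map / reduce idea, here is a hedged sketch that replaces the loop with a sort followed by np.add.reduceat, ignoring NaN gains the way np.nanmean does. The function name industry_nanmean and the sample data are hypothetical; the sketch assumes non-empty 1-d inputs and has not been benchmarked against the loop.

```python
import numpy as np

def industry_nanmean(gains, codes):
    """Per-asset industry mean of `gains`, skipping NaN gains.

    Assumes non-empty 1-d arrays of equal length.
    """
    # Sort so that assets of the same industry are contiguous.
    order = np.argsort(codes, kind="stable")
    sorted_codes = codes[order]
    sorted_gains = gains[order]

    # True wherever a new run of equal codes begins.
    is_start = np.r_[True, sorted_codes[1:] != sorted_codes[:-1]]
    starts = np.flatnonzero(is_start)

    # Reduce each run: sum of valid gains and count of valid gains.
    valid = ~np.isnan(sorted_gains)
    sums = np.add.reduceat(np.where(valid, sorted_gains, 0.0), starts)
    counts = np.add.reduceat(valid.astype(np.int64), starts)

    # An industry with no valid gains yields NaN (0 / 0).
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts

    # Map each sorted asset to its run index, then undo the sort.
    group_ids = np.cumsum(is_start) - 1
    result = np.empty(gains.shape, dtype=np.float64)
    result[order] = means[group_ids]
    return result

# Made-up example with one NaN gain in industry 10.
gains = np.array([1.0, np.nan, 3.0, 5.0, 7.0])
codes = np.array([10, 10, 20, 10, 20])
vectorized = industry_nanmean(gains, codes)

# Reference result from the loop-based approach, for comparison.
looped = np.empty_like(gains)
for code in np.unique(codes):
    looped[codes == code] = np.nanmean(gains[codes == code])
```

With only around 140 industries, the loop may already be fast enough, so measuring both versions before switching seems worthwhile.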


UPDATE 4:

Someone gave me a more succinct way to accomplish this, finally eliminating the extra for loop. It basically hides the loop in a Pandas DataFrame groupby, but it more succinctly describes what the desired steps are:

def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                    "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                    "industry_codes": industries[-1]
                 })
            out[:] = df.groupby("industry_codes").transform(np.mean).values.flatten()
    return GainPctIndFact2()

It does not improve the efficiency at all, according to my benchmarks, but it's probably easier to verify correctness. The one problem with their example is that it uses np.mean instead of np.nanmean, and np.nanmean drops the NaN values, resulting in a shape mismatch if you try to use it. To fix the NaN issue, someone else suggested this:

def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                    "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                    "industry_codes": industries[-1]
                 })
            # Rows whose industry code is NaN cannot be grouped; handle them separately.
            nans = np.isnan(df['industry_codes'].values)
            notnan = ~nans
            out[notnan] = df[df['industry_codes'].notnull()].groupby("industry_codes").transform(np.nanmean).values.flatten()
            out[nans] = np.nan
    return GainPctIndFact2()
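As a possible simplification of the groupby step, pandas' own "mean" aggregation skips NaN values by default, so passing the string "mean" to transform avoids np.nanmean altogether. The DataFrame below is made up for illustration, and the sketch assumes every row has a valid industry code; rows with a NaN code would still need the separate handling shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN gain in industry 10.
df = pd.DataFrame({
    "gain": [1.0, np.nan, 3.0, 5.0, 7.0],
    "industry_codes": [10, 10, 20, 10, 20],
})

# transform("mean") computes one NaN-skipping mean per industry and
# broadcasts it back to every row, so the result keeps the input length.
industry_mean = (
    df.groupby("industry_codes")["gain"].transform("mean").to_numpy()
)
```

Selecting the "gain" column before transform returns a Series, so no flatten() call is needed.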

– user36048
