
Using numpy broadcasting / vectorization to build a new array from other arrays

I am working on a stock ranking factor for a Quantopian model. They recommend avoiding the use of loops in custom factors. However, I am not exactly sure how I would avoid the loops in this case.

def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            # Compute the gain percent for every stock
            asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100

            # For each industry, build a list of the per-stock gains over the given window
            gains_by_industry = {}
            for i in range(industries.shape[1]):
                industry = industries[0, i]
                if industry in gains_by_industry:
                    gains_by_industry[industry].append(asset_gainpct[i])
                else:
                    gains_by_industry[industry] = [asset_gainpct[i]]

            # Look up each stock's industry, compute (and cache) the mean gain
            # for that industry, and emit that industry mean for the stock
            mean_cache = {}
            for i in range(industries.shape[1]):
                industry = industries[0, i]
                if industry not in mean_cache:
                    mean_cache[industry] = np.mean(gains_by_industry[industry])
                out[i] = mean_cache[industry]
    return GainPctIndFact()

When the compute function is called, assets is a 1-d array of the asset names, close is a multi-dimensional numpy array where there are window_length close prices for each asset listed in assets (using the same index numbers), and industries is the list of industry codes associated with each asset in a 1-d array. I know numpy vectorizes the computation of the gainpct in this line:

asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100

The result is that asset_gainpct is a 1-d array of all the computed gains for every stock. The part I am unclear about is how I would use numpy to finish the calculations without me manually looping through the arrays.
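As a toy illustration (made-up prices, not Quantopian data): slicing a single bar out of a `(window_length, num_assets)` array gives a 1-d row, and the arithmetic then applies element-wise across all assets at once:

```python
import numpy as np

# Stand-in for the (window_length, num_assets) close-price array
close = np.array([[10.0, 20.0, 40.0],   # oldest bar
                  [11.0, 22.0, 38.0],
                  [12.0, 25.0, 50.0]])  # newest bar

offset = 0
# close[-1] and close[offset] are both 1-d rows of length num_assets,
# so subtraction and division happen element-wise per asset.
asset_gainpct = (close[-1] - close[offset]) / close[offset] * 100
print(asset_gainpct)  # one gain percentage per asset
```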

Basically, what I need to do is aggregate all of the gains for all of the stocks based on the industry they are in, then compute the average of those values, and then de-aggregate the averages back out to the full list of assets.
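That aggregate / average / de-aggregate pattern can be sketched loop-free in plain numpy with `np.unique(..., return_inverse=True)` and `np.bincount` (the gains and industry codes below are made-up toy values):

```python
import numpy as np

# Hypothetical toy inputs: per-asset gains and each asset's industry code
asset_gainpct = np.array([10.0, 30.0, 5.0, 15.0])
industry = np.array([100, 100, 200, 200])

# 1) Map each industry code to a dense group id (inverse indices)
codes, inv = np.unique(industry, return_inverse=True)
# 2) Per-group sums and counts give per-group means
group_mean = np.bincount(inv, weights=asset_gainpct) / np.bincount(inv)
# 3) "De-aggregate": index the group means back out, one value per asset
out = group_mean[inv]
print(out)  # [20. 20. 10. 10.]
```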

Right now, I am looping through all the industries and pushing the gain percentages into an industry-indexed dictionary that stores a list of gains per industry. Then I calculate the mean of each list and perform a reverse industry lookup to map each industry's mean gain back to every asset in that industry.

It seems to me like this should be possible to do using some highly optimized traversals of the arrays in numpy, but I can't seem to figure it out. I've never used numpy before today, and I'm fairly new to Python, so that probably doesn't help.


UPDATE:

I modified my industry-code loop to try to handle the computation with a masked array, using the industries array to mask the asset_gainpct array, like so:

    # For each industry, build a list of the per-stock gains over the given window 
    gains_by_industry = {}
    for industry in industries.T:
        masked = ma.masked_where(industries != industry[0], asset_gainpct)
        np.nanmean(masked, out=out)

It gave me the following error:

IndexError: Inconsistant shape between the condition and the input (got (20, 8412) and (8412,))

Also, as a side note, industries is coming in as a 20x8412 array because the window_length is set to 20. The extra values are the industry codes for the stocks on the previous days, except they don't typically change, so they can be ignored. I am now iterating over industries.T (the transpose of industries) which means industry is a 20-element array with the same industry code in each element. Hence, I only need element 0.

The error above is coming from the ma.masked_where() call. The industries array is 20x8412 so I presume asset_gainpct is the one listed as (8412,). How do I make these compatible for this call to work?
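One way to make the shapes line up, sketched on toy data, is to pass only a single row of the industry array as the condition, so that it matches the 1-d gains array:

```python
import numpy as np
import numpy.ma as ma

# Toy stand-ins: a (window_length, num_assets) industry array and 1-d gains
industries = np.array([[100, 100, 200],
                       [100, 100, 200]])   # 2 bars, 3 assets
asset_gainpct = np.array([10.0, 30.0, 5.0])

industry = 100
# Using one row makes the condition (num_assets,), matching asset_gainpct,
# so masked_where no longer raises the shape-mismatch IndexError.
masked = ma.masked_where(industries[0] != industry, asset_gainpct)
print(masked.mean())  # mean over only the unmasked (industry 100) assets
```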


UPDATE 2:

I have modified the code again, fixing several other issues I have run into. It now looks like this:

    # For each industry, build a list of the per-stock gains over the given window 
    unique_ind = np.unique(industries[0,])
    for industry in unique_ind:
        masked = ma.masked_where(industries[0,] != industry, asset_gainpct)
        mean = np.full_like(masked, np.nanmean(masked), dtype=np.float64, subok=False)
        np.copyto(out, mean, where=masked)

Basically, the new premise here is that I have to build a mean-filled array the same size as the number of stocks in my input data, then copy those values into my destination variable (out) while applying my previous mask, so that only the unmasked indexes receive the mean value. In addition, I realized that I was iterating over industries more than once in my previous incarnation, so I fixed that, too. However, the copyto() call is yielding this error:

TypeError: Cannot cast array data from dtype('float64') to dtype('bool') according to the rule 'safe'

Obviously, I am doing something wrong, but looking through the docs, I don't see what it is. This looks like it should copy from mean (which has np.float64 dtype) to out (which I have not previously defined), using masked as the Boolean array for selecting which indexes get copied. Does anyone have any ideas on what the issue is?
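A likely culprit is that masked is a float masked array, while np.copyto's `where` argument expects a plain Boolean array, hence the float64-to-bool cast error. A minimal sketch on toy data (not the original factor) of the same call with a genuine Boolean selector:

```python
import numpy as np

asset_gainpct = np.array([10.0, 30.0, 5.0])
industries0 = np.array([100, 100, 200])   # latest row of industry codes
out = np.empty_like(asset_gainpct)

industry = 100
sel = industries0 == industry                            # genuine bool array
mean = np.full_like(asset_gainpct, asset_gainpct[sel].mean())
# `where` must be boolean; passing the float masked array instead is what
# triggers "Cannot cast array data from dtype('float64') to dtype('bool')".
np.copyto(out, mean, where=sel)
```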


UPDATE 3:

First, thanks for all the feedback from everyone who contributed.

After much additional digging into this code, I have come up with the following:

def GainPctInd(offset=0, nbars=2):
    class GainPctIndFact(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            num_bars, num_assets = close.shape
            newest_bar_idx = (num_bars - 1) - offset
            oldest_bar_idx = newest_bar_idx - (nbars - 1)

            # Compute the gain percents for all stocks
            asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx]) / close[oldest_bar_idx]) * 100

            # For each industry, build a list of the per-stock gains over the given window 
            unique_ind = np.unique(industries[0,])
            for industry in unique_ind:
                ind_view = asset_gainpct[industries[0,] == industry]
                ind_mean = np.nanmean(ind_view)
                out[industries[0,] == industry] = ind_mean
    return GainPctIndFact()

For some reason, the calculations based on the masked views were not yielding correct results, and getting those results into the out variable was not working either. Somewhere along the line, I stumbled on a post about how numpy slicing (by default) creates views of arrays instead of copies, and about indexing with a Boolean condition. Strictly speaking, a Boolean-indexed read like asset_gainpct[mask] returns a copy, but a Boolean-indexed assignment like out[mask] = value writes directly into the base array. Running a calculation on such a selection looks like operating on a full array as far as the calculation is concerned, and assigning through the mask updates the underlying data for every selected element, a bit like an array of pointers into the base data. This simplified the logic considerably.
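The read-then-assign behavior is easy to see on a toy example: reading with a Boolean mask yields just the selected elements, and assigning through the same mask writes back into the base array in place:

```python
import numpy as np

out = np.zeros(4)
gain = np.array([10.0, 30.0, 5.0, 15.0])
codes = np.array([100, 100, 200, 200])

mask = codes == 100
# Reading through the mask gives only industry 100's gains...
print(np.nanmean(gain[mask]))      # 20.0
# ...and assigning through the mask updates `out` in place for those assets.
out[mask] = np.nanmean(gain[mask])
print(out)                         # [20. 20.  0.  0.]
```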

I would still be interested in any ideas on how to remove the final loop over the industries and vectorize that process. I wonder whether a map/reduce approach might work, but I am still not familiar enough with numpy to do it more efficiently than this for loop. On the bright side, the remaining loop only has about 140 iterations, versus the two prior loops which each went through about 8,000. On top of that, I am no longer building the gains_by_industry and mean_cache dicts, and I avoid all the data copying that went with them. So it is not just faster; it is also far more memory efficient.
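For what it's worth, the remaining per-industry loop can be vectorized, NaNs included, with `np.unique` and `np.bincount`. This is a sketch on made-up toy inputs, not a drop-in replacement tested against the factor above:

```python
import numpy as np

# Toy inputs: one gain is NaN and must not poison its group's mean
gain = np.array([10.0, np.nan, 5.0, 15.0])
codes = np.array([100, 100, 200, 200])

_, inv = np.unique(codes, return_inverse=True)
valid = ~np.isnan(gain)
# Sum and count only the valid entries per group, then divide
sums = np.bincount(inv[valid], weights=gain[valid], minlength=inv.max() + 1)
counts = np.bincount(inv[valid], minlength=inv.max() + 1)
group_mean = sums / counts            # nanmean per industry, no Python loop
out = group_mean[inv]                 # broadcast back to one value per asset
print(out)  # [10. 10. 10. 10.]
```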


UPDATE 4:

Someone gave me a more succinct way to accomplish this, finally eliminating the extra for loop. It hides the loop inside a Pandas DataFrame groupby, but it describes the desired steps much more succinctly:

def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                    "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                    "industry_codes": industries[-1]
                 })
            out[:] = df.groupby("industry_codes").transform(np.mean).values.flatten()
    return GainPctIndFact2()

It does not improve the efficiency at all, according to my benchmarks, but it is probably easier to verify for correctness. The one problem with that example is that it uses np.mean instead of np.nanmean, and simply swapping in np.nanmean drops the NaN values, producing a shape mismatch. To fix the NaN issue, someone else suggested this:

def GainPctInd2(offset=0, nbars=2):
    class GainPctIndFact2(CustomFactor):
        window_length = nbars + offset
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]
        def compute(self, today, assets, out, close, industries):
            df = pd.DataFrame(index=assets, data={
                    "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,
                    "industry_codes": industries[-1]
                 })
            nans = np.isnan(df['industry_codes'])
            notnan = ~nans
            out[notnan] = df[df['industry_codes'].notnull()].groupby("industry_codes").transform(np.nanmean).values.flatten()
            out[nans] = np.nan
    return GainPctIndFact2()
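As an aside (a toy sketch, not from the original post): pandas' own grouped mean already skips NaN gain values while preserving the output shape, so passing the string "mean" instead of np.mean sidesteps that particular mismatch. NaN industry codes still need separate handling as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gain": [10.0, np.nan, 5.0, 15.0],
    "industry_codes": [100, 100, 200, 200],
})
# pandas' built-in grouped mean ignores NaN within each group but still
# returns one value per input row, so no shape mismatch arises.
out = df.groupby("industry_codes")["gain"].transform("mean").to_numpy()
print(out)  # [10. 10. 10. 10.]
```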


– user36048
