Pandas - expanding inverse quantile function

I have a dataframe of values:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
>>> print(df)
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]

I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. I.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex: I'm asking to take the expanding percentile over the entire cross-sectional history.

So the goal is this guy:

      a   b
0    99  99
..   ..  ..
499  58  84

(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
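To make the definition concrete, here is a tiny hypothetical illustration: if row 0 holds {0.2, 0.8} and row 1 holds {0.5, 0.9}, then the percentile of 0.5 at row 1 is taken over all four values seen so far:

from scipy import stats

history = [0.2, 0.8, 0.5, 0.9]                       # everything up to and including row 1
stats.percentileofscore(history, 0.5, kind='weak')   # -> 50.0, since 2 of 4 values are <= 0.5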

I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:

percentile_boundaries_over_time = pd.DataFrame({integer: 
                                     pd.expanding_quantile(df.T.unstack(), integer/100.0) 
                                     for integer in range(0,101,1)})

percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)

for integer in range(0,100,1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
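(For reference, pd.expanding_quantile was removed from later pandas releases; on modern pandas the boundary frame above could be built with the expanding() accessor instead. A minimal sketch, assuming pandas >= 0.18:)

combined = df.T.unstack()  # one long Series in row-major order
percentile_boundaries_over_time = pd.DataFrame({
    integer: combined.expanding().quantile(integer / 100.0)
    for integer in range(0, 101)
})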

I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:

from scipy import stats

perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
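(For context: pd.expanding_apply evaluates column by column, so each x here is a single column's expanding history, not the combined cross-sectional sample described above. On modern pandas, where pd.expanding_apply has been removed, the same attempt would look something like this sketch:)

perc = df.expanding().apply(
    lambda x: stats.percentileofscore(x, x[-1], kind='weak'),
    raw=True,  # pass each window as a plain ndarray so x[-1] is positional
)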

Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!

As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.

Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly sorted data (and at inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, on each row iteration, add the row's values to it and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted, as the sketch below illustrates.
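A minimal illustration of the position-as-quantile idea, on hypothetical data:

import numpy as np

data = np.array([0.1, 0.4, 0.7, 0.9])   # already sorted
pos = np.searchsorted(data, 0.5)        # -> 2: 0.5 would be inserted at index 2
quantile = pos / len(data)              # -> 0.5: half the existing values lie below it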

I think insertion sort would be the fastest sort for this, but a pure-Python implementation would probably be slower than any of NumPy's native sorts. Merge sort seems to be the best of the options available in NumPy. An ideal solution would involve writing some Cython, but using the above strategy with NumPy gets us most of the way.

This is a hand-rolled solution:

import numpy as np
import pandas as pd

def quantiles_by_row(df):
    """ Reconstruct a DataFrame of expanding quantiles by row """

    # Construct skeleton of the DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

    # Pre-allocate a numpy array; we only keep the non-NaN values from the DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)

    # Invariant: sorted_array[:length] holds all data seen so far, in sorted order
    length = 0

    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):

        # Extract the non-NaN values of this row as a numpy array
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]

        # Append the new data to sorted_array and re-sort. Merge sort is cheap
        # here because all but the newly added elements are already in order.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")

        # Query the relative positions; divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length

        # Insert values into quantile_df; direct positional assignment avoids
        # chained indexing, which does not write through on modern pandas
        quantile_df.iloc[i, np.flatnonzero(~row_is_nan)] = quantile_row

    return quantile_df
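Applied to data shaped like the question's, usage looks like this (a hypothetical example; the scaling to 0-100 just matches the desired output):

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantile_df = quantiles_by_row(df)               # values in [0, 1)
percentile_df = (quantile_df * 100).astype(int)  # 0-100 scale as in the question's goal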

Based on the data that bhalperin provided (offline), this solution is up to 10x faster.

One final comment: np.searchsorted has options for 'left' and 'right', which determine whether your prospective insertion position should be the first or the last suitable position. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':

# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
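(This averaging mirrors the kind='mean' convention of scipy.stats.percentileofscore, which reports the average of the 'weak' and 'strict' percentiles.)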

Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so flattening the combined values down to a 1D array helps in getting correct results:

from scipy import stats

a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Flatten values from all columns of all rows up to and including this one
    combined = preceding_rows.values.ravel()
    a_percentile[current_index] = stats.percentileofscore(
        combined, 
        df.loc[current_index, 'a'], 
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined, 
        df.loc[current_index, 'b'], 
        kind='weak'
    )
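Note that this recomputes the percentile against the full history for every row, so it is quadratic in the total number of values: fine for a 500x2 frame, but slow on large ones, which is where the sorted-array approach above pays off.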

It's not quite clear what you're after, but do you want a cumulative sum divided by the total?

norm = 100.0 / df.a.sum()
df['cum_a'] = df.a.cumsum() * norm

Ditto for b.
