How do I check the % of values in an array that exist in another array in pandas?

I have a DataFrame in pandas that looks like this:

    app_id_x    period  app_id_y
10  [pb6uhl15, xn66n2cr, e68t39yp, s7xun0k1, wab2z...   2015-19 NaN
11  [uscm6kkb, tja4ma8u, qcwhw33w, ux5bbkjz, mmt3s...   2015-20 NaN
12  [txdbauhy, dib24pab, xt69u57g, n9e6a6ol, d9f7m...   2015-21 NaN
13  [21c2b5ca5e7066141b2e2aea35d7253b3b8cce11, oht...   2015-22 [g8m4lecv, uyhsx6lo, u9ue1zzo, kw06m3f5, wvqhq...
14  [64lbiaw3, jum7l6yd, a5d00f6aba8f1505ff22bc1fb...   2015-23 [608a223c57e1174fc64775dd2fd8cda387cc4a47, ze4...
15  [gcg8nc8k, jkrelo7v, g9wqigbc, n806bjdu, piqgv...   2015-24 [kz8udlea, zwqo7j8w, 6d02c9d74b662369dc6c53ccc...
16  [uc311krx, wpd7gm75, am8p0spd, q64dcnlm, idosz...   2015-25 [fgs0qhtf, awkcmpns, e0iraf3a, oht91x5j, mv4uo...
17  [wilhuu0x, b51xiu51, ezt7goqr, qj6w7jh6, pkzkv...   2015-26 [zwqo7j8w, dzdfiof5, phwoy1ea, e7hfx7mu, 40fdd...
18  [xn43bho3, uwtjxy6u, ed65xcuj, ejbgjh61, hbvzt...   2015-27 [ze4rr0vi, kw06m3f5, be532399ca86c053fb0a69d13...

What I want to do is, for each period (each row), compute the percentage of app_id_y values that also appear in that row's list of app_id_x values. For example, if ze4rr0vi and gm83klja are both in app_id_x, which contains 53 values in that row, the result should go into a new column called adoption_rate, like this:

period   adoption_rate
2015-9      0%
2015-22     3.56%
2015-25     4.56%
2015-26     5.10%
2015-35     4.58%
2015-36     1.23%

How about this:

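# Percent of distinct app_id_x values that also occur in app_id_y; 0 if either cell is not a list (e.g. NaN).
df['adoption_rate'] = [
    100. * len(set(df.loc[i, 'app_id_x']) & set(df.loc[i, 'app_id_y']))
    / len(set(df.loc[i, 'app_id_x']))
    if isinstance(df.loc[i, 'app_id_x'], list)
    and isinstance(df.loc[i, 'app_id_y'], list)
    else 0.
    for i in df.index
]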

Edit: fixed for the case of duplicate values in any of the arrays.
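
For readability, the same set-based calculation can be pulled out into a small helper. This is only a sketch: the name `overlap_pct` and the toy rows are made up for illustration, and it keeps the distinct app_id_x values as the denominator, as in the one-liner above.

```python
def overlap_pct(x_ids, y_ids):
    """Percent of distinct x_ids that also appear in y_ids."""
    # A missing cell shows up as a float NaN rather than a list.
    if not isinstance(x_ids, list) or not isinstance(y_ids, list):
        return 0.0
    x_set, y_set = set(x_ids), set(y_ids)
    # Guard against an empty list to avoid dividing by zero.
    if not x_set:
        return 0.0
    return 100.0 * len(x_set & y_set) / len(x_set)


# Toy rows (hypothetical ids): (app_id_x, app_id_y)
rows = [
    (['a', 'b', 'c', 'd'], ['b', 'x']),
    (['a', 'a', 'b'], ['a']),
    (['a', 'b'], float('nan')),
]
rates = [overlap_pct(x, y) for x, y in rows]
print(rates)
```

With the DataFrame from the question, this would be applied as `df['adoption_rate'] = [overlap_pct(x, y) for x, y in zip(df['app_id_x'], df['app_id_y'])]`.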

You can use numpy.intersect1d to get the common elements between two arrays, which does the bulk of the work that needs to be done. To get the output, I'm going to write a function to get the overlap percent for a given row, and then use apply to add an adoption_rate column.

import numpy as np

def get_overlap_pcnt(row):
    # Get the overlap between arrays.
    overlap = len(np.intersect1d(row['app_id_x'], row['app_id_y']))

    # Compute the percent common.
    if overlap == 0:
        pcnt = 0
    else:
        # 100.0 keeps this a float division on Python 2 as well.
        pcnt = 100.0 * overlap / len(row['app_id_y'])

    return '{:.2f}%'.format(pcnt)

df['adoption_rate'] = df.apply(get_overlap_pcnt, axis=1)

I couldn't quite tell from your question if you wanted app_id_y or app_id_x to be the denominator, but that's an easy enough change to make. Below is sample output using some sample data I created.

                app_id_x         app_id_y   period adoption_rate
0  [a, b, c, d, e, f, g]              NaN  2015-08         0.00%
1              [b, c, d]     [b, c, d, e]  2015-09        75.00%
2     [a, b, c, x, y, z]        [x, y, z]  2015-10       100.00%
3     [q, w, e, r, t, y]  [a, b, c, d, e]  2015-11        20.00%
4              [x, y, z]        [a, b, x]  2015-12        33.33%
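
To make the denominator choice explicit, here is a hedged variant of the function above. The `denominator` keyword is an addition of this sketch and not part of the original answer. It also treats a missing cell as an empty list, counts distinct values in the denominator, and returns a float instead of a formatted string.

```python
import numpy as np
import pandas as pd


def get_overlap_pcnt(row, denominator='app_id_y'):
    # Treat a missing (NaN) cell as an empty list.
    x = row['app_id_x'] if isinstance(row['app_id_x'], list) else []
    y = row['app_id_y'] if isinstance(row['app_id_y'], list) else []

    # np.intersect1d returns the sorted, unique common elements.
    overlap = len(np.intersect1d(x, y)) if x and y else 0

    # Count distinct values in whichever column is the denominator.
    total = len(set(x if denominator == 'app_id_x' else y))
    return 100.0 * overlap / total if total else 0.0


# Small made-up sample, in the spirit of the table above.
sample = pd.DataFrame({
    'app_id_x': [['b', 'c', 'd'], ['x', 'y', 'z'], ['a', 'b']],
    'app_id_y': [['b', 'c', 'd', 'e'], ['a', 'b', 'x'], np.nan],
})

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
by_y = sample.apply(get_overlap_pcnt, axis=1)
by_x = sample.apply(get_overlap_pcnt, axis=1, denominator='app_id_x')
print(by_y.round(2).tolist())
print(by_x.round(2).tolist())
```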

What the other answers are missing is that this is a really unnatural way to store your data. In general, the values in a pandas DataFrame should be scalars.

A better way to represent your data for the purposes of this problem is to reshape it into two DataFrames, X and Y. In X, the rows are periods and the columns are the ids (e.g. 'g8m4lecv'). An entry in X is 1 if that id is in your app_id_x column for that period and 0 otherwise, and similarly for Y with app_id_y.

This makes it much easier to perform the kinds of operations you want to do.

Here goes:

import pandas as pd
import numpy as np


# from the comment by @jezrael . Super useful, thanks
df = pd.DataFrame({'app_id_x': {10: ['pb6uhl15', 'pb6uhl15', 'pb6uhl15'], 11: ['pb6uhl15', 'pb6uhl15', 'e68t39yp', 's7xun0k1'], 12: [ 'pb6uhl15', 's7xun0k1'], 13: [ 's7xun0k1'], 14: ['pb6uhl15', 'pb6uhl15', 'e68t39yp', 's7xun0k1']}, 'app_id_y': {10: ['pb6uhl15'], 11: ['pb6uhl15'], 12: np.nan, 13: ['pb6uhl15', 'xn66n2cr', 'e68t39yp', 's7xun0k1'], 14: ['e68t39yp', 'xn66n2cr']}, 'period': {10: '2015-19', 11: '2015-20', 12: '2015-21', 13: '2015-22', 14: '2015-23'}})


# pulling the data out of the lists in the starting dataframe
new_data = []
for _,row in df.iterrows():
    for col in ['app_id_x','app_id_y']:
        vals = row[col]
        if isinstance(vals,list):
            for v in set(vals):
                new_data.append((row['period'],col[-1],v,1))

new_df = pd.DataFrame(new_data, columns = ['period','which_app','val','exists'])

# splitting the data into two frames
def get_one_group(app_id):
    return new_df.groupby('which_app').get_group(app_id).drop('which_app', axis=1)

X = get_one_group('x')
Y = get_one_group('y')


# converting to the desired format
def convert_to_indicator_matrix(df):
    return df.set_index(['period','val']).unstack('val').fillna(0)

X = convert_to_indicator_matrix(X)
Y = convert_to_indicator_matrix(Y)

Now, it's super easy to actually solve your problem. I'm not clear on exactly what you need to compute, but suppose you want to know, for each period, number_ids_in_both divided by number_ids_in_Y.

combined = (X * Y).fillna(0)
combined.sum(axis=1) / Y.sum(axis=1)
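
On pandas 0.25 or later, the same indicator matrices can probably be built more compactly with `DataFrame.explode`. The following is a sketch under that assumption, not the method used above. The helper name `indicator_matrix` and the toy ids are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with made-up ids; one cell is missing, as in the question.
df = pd.DataFrame({
    'period': ['2015-19', '2015-20', '2015-21'],
    'app_id_x': [['a', 'a'], ['a', 'b', 'c'], ['a', 'c']],
    'app_id_y': [['a'], ['a', 'd'], np.nan],
})


def indicator_matrix(frame, col):
    # One row per (period, id). A NaN cell stays a single NaN row, which dropna removes.
    long = frame[['period', col]].explode(col).dropna().drop_duplicates()
    # Pivot to periods x ids, with 1 where the id occurs and 0 elsewhere.
    return long.assign(exists=1).set_index(['period', col])['exists'].unstack(fill_value=0)


X = indicator_matrix(df, 'app_id_x')
Y = indicator_matrix(df, 'app_id_y')

# Align on the union of periods and ids so the product is defined everywhere.
X, Y = X.align(Y, join='outer', fill_value=0)

in_both = (X * Y).sum(axis=1)
rate = in_both / Y.sum(axis=1)  # NaN where a period has no app_id_y values
print(rate)
```

Periods with no app_id_y values come out as NaN here rather than 0, because the denominator is zero; `rate.fillna(0)` would convert them if a 0% reading is preferred.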
