How do I check the % of values in an array that exist in another array in pandas?

I have a DataFrame in pandas that looks like this:

    app_id_x    period  app_id_y
10  [pb6uhl15, xn66n2cr, e68t39yp, s7xun0k1, wab2z...   2015-19 NaN
11  [uscm6kkb, tja4ma8u, qcwhw33w, ux5bbkjz, mmt3s...   2015-20 NaN
12  [txdbauhy, dib24pab, xt69u57g, n9e6a6ol, d9f7m...   2015-21 NaN
13  [21c2b5ca5e7066141b2e2aea35d7253b3b8cce11, oht...   2015-22 [g8m4lecv, uyhsx6lo, u9ue1zzo, kw06m3f5, wvqhq...
14  [64lbiaw3, jum7l6yd, a5d00f6aba8f1505ff22bc1fb...   2015-23 [608a223c57e1174fc64775dd2fd8cda387cc4a47, ze4...
15  [gcg8nc8k, jkrelo7v, g9wqigbc, n806bjdu, piqgv...   2015-24 [kz8udlea, zwqo7j8w, 6d02c9d74b662369dc6c53ccc...
16  [uc311krx, wpd7gm75, am8p0spd, q64dcnlm, idosz...   2015-25 [fgs0qhtf, awkcmpns, e0iraf3a, oht91x5j, mv4uo...
17  [wilhuu0x, b51xiu51, ezt7goqr, qj6w7jh6, pkzkv...   2015-26 [zwqo7j8w, dzdfiof5, phwoy1ea, e7hfx7mu, 40fdd...
18  [xn43bho3, uwtjxy6u, ed65xcuj, ejbgjh61, hbvzt...   2015-27 [ze4rr0vi, kw06m3f5, be532399ca86c053fb0a69d13...

What I want to do is, for each period (each period is one row), check the % of app_id_y values that are also in the list of app_id_x values for that row. For example, if ze4rr0vi and gm83klja are within app_id_x, which contains 53 values in that row, then there should be a new column called adoption_rate, which is:

period   adoption_rate
2015-9      0%
2015-22     3.56%
2015-25     4.56%
2015-26     5.10%
2015-35     4.58%
2015-36     1.23%

How about this:

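df['adoption_rate'] = [
    # percent of distinct app_id_x values that also appear in app_id_y
    100. * len(set(df.loc[i, 'app_id_x']) & set(df.loc[i, 'app_id_y']))
    / len(set(df.loc[i, 'app_id_x']))
    # rows where either column is not a list (e.g. NaN) get 0
    if isinstance(df.loc[i, 'app_id_x'], list)
    and isinstance(df.loc[i, 'app_id_y'], list)
    else 0.
    for i in df.index
]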

Edit: fixed for the case of duplicate values in any of the arrays.
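The same set-based idea can be sketched as a small helper, which may be easier to read and test than a single list comprehension. This is only an illustration: the function name adoption_rate, the frame name sample, and the sample data are made up here, and the denominator is the number of distinct app_id_x values, as in the snippet above.

```python
import numpy as np
import pandas as pd

def adoption_rate(x_ids, y_ids):
    """Percent of distinct x_ids that also appear in y_ids (0 if either is missing)."""
    # NaN cells are floats, not lists, so they fall through to 0.
    if not isinstance(x_ids, list) or not isinstance(y_ids, list) or not x_ids:
        return 0.0
    x_set, y_set = set(x_ids), set(y_ids)
    return 100.0 * len(x_set & y_set) / len(x_set)

# Made-up sample data, only for illustration.
sample = pd.DataFrame({
    'period': ['2015-21', '2015-22', '2015-23'],
    'app_id_x': [['a', 'b'], ['a', 'a', 'b', 'c', 'd'], ['x', 'y']],
    'app_id_y': [np.nan, ['a', 'e'], ['x', 'y', 'z']],
})

sample['adoption_rate'] = [
    adoption_rate(x, y) for x, y in zip(sample['app_id_x'], sample['app_id_y'])
]
print(sample[['period', 'adoption_rate']])
```

Because both lists are converted to sets first, the repeated 'a' in the second row is counted once.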

You can use numpy.intersect1d to get the common elements between two arrays, which does the bulk of the work that needs to be done. To get the output, I'm going to write a function to get the overlap percent for a given row, and then use apply to add an adoption_rate column.

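import numpy as np

def get_overlap_pcnt(row):
    x, y = row['app_id_x'], row['app_id_y']

    # A missing value (NaN) or an empty app_id_y means no overlap.
    if not isinstance(x, (list, np.ndarray)) or not isinstance(y, (list, np.ndarray)) or len(y) == 0:
        return '0.00%'

    # Get the overlap between arrays (intersect1d returns the unique common values).
    overlap = len(np.intersect1d(x, y))

    # Compute the percent common, with app_id_y as the denominator.
    # 100.0 keeps this a float division on Python 2 as well.
    pcnt = 100.0 * overlap / len(y)

    return '{:.2f}%'.format(pcnt)

df['adoption_rate'] = df.apply(get_overlap_pcnt, axis=1)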

I couldn't quite tell from your question if you wanted app_id_y or app_id_x to be the denominator, but that's an easy enough change to make. Below is sample output using some sample data I created.

                app_id_x         app_id_y   period adoption_rate
0  [a, b, c, d, e, f, g]              NaN  2015-08         0.00%
1              [b, c, d]     [b, c, d, e]  2015-09        75.00%
2     [a, b, c, x, y, z]        [x, y, z]  2015-10       100.00%
3     [q, w, e, r, t, y]  [a, b, c, d, e]  2015-11        20.00%
4              [x, y, z]        [a, b, x]  2015-12        33.33%
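Switching the denominator could look roughly like the sketch below. The function overlap_pcnt, its denominator parameter, and the column names rate_vs_y and rate_vs_x are hypothetical names introduced for this illustration; the sample rows are taken from the table above. Unlike the function above, it returns plain floats and counts duplicates in the denominator only once.

```python
import numpy as np
import pandas as pd

def overlap_pcnt(row, denominator='app_id_y'):
    """Percent overlap between the two lists, relative to the chosen column."""
    x, y = row['app_id_x'], row['app_id_y']
    # Missing lists (NaN) count as no overlap.
    if not isinstance(x, list) or not isinstance(y, list):
        return 0.0
    base = set(row[denominator])  # distinct ids in the denominator column
    if not base:
        return 0.0
    common = np.intersect1d(x, y)  # unique ids present in both lists
    return 100.0 * len(common) / len(base)

demo = pd.DataFrame({
    'app_id_x': [['a', 'b', 'c', 'd', 'e', 'f', 'g'], ['b', 'c', 'd'],
                 ['q', 'w', 'e', 'r', 't', 'y']],
    'app_id_y': [np.nan, ['b', 'c', 'd', 'e'], ['a', 'b', 'c', 'd', 'e']],
    'period': ['2015-08', '2015-09', '2015-11'],
})

# Extra keyword arguments to DataFrame.apply are passed through to the function.
demo['rate_vs_y'] = demo.apply(overlap_pcnt, axis=1)
demo['rate_vs_x'] = demo.apply(overlap_pcnt, axis=1, denominator='app_id_x')
print(demo[['period', 'rate_vs_y', 'rate_vs_x']])
```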

What the other answers are missing is that this is a really unnatural way to store your data. In general, the values in a pandas DataFrame should be scalars.

A better way to represent your data for the purposes of this problem is to reshape them into two dataframes, X and Y. In X, the rows are periods and the columns are the ids (e.g. 'g8m4lecv'). An entry in the X data frame is 1 if that id is in your app_id_x column in that period and 0 otherwise, and similarly for Y.

This makes it much easier to perform the kinds of operations you want to do.

Here goes:

import pandas as pd
import numpy as np


# from the comment by @jezrael . Super useful, thanks
df = pd.DataFrame({'app_id_x': {10: ['pb6uhl15', 'pb6uhl15', 'pb6uhl15'], 11: ['pb6uhl15', 'pb6uhl15', 'e68t39yp', 's7xun0k1'], 12: [ 'pb6uhl15', 's7xun0k1'], 13: [ 's7xun0k1'], 14: ['pb6uhl15', 'pb6uhl15', 'e68t39yp', 's7xun0k1']}, 'app_id_y': {10: ['pb6uhl15'], 11: ['pb6uhl15'], 12: np.nan, 13: ['pb6uhl15', 'xn66n2cr', 'e68t39yp', 's7xun0k1'], 14: ['e68t39yp', 'xn66n2cr']}, 'period': {10: '2015-19', 11: '2015-20', 12: '2015-21', 13: '2015-22', 14: '2015-23'}})


# pulling the data out of the lists in the starting dataframe
new_data = []
for _,row in df.iterrows():
    for col in ['app_id_x','app_id_y']:
        vals = row[col]
        if isinstance(vals,list):
            for v in set(vals):
                new_data.append((row['period'],col[-1],v,1))

new_df = pd.DataFrame(new_data, columns = ['period','which_app','val','exists'])

# splitting the data into two frames
def get_one_group(app_id):
    return new_df.groupby('which_app').get_group(app_id).drop('which_app', axis=1)

X = get_one_group('x')
Y = get_one_group('y')


# converting to the desired format
def convert_to_indicator_matrix(df):
    return df.set_index(['period','val']).unstack('val').fillna(0)

X = convert_to_indicator_matrix(X)
Y = convert_to_indicator_matrix(Y)

Now, it's super easy to actually solve your problem. I'm not clear on exactly what you need to solve, but suppose you want to know, for each period, number_ids_in_both divided by number_ids_in_Y.

combined = (X * Y).fillna(0)
combined.sum(axis=1) / Y.sum(axis=1)
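On pandas 0.25 or newer, the same indicator matrices can probably be built more compactly with DataFrame.explode and pd.crosstab. This is a swapped-in technique rather than the code above; the helper name indicator_matrix and the small sample frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data, only for illustration.
df = pd.DataFrame({
    'period': ['2015-19', '2015-20', '2015-21'],
    'app_id_x': [['a', 'a', 'b'], ['a', 'c'], ['b']],
    'app_id_y': [['a'], ['b', 'c'], np.nan],
})

def indicator_matrix(frame, col):
    """Rows are periods, columns are ids, entries are 1 if the id occurs."""
    # One row per (period, id); a missing list stays a single NaN row and is dropped.
    long = frame[['period', col]].explode(col).dropna().reset_index(drop=True)
    # crosstab counts occurrences; clip to 1 so duplicate ids count once.
    return pd.crosstab(long['period'], long[col]).clip(upper=1)

X = indicator_matrix(df, 'app_id_x')
Y = indicator_matrix(df, 'app_id_y')

# Give both matrices the same periods and ids, filling absent cells with 0.
X, Y = X.align(Y, fill_value=0)

# number_ids_in_both / number_ids_in_Y per period (NaN where Y has no ids).
rate = (X * Y).sum(axis=1) / Y.sum(axis=1)
print(rate)
```

Periods with no app_id_y values come out as NaN rather than 0, which keeps "no data" distinguishable from "no overlap"; add .fillna(0) if a zero is preferred.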
