How do I check the % of values in an array that exist in another array in pandas?
I have a DataFrame in pandas that looks like this:
app_id_x period app_id_y
10 [pb6uhl15, xn66n2cr, e68t39yp, s7xun0k1, wab2z... 2015-19 NaN
11 [uscm6kkb, tja4ma8u, qcwhw33w, ux5bbkjz, mmt3s... 2015-20 NaN
12 [txdbauhy, dib24pab, xt69u57g, n9e6a6ol, d9f7m... 2015-21 NaN
13 [21c2b5ca5e7066141b2e2aea35d7253b3b8cce11, oht... 2015-22 [g8m4lecv, uyhsx6lo, u9ue1zzo, kw06m3f5, wvqhq...
14 [64lbiaw3, jum7l6yd, a5d00f6aba8f1505ff22bc1fb... 2015-23 [608a223c57e1174fc64775dd2fd8cda387cc4a47, ze4...
15 [gcg8nc8k, jkrelo7v, g9wqigbc, n806bjdu, piqgv... 2015-24 [kz8udlea, zwqo7j8w, 6d02c9d74b662369dc6c53ccc...
16 [uc311krx, wpd7gm75, am8p0spd, q64dcnlm, idosz... 2015-25 [fgs0qhtf, awkcmpns, e0iraf3a, oht91x5j, mv4uo...
17 [wilhuu0x, b51xiu51, ezt7goqr, qj6w7jh6, pkzkv... 2015-26 [zwqo7j8w, dzdfiof5, phwoy1ea, e7hfx7mu, 40fdd...
18 [xn43bho3, uwtjxy6u, ed65xcuj, ejbgjh61, hbvzt... 2015-27 [ze4rr0vi, kw06m3f5, be532399ca86c053fb0a69d13...
What I want to do is, for each period (which is a row), check the % of app_id_y values that are also in the list of app_id_x values for that row. For example, if ze4rr0vi and gm83klja are within app_id_x, which contains 53 values in that row, then there should be a new column called adoption_rate, which is:
period adoption_rate
2015-9 0%
2015-22 3.56%
2015-25 4.56%
2015-26 5.10%
2015-35 4.58%
2015-36 1.23%
How about this:
# Percent of unique app_id_x values that also appear in app_id_y;
# rows where either column is not a list (e.g. NaN) get 0.
df['adoption_rate'] = [100.*len(set(df.loc[i,'app_id_x']) &\
                                set(df.loc[i,'app_id_y']))/len(set(df.loc[i,'app_id_x']))\
                       if type(df.loc[i,'app_id_x'])==list and \
                          type(df.loc[i,'app_id_y'])==list \
                       else 0. for i in df.index]
Edit: fixed for the case of duplicate values in any of the arrays.
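The same set-based idea can be pulled out into a small helper, which makes the NaN handling and the choice of denominator easier to see. This is only a sketch: the helper name adoption_rate and the tiny sample frame are made up for illustration, and it keeps the list comprehension's choice of unique app_id_x values as the denominator.

```python
import numpy as np
import pandas as pd

def adoption_rate(x_ids, y_ids):
    """Percent of unique x_ids that also appear in y_ids (0.0 if either is not a list)."""
    if not isinstance(x_ids, list) or not isinstance(y_ids, list):
        return 0.0
    x_set, y_set = set(x_ids), set(y_ids)
    if not x_set:
        # Assumption: an empty app_id_x list counts as 0% rather than an error.
        return 0.0
    return 100.0 * len(x_set & y_set) / len(x_set)

# Hypothetical sample data, including a NaN row and duplicate ids.
df = pd.DataFrame({
    "period": ["2015-19", "2015-22", "2015-23"],
    "app_id_x": [["a", "b"], ["a", "b", "c", "c"], ["x", "y"]],
    "app_id_y": [np.nan, ["a", "z"], ["x", "y", "y"]],
})

df["adoption_rate"] = [
    adoption_rate(x, y) for x, y in zip(df["app_id_x"], df["app_id_y"])
]
```

To divide by the number of unique app_id_y values instead, swap len(x_set) for len(y_set) and adjust the empty-list guard to match.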
You can use numpy.intersect1d to get the common elements between two arrays, which does the bulk of the work that needs to be done. To get the output, I'm going to write a function to get the overlap percent for a given row, and then use apply to add an adoption_rate column.
import numpy as np

def get_overlap_pcnt(row):
    # Get the overlap between arrays.
    overlap = len(np.intersect1d(row['app_id_x'], row['app_id_y']))
    # Compute the percent common.
    if overlap == 0:
        pcnt = 0
    else:
        pcnt = 100.0*overlap/len(row['app_id_y'])
    return '{:.2f}%'.format(pcnt)

df['adoption_rate'] = df.apply(get_overlap_pcnt, axis=1)
I couldn't quite tell from your question if you wanted app_id_y or app_id_x to be the denominator, but that's an easy enough change to make. Below is sample output using some sample data I created.
app_id_x app_id_y period adoption_rate
0 [a, b, c, d, e, f, g] NaN 2015-08 0.00%
1 [b, c, d] [b, c, d, e] 2015-09 75.00%
2 [a, b, c, x, y, z] [x, y, z] 2015-10 100.00%
3 [q, w, e, r, t, y] [a, b, c, d, e] 2015-11 20.00%
4 [x, y, z] [a, b, x] 2015-12 33.33%
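Since the denominator is a judgment call, one option is to make it a parameter. The sketch below is not part of the answer above: the function name overlap_pcnt and its denominator argument are hypothetical, it treats non-list values such as NaN as 0% explicitly, and it counts unique ids in the denominator (which matches the sample output above because that data has no duplicates).

```python
import numpy as np

def overlap_pcnt(x_ids, y_ids, denominator="y"):
    """Percent of unique ids shared by both lists, relative to the chosen list."""
    # Assumption: missing or empty lists count as 0% overlap.
    if not isinstance(x_ids, list) or not isinstance(y_ids, list):
        return 0.0
    if len(x_ids) == 0 or len(y_ids) == 0:
        return 0.0
    common = np.intersect1d(x_ids, y_ids)  # sorted unique common values
    base = np.unique(y_ids if denominator == "y" else x_ids)
    return 100.0 * len(common) / len(base)

pct_of_y = overlap_pcnt(["b", "c", "d"], ["b", "c", "d", "e"])
pct_of_x = overlap_pcnt(["b", "c", "d"], ["b", "c", "d", "e"], denominator="x")
```

It can be applied per row with something like df.apply(lambda r: overlap_pcnt(r['app_id_x'], r['app_id_y']), axis=1), formatting the result as a string afterwards if needed.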
What the other answers are missing is that this is a really unnatural way to store your data. In general, the values in a pandas DataFrame should be scalars.
A better way to represent your data for the purposes of this problem is to reshape it into two DataFrames, X and Y. In X, the rows are periods and the columns are the ids (e.g. 'g8m4lecv'). The entries in X are 1 if the id is in your app_id_x column in that period and 0 otherwise, and similarly for Y.
This makes it much easier to perform the kinds of operations you want to do.
Here goes:
import pandas as pd
import numpy as np
# from the comment by @jezrael . Super useful, thanks
df = pd.DataFrame({'app_id_x': {10: ['pb6uhl15', 'pb6uhl15', 'pb6uhl15'], 11: ['pb6uhl15', 'pb6uhl15', 'e68t39yp', 's7xun0k1'], 12: [ 'pb6uhl15', 's7xun0k1'], 13: [ 's7xun0k1'], 14: ['pb6uhl15', 'pb6uhl15', 'e68t39yp', 's7xun0k1']}, 'app_id_y': {10: ['pb6uhl15'], 11: ['pb6uhl15'], 12: np.nan, 13: ['pb6uhl15', 'xn66n2cr', 'e68t39yp', 's7xun0k1'], 14: ['e68t39yp', 'xn66n2cr']}, 'period': {10: '2015-19', 11: '2015-20', 12: '2015-21', 13: '2015-22', 14: '2015-23'}})
# pulling the data out of the lists in the starting dataframe
new_data = []
for _,row in df.iterrows():
    for col in ['app_id_x','app_id_y']:
        vals = row[col]
        if isinstance(vals,list):
            for v in set(vals):
                new_data.append((row['period'],col[-1],v,1))
new_df = pd.DataFrame(new_data, columns = ['period','which_app','val','exists'])
# splitting the data into two frames
def get_one_group(app_id):
    return new_df.groupby('which_app').get_group(app_id).drop('which_app', axis=1)
X = get_one_group('x')
Y = get_one_group('y')
# converting to the desired format
def convert_to_indicator_matrix(df):
    return df.set_index(['period','val']).unstack('val').fillna(0)
X = convert_to_indicator_matrix(X)
Y = convert_to_indicator_matrix(Y)
Now it's easy to actually solve your problem. I'm not clear on exactly what you need to compute, but suppose you want to know, for each period, number_ids_in_both divided by number_ids_in_Y:
combined = (X * Y).fillna(0)
combined.sum(axis=1) / Y.sum(axis=1)
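For comparison, here is a compact sketch of the same indicator-matrix idea built a different way, with DataFrame.explode (pandas 0.25 or later) and pd.crosstab instead of the explicit loop and unstack. The helper name indicator_matrix and the small sample frame are made up for illustration. Periods with no app_id_y values come out as NaN because the division is 0/0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with duplicates and a missing app_id_y list.
df = pd.DataFrame({
    "period": ["2015-19", "2015-20", "2015-21"],
    "app_id_x": [["a", "a", "b"], ["a", "c"], ["b"]],
    "app_id_y": [["a"], ["b", "c"], np.nan],
})

def indicator_matrix(frame, col):
    """Rows are periods, columns are ids, entries are 1 where the id is present."""
    long = (
        frame[["period", col]]
        .explode(col)            # one row per (period, id)
        .dropna(subset=[col])    # drop periods whose list was NaN
        .reset_index(drop=True)
    )
    # crosstab counts occurrences; clip turns duplicate counts into 0/1 flags.
    return pd.crosstab(long["period"], long[col]).clip(upper=1)

X = indicator_matrix(df, "app_id_x")
Y = indicator_matrix(df, "app_id_y")

# Align both matrices on the same periods and ids before multiplying.
periods = df["period"].tolist()
ids = X.columns.union(Y.columns)
X = X.reindex(index=periods, columns=ids, fill_value=0)
Y = Y.reindex(index=periods, columns=ids, fill_value=0)

# number_ids_in_both / number_ids_in_Y for each period.
rate = (X * Y).sum(axis=1) / Y.sum(axis=1)
```

Multiply rate by 100 to get a percentage, and use rate.fillna(0) if periods without any app_id_y values should count as 0.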