Segmenting a dataframe by matching a portion of rows
I would like to select only the predominant part of a DataFrame. For example, given:
id_B, supportProgress
id1, A
id1, A
id1, A
id1, A
id1, A
id1, B
id1, B
The desired output is:
id_B, supportProgress
id1, A
id1, A
id1, A
id1, A
id1, A
I cannot apply a simple filter because I don't know the values of supportProgress in advance. In another DataFrame it could be supportProgress = C,C,C,C,C,D,D, and I want to select only the part corresponding to C,C,C,C,C.
My idea is to do a df.groupby(['supportProgress']) and select the portion that covers more than 80% of len(df).
I don't know about the 80% part, but to get the rows with the most frequent supportProgress value, you can use the following:
df[df['supportProgress'] == df['supportProgress'].value_counts().index[0]]
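To make the one-liner above easier to try, here is a minimal self-contained sketch. The sample frame is rebuilt from the data in the question; the variable names top and predominant are only illustrative. It relies on value_counts() returning counts sorted in descending order, so index[0] is the most frequent value (if two values tie, which one comes first is not something to rely on).

```python
import pandas as pd

# Sample data rebuilt from the question: five rows of "A", two rows of "B"
df = pd.DataFrame({
    "id_B": ["id1"] * 7,
    "supportProgress": ["A"] * 5 + ["B"] * 2,
})

# value_counts() sorts by count (descending), so index[0] is the most frequent value
top = df["supportProgress"].value_counts().index[0]

# Keep only the rows that carry that most frequent value
predominant = df[df["supportProgress"] == top]
print(predominant)
```

This ignores the 80% requirement entirely: it always keeps the single most common value, whatever its share.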
You need value_counts first:
a = df['supportProgress'].value_counts(normalize=True)
print (a)
A 0.714286
B 0.285714
Name: supportProgress, dtype: float64
# keep the values whose share of rows is above 80%
b = a.index[a > .8]
# if no value qualifies, fall back to all values
b = a.index if b.empty else b
print (b)
Index(['A', 'B'], dtype='object')
# finally, filter the rows by the selected values
df = df[df['supportProgress'].isin(b)]
print (df)
id_B supportProgress
0 id1 A
1 id1 A
2 id1 A
3 id1 A
4 id1 A
5 id1 B
6 id1 B
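Note that with the sample data A covers only about 71% of the rows, so nothing passes the 80% cut and the fallback keeps everything. The steps above could be wrapped in a small helper so the threshold is easy to adjust; this is only a sketch, and the function name select_predominant and its parameters are my own, not part of pandas.

```python
import pandas as pd

def select_predominant(df, column, threshold=0.8):
    """Keep rows whose value in `column` covers more than `threshold` of all rows.

    Hypothetical helper: returns the frame unchanged when no value qualifies,
    which mirrors the fallback used in the steps above.
    """
    # Relative frequency of each distinct value
    shares = df[column].value_counts(normalize=True)
    # Values whose share is strictly above the threshold
    keep = shares.index[shares > threshold]
    if keep.empty:
        return df
    return df[df[column].isin(keep)]

# Sample data rebuilt from the question: "A" covers 5/7 of the rows
df = pd.DataFrame({
    "id_B": ["id1"] * 7,
    "supportProgress": ["A"] * 5 + ["B"] * 2,
})

strict = select_predominant(df, "supportProgress", threshold=0.8)
loose = select_predominant(df, "supportProgress", threshold=0.7)
print(len(strict))  # 7: no value exceeds 80%, so all rows are kept
print(len(loose))   # 5: only the "A" rows exceed 70%
```

With threshold=0.7 the result matches the output asked for in the question.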