
Filter out DataFrame rows that have an insufficient number of observations, based on a column defining a category, in pandas

I have a DataFrame with a column that divides the data set into a set of categories. I would like to remove those categories that have a small number of observations.

Example

import pandas as pd
df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'], 'v': [5, 2, 7, 1, 2, 8, 3]})

    c  v
0  c1  5
1  c2  2
2  c1  7
3  c3  1
4  c4  2
5  c5  8
6  c2  3

For column c and n = 2, remove every row whose value in column c occurs fewer than n times, resulting in:

    c  v
0  c1  5
1  c2  2
2  c1  7
3  c2  3

You can use pd.Series.value_counts, followed by Boolean indexing via pd.Series.isin:

counts = df['c'].value_counts()  # create series of counts
idx = counts[counts < 2].index   # filter for indices with < 2 counts

res = df[~df['c'].isin(idx)]     # filter dataframe

print(res)

    c  v
0  c1  5
1  c2  2
2  c1  7
6  c2  3
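The same idea can be wrapped in a small helper so that the column and the threshold n are parameters rather than hard-coded. The following is only a sketch: the function name drop_rare_categories is hypothetical, and it assumes that mapping the counts back onto each row is acceptable in place of the isin step above.

```python
import pandas as pd


def drop_rare_categories(df, col, n):
    """Keep only rows whose value in `col` occurs at least `n` times.

    Hypothetical helper for illustration; not part of pandas.
    """
    counts = df[col].value_counts()   # occurrences of each category
    keep = df[col].map(counts) >= n   # per-row count compared with the threshold
    return df[keep]


df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})

res = drop_rare_categories(df, 'c', 2)
print(res)
```

Note that value_counts ignores missing values by default, so rows where the category itself is NaN would be dropped by this sketch.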

This can also be achieved using groupby, as below:

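mask = df.groupby('c').count().reset_index()  # non-null 'v' count per category
mask = mask.loc[mask['v'] < 2]                # categories with fewer than 2 observations
res = df[~df.c.isin(mask.c.values)]           # drop rows belonging to those categories
print(res)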

output:

    c  v
0  c1  5
1  c2  2
2  c1  7
6  c2  3
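Two other groupby variants may be worth considering; they avoid building the intermediate mask frame. This is a sketch under the same example data, with n as an assumed variable name for the threshold. One difference to keep in mind: count() above counts non-null values of v, whereas 'size' and len(g) count rows.

```python
import pandas as pd

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})
n = 2  # minimum number of rows a category needs in order to be kept

# Variant 1: broadcast each group's size back onto its rows, then filter.
group_sizes = df.groupby('c')['c'].transform('size')
res_transform = df[group_sizes >= n]

# Variant 2: let groupby.filter keep or drop whole groups.
res_filter = df.groupby('c').filter(lambda g: len(g) >= n)

print(res_transform)
print(res_filter)
```

The transform variant is usually the faster of the two on data with many groups, since filter calls the Python lambda once per group.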


 