I have the following pandas DataFrame:
import pandas as pd
df = pd.read_csv("filename1.csv")
df
column1 column2 column3
0 10 A 1
1 15 A 1
2 19 B 1
3 5071 B 0
4 5891 B 0
5 3210 B 0
6 12 B 2
7 13 C 2
8 20 C 0
9 5 C 3
10 9 C 3
Now, calling value_counts() gives me the count of each value in a given column, e.g.:
df.column3.value_counts()
1 3
2 2
3 2
However, I would like to subset a pandas DataFrame based on how often each value occurs in a given column. For example, in the above DataFrame df, I would like to keep only the rows whose column3 value occurs 3 or more times (excluding 0). In this case, the resulting DataFrame would be
df
column1 column2 column3
0 10 A 1
1 15 A 1
2 19 B 1
The rows with values 2 and 3 are dropped because each of those values occurs only twice in column3. What is the pandas way to do this?
You can use groupby.filter. Inside the filter, return a single boolean for each group; the rows of a group are kept only when that boolean is True:
df.groupby("column3").filter(lambda g: (g.name != 0) and (g.column3.size >= 3))
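For anyone who wants to try this without the CSV file, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt inline from the question's data; the helper name `keep_frequent` and its parameters `min_count` and `exclude` are made up for illustration and are not part of pandas. Instead of `g.name`, it reads the group's value from the first row of the group, which should be equivalent here because every row in a group shares the same column3 value.

```python
import pandas as pd

# Same data as in the question, built inline instead of read from a CSV.
df = pd.DataFrame({
    "column1": [10, 15, 19, 5071, 5891, 3210, 12, 13, 20, 5, 9],
    "column2": ["A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"],
    "column3": [1, 1, 1, 0, 0, 0, 2, 2, 0, 3, 3],
})

def keep_frequent(frame, column, min_count=3, exclude=0):
    """Hypothetical helper: keep rows whose value in `column` occurs
    at least `min_count` times, dropping the `exclude` value entirely."""
    def group_ok(g):
        # All rows in a group share one value, so the first row identifies it.
        value = g[column].iloc[0]
        # filter() expects exactly one boolean per group.
        return bool(value != exclude and len(g) >= min_count)
    return frame.groupby(column).filter(group_ok)

result = keep_frequent(df, "column3")
print(result)
```

groupby.filter keeps the original row order and index labels, so the surviving rows come back as rows 0, 1 and 2 of the original frame.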
Another option could be:
df[(df.column3 != 0) & (df.groupby("column3").column3.transform("size") >= 3)]
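To see why the transform version works, it can help to look at the intermediate Series. The sketch below uses only column3 from the question's data (rebuilt inline as an assumption, since the CSV is not available); the variable names `sizes` and `mask` are just for illustration.

```python
import pandas as pd

# Only column3 is needed to show the mechanics.
df = pd.DataFrame({"column3": [1, 1, 1, 0, 0, 0, 2, 2, 0, 3, 3]})

# transform("size") broadcasts each group's row count back onto its rows,
# so the result is aligned with df and has the same length.
sizes = df.groupby("column3")["column3"].transform("size")
print(sizes.tolist())

# Combine the "not zero" condition with the "occurs at least 3 times" condition.
mask = (df["column3"] != 0) & (sizes >= 3)
result = df[mask]
print(result)
```

Because the mask is an ordinary boolean Series, it can be combined with any other row-level condition using & and |, which is harder to do inside a groupby.filter lambda.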
Or you can filter out the zeros before grouping:
df[df['column3'] != 0].groupby("column3").filter(lambda x: x['column3'].size >= 3)
Alternative solution:
In [132]: cnt = df.column3.value_counts()
In [133]: cnt
Out[133]:
0 4
1 3
3 2
2 2
Name: column3, dtype: int64
In [134]: v = cnt[(cnt.index != 0) & (cnt >= 3)].index.values
In [135]: v
Out[135]: array([1], dtype=int64)
In [136]: df.query("column3 in @v")
Out[136]:
column1 column2 column3
0 10 A 1
1 15 A 1
2 19 B 1
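The same value_counts idea can also be written with isin instead of query, which avoids the string expression and the @v lookup. This is only a sketch on the question's data, rebuilt inline; the names `counts` and `keep` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [10, 15, 19, 5071, 5891, 3210, 12, 13, 20, 5, 9],
    "column2": ["A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"],
    "column3": [1, 1, 1, 0, 0, 0, 2, 2, 0, 3, 3],
})

# Count occurrences of each distinct value in column3.
counts = df["column3"].value_counts()

# Values to keep: not 0, and seen at least 3 times.
keep = counts[(counts.index != 0) & (counts >= 3)].index

# isin builds a boolean mask marking the rows whose value is in `keep`.
result = df[df["column3"].isin(keep)]
print(result)
```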