I have a dataframe that looks like this:
df
Out[42]:
Unnamed: 0 Unnamed: 0.1 Region GeneID DistanceValue
0 25520 25520 Olfactory areas 69835573 -1.000000
1 25521 25521 Olfactory areas 583846 -1.000000
2 25522 25522 Olfactory areas 68667661 -1.000000
3 25523 25523 Olfactory areas 70474965 -1.000000
4 25524 25524 Olfactory areas 68341920 -1.000000
... ... ... ... ...
15662 1072369 1072369 Cerebellum unspecific 74743327 -0.960186
15663 1072370 1072370 Cerebellum unspecific 69530983 -0.960139
15664 1072371 1072371 Cerebellum unspecific 68442853 -0.960129
15665 1072372 1072372 Cerebellum unspecific 74514339 -0.960038
15666 1072373 1072373 Cerebellum unspecific 70724637 -0.960003
[15667 rows x 5 columns]
I want to count how often each GeneID occurs and create a new df that contains only the rows whose GeneID appears more than 5 times. So I did:
genelist = df.pivot_table(index=['GeneID'], aggfunc='size')
sort_genelist = genelist.sort_values(axis=0,ascending=False)
sort_genelist
Out[44]:
GeneID
631707 11
68269286 10
633269 10
70302366 9
74357905 9
..
70784714 1
70784824 1
70784898 1
70784916 1
70528527 1
Length: 7875, dtype: int64
So now I want my df dataframe to contain only the rows whose GeneID was counted more than 5 times.
Use Series.isin to build a mask from the index values of sort_genelist whose count is greater than 5, then filter with boolean indexing:
df = df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 5])]
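The same idea can be tried end to end on a small frame. This is only a sketch: the toy data and the names toy, counts, frequent_ids and filtered are made up for illustration, value_counts is swapped in for pivot_table to get the per-ID counts, and the threshold is lowered to 2 so the effect is visible on eight rows.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the real dataframe.
toy = pd.DataFrame({
    "GeneID": [1, 1, 1, 3, 4, 5, 5, 4],
    "DistanceValue": [-1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3],
})

# Number of occurrences per GeneID; the result is indexed by GeneID.
counts = toy["GeneID"].value_counts()

# IDs that occur more than 2 times (the question uses 5).
frequent_ids = counts.index[counts > 2]

# Boolean mask: keep only rows whose GeneID is among the frequent IDs.
filtered = toy[toy["GeneID"].isin(frequent_ids)]
print(filtered)
```

Only the three rows with GeneID 1 should remain, since every other ID occurs at most twice.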
I think the best way to do what you asked is to add a per-row count column and filter on it:
df['gene_id_count'] = df.groupby('GeneID')['GeneID'].transform('size')
df.loc[df['gene_id_count'] > 5, :]
Let's take this tiny example:
>>> df = pd.DataFrame({'GeneID': [1,1,1,3,4,5,5,4], 'ID': range(8)})
>>> df
GeneID ID
0 1 0
1 1 1
2 1 2
3 3 3
4 4 4
5 5 5
6 5 6
7 4 7
And use a threshold of 2 occurrences (instead of 5):
>>> min_gene_id_count = 2
>>> df['gene_id_count'] = df.groupby('GeneID')['GeneID'].transform('size')
>>> df
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
3 3 3 1
4 4 4 2
5 5 5 2
6 5 6 2
7 4 7 2
>>> df.loc[df['gene_id_count'] > min_gene_id_count , :]
GeneID ID gene_id_count
0 1 0 3
1 1 1 3
2 1 2 3
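If the helper column is not wanted in the result, the per-row counts can be kept in a temporary Series instead. The following is a minimal sketch under that assumption; keep_frequent, example and result are hypothetical names introduced here, not part of the original answer.

```python
import pandas as pd

def keep_frequent(frame, column, min_count):
    """Return the rows whose value in `column` occurs more than `min_count` times."""
    # Size of each row's group, aligned with the index of `frame`.
    sizes = frame.groupby(column)[column].transform("size")
    # Filter with the boolean mask; no extra column is added to `frame`.
    return frame[sizes > min_count]

example = pd.DataFrame({"GeneID": [1, 1, 1, 3, 4, 5, 5, 4], "ID": range(8)})
result = keep_frequent(example, "GeneID", 2)
print(result)
```

With the threshold of 2 this should keep the same three rows as above, while the frame retains only its original GeneID and ID columns.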