Creating new pandas dataframe from pivottable condition

I have a dataframe that looks like this:

df
Out[42]: 
       Unnamed: 0  Unnamed: 0.1                 Region    GeneID  DistanceValue
0           25520         25520        Olfactory areas  69835573      -1.000000
1           25521         25521        Olfactory areas    583846      -1.000000
2           25522         25522        Olfactory areas  68667661      -1.000000
3           25523         25523        Olfactory areas  70474965      -1.000000
4           25524         25524        Olfactory areas  68341920      -1.000000
          ...           ...                    ...       ...            ...
15662     1072369       1072369  Cerebellum unspecific  74743327      -0.960186
15663     1072370       1072370  Cerebellum unspecific  69530983      -0.960139
15664     1072371       1072371  Cerebellum unspecific  68442853      -0.960129
15665     1072372       1072372  Cerebellum unspecific  74514339      -0.960038
15666     1072373       1072373  Cerebellum unspecific  70724637      -0.960003

[15667 rows x 5 columns]

I want to count how often each GeneID occurs and create a new dataframe that contains only the rows whose GeneID appears more than 5 times. So I did:

genelist =  df.pivot_table(index=['GeneID'], aggfunc='size')
sort_genelist = genelist.sort_values(axis=0,ascending=False)

sort_genelist
Out[44]: 
GeneID
631707      11
68269286    10
633269      10
70302366     9
74357905     9
            ..
70784714     1
70784824     1
70784898     1
70784916     1
70528527     1
Length: 7875, dtype: int64

So now I want df to contain only the rows whose GeneID was counted more than 5 times.

Use Series.isin to build a mask from the index values of sort_genelist whose count is greater than 5, then filter with boolean indexing:

df = df[df['GeneID'].isin(sort_genelist.index[sort_genelist > 5])]
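As a small self-contained check of this approach, here is a sketch on made-up toy data (not the asker's dataframe) with a threshold of 2 instead of 5. The names toy, counts, threshold and frequent_ids are illustrative, and the counting step uses groupby(...).size() as a stand-in for the pivot_table call in the question; both give one count per GeneID.

```python
import pandas as pd

# Hypothetical toy data: GeneID 1 occurs 3 times, 4 and 5 twice, 3 once.
toy = pd.DataFrame({'GeneID': [1, 1, 1, 3, 4, 5, 5, 4], 'ID': range(8)})

# One count per GeneID, as a Series indexed by GeneID.
counts = toy.groupby('GeneID').size()

threshold = 2  # illustrative; the question uses 5

# GeneIDs whose count exceeds the threshold.
frequent_ids = counts.index[counts > threshold]

# Keep only the rows whose GeneID is among the frequent ones.
filtered = toy[toy['GeneID'].isin(frequent_ids)]
print(filtered)
```

Only the rows for GeneID 1 should remain here, since it is the only ID that occurs more than twice.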

I think that the best way to do what you have asked is:

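# Count the rows per GeneID and broadcast that count back to every row
df['gene_id_count'] = df.groupby('GeneID')['GeneID'].transform('size')
# Keep only the rows whose GeneID occurs more than 5 times
df = df.loc[df['gene_id_count'] > 5, :]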

Let's take this tiny example:

>>> import pandas as pd
>>> df = pd.DataFrame({'GeneID': [1,1,1,3,4,5,5,4], 'ID': range(8)})
>>> df
   GeneID  ID
0       1   0
1       1   1
2       1   2
3       3   3
4       4   4
5       5   5
6       5   6
7       4   7

And use a threshold of 2 occurrences (instead of 5), so that only GeneIDs appearing more than twice are kept:

>>> min_gene_id_count = 2

>>> df['gene_id_count'] = df.groupby('GeneID')['GeneID'].transform('size')
>>> df
   GeneID  ID  gene_id_count
0       1   0              3
1       1   1              3
2       1   2              3
3       3   3              1
4       4   4              2
5       5   5              2
6       5   6              2
7       4   7              2


>>> df.loc[df['gene_id_count'] > min_gene_id_count, :]
   GeneID  ID  gene_id_count
0       1   0              3
1       1   1              3
2       1   2              3
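The sketch below checks, on made-up data with several non-key columns, that the transform-based filter and the isin-based filter from the earlier answer select the same rows. The frame toy and its values are hypothetical. Selecting the 'GeneID' column before calling transform is a choice made here so that the result is a single Series no matter how many other columns the frame has.

```python
import pandas as pd

# Hypothetical toy frame with more than one non-key column.
toy = pd.DataFrame({
    'GeneID': [1, 1, 1, 3, 4, 5, 5, 4],
    'Region': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'DistanceValue': [-1.0, -0.99, -0.98, -0.97, -0.96, -0.95, -0.94, -0.93],
})
min_gene_id_count = 2

# Approach 1: per-row group size via transform, then a boolean mask.
row_counts = toy.groupby('GeneID')['GeneID'].transform('size')
by_transform = toy[row_counts > min_gene_id_count]

# Approach 2: per-ID counts, then isin on the IDs above the threshold.
id_counts = toy.groupby('GeneID').size()
by_isin = toy[toy['GeneID'].isin(id_counts.index[id_counts > min_gene_id_count])]

# Both selections should contain exactly the same rows.
print(by_transform.equals(by_isin))
```

The transform variant keeps the count available as a per-row value, which is handy if it is needed later. The isin variant leaves the frame's columns untouched.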

