
Keeping rows in Pandas where the same ID appears more than n times and converting them to a list per ID

I have a DataFrame like below:

     VID  value
1     1    xx
2     2    xx1
3     2    xx2
4     2    xx3
5     2    xx4
6     3    xx
7     3    xx
8     3    xx
9     4    zz1
10    4    zz2
11    4    zz3
12    4    zz4
13    4    zz5

I want to keep only the rows whose VID appears more than n = 3 times. In the example above, I'd keep only rows 2-5 and 9-13 (since only VIDs 2 and 4 appear more than 3 times). The 'value' column is irrelevant to this filtering.

After keeping the needed rows, I'd like to convert my data into a list of lists, appending an 'end' value to each one:

[[xx1, xx2, xx3, xx4, 'end'], [zz1, zz2, zz3, zz4, zz5, 'end']]

Is it possible to get the above without a for loop?

You can group by the VID column and count the rows in each group, then use those counts to index the original df so that only rows whose VID has a count greater than 3 are kept. Example -

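# Count the non-null 'value' entries for each VID
countdf = df.groupby('VID').count()
# Keep only the rows whose VID has more than 3 entries
result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]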

Demo -

In [49]: df
Out[49]:
    VID value
1     1    xx
2     2   xx1
3     2   xx2
4     2   xx3
5     2   xx4
6     3    xx
7     3    xx
8     3    xx
9     4   zz1
10    4   zz2
11    4   zz3
12    4   zz4
13    4   zz5

In [51]: df.groupby('VID').count()
Out[51]:
     value
VID
1        1
2        4
3        3
4        5

In [52]: countdf = df.groupby('VID').count()

In [53]: df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]
Out[53]:
    VID value
2     2   xx1
3     2   xx2
4     2   xx3
5     2   xx4
9     4   zz1
10    4   zz2
11    4   zz3
12    4   zz4
13    4   zz5
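As an alternative to building a separate count frame, the same filter can be sketched with a per-row group size. This is not the method used in the answer above, only a variation on it: it assumes the sample data from the question, and the names `n` and `group_sizes` are introduced here for illustration.

```python
import pandas as pd

# Sample data mirroring the question's DataFrame (index 1..13).
df = pd.DataFrame(
    {
        "VID": [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4],
        "value": ["xx", "xx1", "xx2", "xx3", "xx4", "xx", "xx", "xx",
                  "zz1", "zz2", "zz3", "zz4", "zz5"],
    },
    index=range(1, 14),
)

n = 3  # assumption: keep VIDs that occur more than n times

# transform("size") broadcasts each group's row count back onto its rows,
# so the comparison yields a boolean mask aligned with df.
group_sizes = df.groupby("VID")["VID"].transform("size")
result = df[group_sizes > n]
```

One difference worth keeping in mind: `size` counts every row in a group, while `count` skips missing values, so the two approaches can disagree if the 'value' column contains NaN.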

After this, you can group by VID again, convert each group's values into a list, and then collect those lists into one outer list. Example -

resultlist = result.groupby('VID')['value'].apply(list).tolist()

Demo -

In [54]: result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

In [55]: result.groupby('VID')['value'].apply(list).tolist()
Out[55]: [['xx1', 'xx2', 'xx3', 'xx4'], ['zz1', 'zz2', 'zz3', 'zz4', 'zz5']]

Please note that the above does not put the 'end' value in each list. I am guessing that is not necessary, but if you really want it, you can add it manually after getting the list. Example -

resultlist = [elem + ['end'] for elem in resultlist]
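Putting the steps together, a minimal end-to-end sketch could look like the following. The helper name `values_per_frequent_id` and its parameters are hypothetical, chosen only for this illustration, and the filtering step uses a per-row group size rather than the count frame shown earlier.

```python
import pandas as pd


def values_per_frequent_id(df, n, id_col="VID", value_col="value", end_marker="end"):
    """Hypothetical helper: list the values of every ID that occurs more than
    n times, one list per ID, each terminated by end_marker."""
    # Row-aligned group sizes: each row carries the size of its own group.
    sizes = df.groupby(id_col)[id_col].transform("size")
    kept = df[sizes > n]
    # One Python list of values per remaining ID, in order of first appearance.
    grouped = kept.groupby(id_col, sort=False)[value_col].apply(list)
    # The comprehension still iterates, but only once per group, not per row.
    return [values + [end_marker] for values in grouped]


# Sample data mirroring the question's DataFrame.
df = pd.DataFrame(
    {
        "VID": [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4],
        "value": ["xx", "xx1", "xx2", "xx3", "xx4", "xx", "xx", "xx",
                  "zz1", "zz2", "zz3", "zz4", "zz5"],
    },
    index=range(1, 14),
)

resultlist = values_per_frequent_id(df, n=3)
```

With the sample data, `resultlist` holds the two lists asked for in the question, each ending in 'end'.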

