简体   繁体   中英

Filter Pandas series based on .sum() totals

I have data that contains a row per user, then many columns populated with 1 or 0 based on their interaction with a particular product category.

I am running some correlation analysis, and I'd like to remove the less significant categories to make my analysis easier to read, I used .sum() on my dataframe to see the categories that are interacted with most, but how can I now run correlation on just this set?

Here is the a sample of the outpul from my .sum() :

shoes_and_flats                                                                                           37
nightwear_and_slippers                                                                                    61
shorts_and_shorts                                                                                         23
accessories_and_fragrance                                                                                 25
jackets_and_coats_and_wool                                                                                12
dresses_and_skirts_and_sleeveless_dresses                                                                 35
swimwear_and_bikinis                                                                                      49
dresses_and_skirts_and_floral_dresses                                                                      7
jackets_and_coats_and_harrington_jackets                                                                  18
dresses_and_skirts_and_tunic_dresses                                                                       8
sports_performance_tops_and_vests                                                                          4
jeans_and_bootcut_jeans                                                                                    2
nightwear_and_nightwear                                                                                    1

Created by doing...

totals = df.sum()

I decided that I'd like to remove categories with less than 50 interactions, so I used... totals = totals[1: -1].sort_values() > 50

But that returns all categories regardless of their True or False value.

My end goal is to use .corr() on the data, how can I run this and only return a grid where the categories have more than 50 interactions?

You want to filter the columns in the dataframe. You're on the right track with the True and False results, you just have to use this as a filter

Assuming the data is in a dataframe called df , this will return only the columns you want:

totals = df.sum()
df[totals[totals > 50].index]

I believe you could use:

totals = totals[totals > 50]

Edit: The syntax of the accepted answer above was not working for me so just in case this happens to someone else here is what I found worked

totals = df.sum()
totals = totals[totals > 50]
df_more_than_50 = df.filter(totals.index))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM