I have data that contains one row per user and many columns populated with 1 or 0 based on the user's interaction with a particular product category.
I am running some correlation analysis, and I'd like to remove the less significant categories to make my analysis easier to read. I used .sum() on my dataframe to see which categories are interacted with most, but how can I now run the correlation on just this set?
Here is a sample of the output from my .sum():
shoes_and_flats 37
nightwear_and_slippers 61
shorts_and_shorts 23
accessories_and_fragrance 25
jackets_and_coats_and_wool 12
dresses_and_skirts_and_sleeveless_dresses 35
swimwear_and_bikinis 49
dresses_and_skirts_and_floral_dresses 7
jackets_and_coats_and_harrington_jackets 18
dresses_and_skirts_and_tunic_dresses 8
sports_performance_tops_and_vests 4
jeans_and_bootcut_jeans 2
nightwear_and_nightwear 1
Created by doing...
totals = df.sum()
I decided that I'd like to remove categories with fewer than 50 interactions, so I used...
totals = totals[1: -1].sort_values() > 50
But that returns all categories regardless of their True or False value.
My end goal is to use .corr() on the data. How can I run this and only return a grid where the categories have more than 50 interactions?
You want to filter the columns in the dataframe. You're on the right track with the True and False results; you just have to use them as a filter.
Assuming the data is in a dataframe called df, this will return only the columns you want:
totals = df.sum()
df[totals[totals > 50].index]
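To tie this back to the end goal of calling .corr(), here is a minimal sketch on a small made-up frame. The column values and the variable names threshold, keep and corr are invented for illustration, and the threshold is lowered from 50 to 2 so the toy data has something to filter:

```python
import pandas as pd

# Hypothetical toy data: one row per user, 1/0 per product category.
df = pd.DataFrame({
    "shoes_and_flats": [1, 1, 1, 0],
    "swimwear_and_bikinis": [1, 0, 1, 1],
    "jeans_and_bootcut_jeans": [0, 0, 1, 0],
})

threshold = 2  # the question uses 50; lowered here to suit the toy data

totals = df.sum()                          # interactions per category
keep = totals[totals > threshold].index    # labels of the categories that pass
corr = df[keep].corr()                     # correlation grid for those columns only
print(corr)
```

The resulting grid only contains the categories whose total exceeds the threshold, because the column selection happens before .corr() is called.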
I believe you could use:
totals = totals[totals > 50]
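The difference from the attempt in the question can be seen in a short sketch that uses three of the totals from the sample output (the names mask and filtered are just for illustration). The comparison on its own produces a True/False entry for every category, whereas indexing with it keeps only the categories marked True:

```python
import pandas as pd

# Three totals taken from the sample output in the question.
totals = pd.Series({
    "shoes_and_flats": 37,
    "nightwear_and_slippers": 61,
    "swimwear_and_bikinis": 49,
})

mask = totals > 50       # boolean Series: still one entry per category
filtered = totals[mask]  # only the categories where the mask is True

print(mask)
print(filtered)
```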
Edit: The syntax of the accepted answer above was not working for me, so in case this happens to someone else, here is what worked for me:
totals = df.sum()
totals = totals[totals > 50]
df_more_than_50 = df.filter(totals.index)
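Another way to write the same selection, assuming every column in df is numeric so that the labels of df.sum() line up with the columns, is to pass the boolean Series to .loc as a column mask. This is only a sketch on invented toy data, with the threshold lowered to 2:

```python
import pandas as pd

# Hypothetical toy data: one row per user, 1/0 per product category.
df = pd.DataFrame({
    "shoes_and_flats": [1, 1, 1, 0],
    "swimwear_and_bikinis": [1, 0, 1, 1],
    "jeans_and_bootcut_jeans": [0, 0, 1, 0],
})

mask = df.sum() > 2            # True for categories with more than 2 interactions
df_filtered = df.loc[:, mask]  # keep all rows, only the columns where mask is True
print(df_filtered.columns.tolist())
```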