
Filter Pandas series based on .sum() totals

I have data that contains one row per user, followed by many columns populated with 1 or 0 depending on whether the user interacted with a particular product category.

I am running some correlation analysis, and I'd like to remove the less significant categories to make the results easier to read. I used .sum() on my dataframe to see which categories are interacted with most, but how can I now run the correlation on just this set?

Here is a sample of the output from my .sum():

shoes_and_flats                                                                                           37
nightwear_and_slippers                                                                                    61
shorts_and_shorts                                                                                         23
accessories_and_fragrance                                                                                 25
jackets_and_coats_and_wool                                                                                12
dresses_and_skirts_and_sleeveless_dresses                                                                 35
swimwear_and_bikinis                                                                                      49
dresses_and_skirts_and_floral_dresses                                                                      7
jackets_and_coats_and_harrington_jackets                                                                  18
dresses_and_skirts_and_tunic_dresses                                                                       8
sports_performance_tops_and_vests                                                                          4
jeans_and_bootcut_jeans                                                                                    2
nightwear_and_nightwear                                                                                    1

Created by doing...

totals = df.sum()

I decided that I'd like to remove categories with fewer than 50 interactions, so I used...

totals = totals[1: -1].sort_values() > 50

But that returns all categories regardless of their True or False value.

My end goal is to use .corr() on the data. How can I run this and only return a grid for the categories that have more than 50 interactions?

You want to filter the columns in the dataframe. You're on the right track with the True and False results; you just have to use them as a filter.

Assuming the data is in a dataframe called df, this will return only the columns you want:

totals = df.sum()
df[totals[totals > 50].index]
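
To connect this to the end goal in the question, here is a minimal sketch that applies the same filter and then calls .corr() on the result. The dataframe and its column names (cat_a to cat_d) are made up for illustration and are not from the original data; the threshold of 50 is the one from the question.

```python
import pandas as pd

# Hypothetical data: one row per user, 1/0 per product category.
# Column names are placeholders, not the asker's real categories.
df = pd.DataFrame({
    "cat_a": [1] * 60 + [0] * 40,  # 60 interactions
    "cat_b": [1, 0] * 50,          # exactly 50, so not "more than 50"
    "cat_c": [0] * 45 + [1] * 55,  # 55 interactions
    "cat_d": [1] * 5 + [0] * 95,   # 5 interactions
})

totals = df.sum()

# Keep only the columns whose total is strictly greater than 50.
popular = df[totals[totals > 50].index]

# Correlation grid restricted to the remaining categories.
corr = popular.corr()
print(corr)
```

Note that `> 50` drops a category with exactly 50 interactions; use `>= 50` if that category should be kept.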

I believe you could use:

totals = totals[totals > 50]
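
As a small illustration of why the comparison alone in the question kept every category, the sketch below contrasts the boolean result with the filtered one. The three totals are taken from the sample output in the question; the variable names mask and filtered are only for this example.

```python
import pandas as pd

# A few totals from the question's sample output, shaped like df.sum().
totals = pd.Series({
    "nightwear_and_slippers": 61,
    "swimwear_and_bikinis": 49,
    "shoes_and_flats": 37,
})

mask = totals > 50       # boolean Series: one True/False per category
filtered = totals[mask]  # boolean indexing drops the False entries

print(mask)      # still lists all three categories
print(filtered)  # only the categories above the threshold
```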

Edit: The syntax of the accepted answer above was not working for me, so in case this happens to someone else, here is what I found worked:

totals = df.sum()
totals = totals[totals > 50]  # keep only categories with more than 50 interactions
df_more_than_50 = df.filter(items=totals.index)  # select those columns from df
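
For what it's worth, the sketch below suggests that df.filter and the bracket indexing from the accepted answer select the same columns; it also shows a third spelling with .loc and a boolean mask, which is not from the answers above. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical 1/0 interaction data; column names are placeholders.
df = pd.DataFrame({
    "cat_a": [1] * 60 + [0] * 40,
    "cat_b": [1] * 10 + [0] * 90,
    "cat_c": [0] * 45 + [1] * 55,
})

totals = df.sum()
keep = totals[totals > 50].index  # labels of the columns to keep

via_filter = df.filter(items=keep)  # approach from this answer
via_brackets = df[keep]             # approach from the accepted answer
via_loc = df.loc[:, totals > 50]    # boolean mask over the columns

print(via_filter.equals(via_brackets), via_filter.equals(via_loc))
```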
