Filter a pandas Series by .sum() totals

Question

I have data that contains one row per user, followed by many columns populated with 1 or 0 based on the user's interaction with a particular product category.

I am running some correlation analysis, and I'd like to remove the less significant categories to make my analysis easier to read. I used .sum() on my dataframe to see which categories are interacted with most, but how can I now run the correlation on just this set?

Here is a sample of the output from my .sum():

shoes_and_flats                                                                                           37
nightwear_and_slippers                                                                                    61
shorts_and_shorts                                                                                         23
accessories_and_fragrance                                                                                 25
jackets_and_coats_and_wool                                                                                12
dresses_and_skirts_and_sleeveless_dresses                                                                 35
swimwear_and_bikinis                                                                                      49
dresses_and_skirts_and_floral_dresses                                                                      7
jackets_and_coats_and_harrington_jackets                                                                  18
dresses_and_skirts_and_tunic_dresses                                                                       8
sports_performance_tops_and_vests                                                                          4
jeans_and_bootcut_jeans                                                                                    2
nightwear_and_nightwear                                                                                    1

Created by doing...

totals = df.sum()

I decided that I'd like to remove categories with fewer than 50 interactions, so I used...

totals = totals[1: -1].sort_values() > 50

But that returns all categories regardless of their True or False value.

My end goal is to use .corr() on the data. How can I run this and only return a grid of the categories that have more than 50 interactions?

Answer 1

You want to filter the columns in the dataframe. You're on the right track with the True and False results; you just have to use them as a filter.

Assuming the data is in a dataframe called df, this will return only the columns you want:

totals = df.sum()
df[totals[totals > 50].index]
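
To make the filtering step concrete, here is a minimal sketch on made-up data. The toy dataframe and the threshold of 2 (standing in for the 50 from the question) are assumptions for illustration only. It shows that the comparison alone yields one True/False entry per category, which is why the attempt in the question appeared to return every category, and that indexing with that mask is what actually drops the rest before calling .corr().

```python
import pandas as pd

# Hypothetical toy data: one row per user, one 0/1 column per category.
df = pd.DataFrame({
    "shoes_and_flats": [1, 1, 0, 1],
    "nightwear_and_slippers": [1, 0, 1, 1],
    "jeans_and_bootcut_jeans": [0, 0, 0, 1],
})
threshold = 2  # assumption: stands in for the 50 used in the question

totals = df.sum()           # interactions per category
mask = totals > threshold   # boolean Series, still one entry per category
kept = totals[mask].index   # labels of the categories above the threshold

# Correlation grid restricted to the kept categories.
corr = df[kept].corr()
print(corr)
```

The resulting grid only has rows and columns for the categories whose total exceeds the threshold.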

Answer 2

I believe you could use:

totals = totals[totals > 50]

Edit: The syntax of the accepted answer above was not working for me, so in case this happens to someone else, here is what worked for me:

totals = df.sum()
totals = totals[totals > 50]
df_more_than_50 = df.filter(items=totals.index)
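
For comparison, a short sketch of two equivalent ways to select the columns, again on made-up data with an assumed threshold of 2 in place of 50. One passes the boolean mask straight to df.loc, the other passes the surviving labels to df.filter via its items parameter. Both should give the same dataframe.

```python
import pandas as pd

# Hypothetical toy data: one row per user, one 0/1 column per category.
df = pd.DataFrame({
    "swimwear_and_bikinis": [1, 1, 1, 0],
    "shorts_and_shorts": [0, 1, 0, 0],
    "nightwear_and_slippers": [1, 1, 0, 1],
})
threshold = 2  # assumption: stands in for the 50 used in the question

mask = df.sum() > threshold

# Option 1: boolean mask on the column axis.
by_loc = df.loc[:, mask]

# Option 2: pass the labels that passed the mask to DataFrame.filter.
by_filter = df.filter(items=mask[mask].index)

print(by_loc.equals(by_filter))
```

Note that the question talks about removing categories with fewer than 50 interactions while the code uses a strict > 50; if a category with exactly 50 should be kept, use >= instead.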


Answer 1: score 2, accepted, 2018-06-12 12:10:14

Answer 2: score 0, 2018-06-12 12:10:36
