
Python: feature selection in scikit-learn for a normal distribution

I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is quite sparse with respect to NaNs: there are several hundred events, and most users were invited to at most a few tens of them.

I created some extra columns to measure the "success", which I define simply as the percentage attended relative to invites:

# count(axis=1) counts the non-NaN cells in each row, i.e. the invites
my_data['invited'] = my_data.count(axis=1)
# sum(axis=1) now also includes the new 'invited' column, so subtracting
# it back out leaves just the number of 1s among the event columns
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']
my_data['success'] = my_data['attended'] / my_data['invited']
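As a quick sanity check of that arithmetic, here is a hypothetical 3-user, 3-event frame (the user and event names are made up) run through the same three lines:

```python
import numpy as np
import pandas as pd

# Toy frame: 3 users x 3 events; NaN = not invited.
my_data = pd.DataFrame(
    {"e1": [1.0, 0.0, np.nan],
     "e2": [1.0, np.nan, 1.0],
     "e3": [np.nan, 0.0, 1.0]},
    index=["u1", "u2", "u3"],
)

my_data["invited"] = my_data.count(axis=1)          # non-NaN cells per row
# sum(axis=1) includes the 'invited' column just added, so subtracting it
# leaves only the number of 1s among the event columns
my_data["attended"] = my_data.sum(axis=1) - my_data["invited"]
my_data["success"] = my_data["attended"] / my_data["invited"]

print(my_data["success"].tolist())  # [1.0, 0.0, 1.0]
```

For user u1, for example: invited to e1 and e2 (invited = 2), attended both (attended = 2), so success = 1.0.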

Assume the following is true: the success data should be normally distributed with mean 0.80 and sd 0.10. When I look at the histogram of my_data['success'], it is not normal and is skewed left. Whether this is true in reality is not important; I just want to solve the technical problem I pose below.

So this is my problem: there are some events which I don't think are "good", in the sense that they make the success data diverge from normal. I'd like to do "feature selection" on my events to pick a subset of them that makes the distribution of my_data['success'] as close to normal as possible, in the sense of "convergence in distribution".

I looked at the scikit-learn "feature selection" methods here, and "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.

Constraints: I need to keep at least half the original events.

Any help would be greatly appreciated. Please share as many details as you can; I am very new to these libraries and would love to see how to do this with my DataFrame.

Thanks!

EDIT: After looking some more at the scikit-learn feature selection approaches, "Recursive feature elimination" seems like it might make sense here too, but I'm not sure how to build it up with my "accuracy" metric being "close to normally distributed with mean...".
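Since none of scikit-learn's built-in scorers measure "closeness to a target normal", one possible sketch (not from the original post; the toy data, event names, and the choice of the Kolmogorov–Smirnov statistic as the objective are all my assumptions) is a greedy backward elimination: repeatedly drop the single event whose removal most improves the KS fit of the success scores to N(0.80, 0.10), stopping once only half the events remain or no removal helps:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 users x 10 events, NaN = not invited.
events = [f"event_{i}" for i in range(10)]
raw = rng.choice([1.0, 0.0, np.nan], size=(200, 10), p=[0.4, 0.1, 0.5])
df = pd.DataFrame(raw, columns=events)

def success_scores(df, cols):
    """Per-user attendance rate over the given event columns."""
    sub = df[cols]
    invited = sub.count(axis=1)   # non-NaN cells = invites
    attended = sub.sum(axis=1)    # 1s = attendances
    return attended / invited     # rows with no invites become NaN

def ks_to_target(success, mean=0.80, sd=0.10):
    """KS distance between the success scores and N(mean, sd); lower is better."""
    return stats.kstest(success.dropna(), "norm", args=(mean, sd)).statistic

kept = list(events)
best = ks_to_target(success_scores(df, kept))
while len(kept) > len(events) // 2:   # constraint: keep >= half the events
    # Score every candidate removal and take the best one.
    trial = {c: ks_to_target(success_scores(df, [k for k in kept if k != c]))
             for c in kept}
    drop, score = min(trial.items(), key=lambda kv: kv[1])
    if score >= best:
        break                          # no single removal improves the fit
    kept.remove(drop)
    best = score

print(len(kept), round(best, 3))
```

This is a greedy heuristic in the spirit of recursive feature elimination, not scikit-learn's `RFE` class itself, since `RFE` requires an estimator with per-feature importances rather than an arbitrary distribution-fit objective.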

Keep in mind that feature selection selects features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So I am not sure that feature selection is what you want: as I understand it, you want to remove those samples that cause the skew in your distribution?

Also, what about feature scaling, e.g., standardization, so that your data becomes normally distributed with mean=0 and sd=1?

The equation is simply z = (x - mean) / sd

To apply it to your DataFrame, you can simply do

my_data['success'] = (my_data['success'] - my_data['success'].mean(axis=0)) / (my_data['success'].std(axis=0))

However, don't forget to keep the mean and sd parameters to transform your test data, too. Alternatively, you could also use the StandardScaler from scikit-learn:
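A minimal sketch of that alternative (the toy numbers are made up); the scaler learns the mean and sd on the training data and reuses them on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical success scores for train and test users (column vectors).
train = np.array([[0.70], [0.80], [0.85], [0.90]])
test = np.array([[0.75], [0.95]])

scaler = StandardScaler()
train_z = scaler.fit_transform(train)  # fits mean/sd on train, then scales
test_z = scaler.transform(test)        # reuses the train mean/sd

print(train_z.ravel())  # mean ~0, sd ~1 by construction
print(test_z.ravel())
```

Note that standardization only shifts and rescales the data; it does not make a skewed distribution normal, so it won't by itself fix the left skew described in the question.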
