
Python: feature selection in scikit-learn for a normal distribution

I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited/not relevant). The matrix is quite sparse: there are several hundred events, and most users were invited to at most a few tens of them.
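To make the setup concrete, here is a hypothetical toy version of such a matrix (the user and event names are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy attendance matrix: rows are user IDs, columns are event IDs,
# values are 1 (attended), 0 (did not attend), NaN (not invited).
my_data = pd.DataFrame(
    {
        "event_a": [1.0, 0.0, np.nan, 1.0],
        "event_b": [1.0, np.nan, 1.0, 0.0],
        "event_c": [np.nan, 1.0, 1.0, 1.0],
    },
    index=["user_1", "user_2", "user_3", "user_4"],
)
print(my_data)
```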

I created some extra columns to measure "success", which I define as the percentage of invitations that were attended:

my_data['invited'] = my_data.count(axis=1)   # non-NaN entries = number of invitations
# sum(axis=1) now also includes the new 'invited' column, so subtract it back out
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']
my_data['success'] = my_data['attended'] / my_data['invited']

The success data should be normally distributed with mean 0.80 and sd 0.10. However, when I look at the histogram of my_data['success'], it is not normal and is skewed left. Whether this is true in reality is not important; I just want to solve the technical problem I pose below.
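A quick way to quantify this departure from normality is a skewness measure plus a normality test. A minimal sketch, assuming scipy is available and using synthetic left-skewed data in place of the real success rates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical success rates: bulk near 0.8-1.0 with a long left tail,
# standing in for my_data['success'].
success = 1.0 - rng.beta(2.0, 8.0, size=500)

# Sample skewness: a negative value indicates a left (negative) skew.
print("skewness:", stats.skew(success))

# D'Agostino-Pearson test: a small p-value means "not normal".
stat, p = stats.normaltest(success)
print("normaltest p-value:", p)
```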

There are some events that I don't think are "good", in the sense that they make the success data diverge from normal. I'd like to do "feature selection" on my events: pick a subset of them that makes the distribution of my_data['success'] as close to normal as possible, in the sense of "convergence in distribution".

I looked at the scikit-learn "feature selection" methods here and the "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.

I need to keep at least half of the original events.
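scikit-learn's built-in selectors score features against a supervised target, which doesn't directly fit a distribution-level objective. One way to operationalize "as close to normal as possible" is the Kolmogorov-Smirnov distance to N(0.80, 0.10). Below is a minimal greedy backward-elimination sketch of my own (not a scikit-learn API), using hypothetical synthetic data, that drops one event at a time while the KS distance improves and keeps at least half the events:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical attendance matrix: 200 users x 10 events, NaN = not invited.
raw = (rng.random((200, 10)) < 0.8).astype(float)
raw[rng.random((200, 10)) < 0.5] = np.nan
events = pd.DataFrame(raw, columns=[f"event_{i}" for i in range(10)])

def ks_to_target(df, mean=0.80, sd=0.10):
    """KS distance between per-user success rates and N(mean, sd)."""
    success = (df.sum(axis=1) / df.count(axis=1)).dropna()
    return stats.kstest(success, "norm", args=(mean, sd)).statistic

kept = list(events.columns)
min_keep = len(kept) // 2          # constraint: keep at least half
best = ks_to_target(events[kept])

while len(kept) > min_keep:
    # Try removing each remaining event; take the removal that helps most.
    trials = {c: ks_to_target(events[[k for k in kept if k != c]]) for c in kept}
    candidate, score = min(trials.items(), key=lambda kv: kv[1])
    if score >= best:              # no single removal improves the fit
        break
    kept.remove(candidate)
    best = score

print("kept events:", kept)
print("final KS statistic:", best)
```

Greedy elimination is not guaranteed to find the globally best subset, but it only ever removes an event when doing so strictly improves the fit, so the final KS distance is never worse than the starting one.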

Any help would be greatly appreciated. Please share as many details as you can; I am very new to these libraries and would love to see how to do this with my DataFrame.

Thanks!

After looking some more at the scikit-learn feature selection approaches, "Recursive feature selection" seems like it might make sense here too, but I'm not sure how to build it up with my "accuracy" metric being "close to normally distributed with mean...".

Keep in mind that feature selection selects features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So I am not sure feature selection is what you want: I understand that you want to remove the samples that cause the skew in your distribution?

Also, what about feature scaling, e.g., standardization, so that your data is rescaled to mean = 0 and sd = 1? (Note that this shifts and scales the data but does not change the shape of its distribution.)

The equation is simply z = (x - mean) / sd

To apply it to your DataFrame, you can simply do

my_data['success'] = (my_data['success'] - my_data['success'].mean(axis=0)) / (my_data['success'].std(axis=0))

However, don't forget to keep the mean and SD parameters to transform your test data, too. Alternatively, you could use the StandardScaler from scikit-learn.
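A short sketch of the StandardScaler route, which stores the fitted mean and standard deviation so they can be reused on test data (the train/test arrays here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test success rates, as column vectors (sklearn expects 2D).
train = np.array([[0.70], [0.75], [0.80], [0.85], [0.90]])
test = np.array([[0.60], [0.80], [1.00]])

scaler = StandardScaler()
train_z = scaler.fit_transform(train)   # learns mean_ and scale_ from train
test_z = scaler.transform(test)         # reuses the train statistics

print("train mean/scale:", scaler.mean_, scaler.scale_)
print("standardized train:", train_z.ravel())
```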
