
How to find feature importance in data?

I want to know which columns are correlated with, and have an impact on, no_of_purchased, but I have both numeric data (e.g. total_item) and non-numeric data (e.g. shop_type).

**Table of data (column names)**

  1. shop_id
  2. shop_type (e.g. franchise, ...)
  3. total_item
  4. is_in_business_district
  5. is_creditcard_payment
  6. total_staff_in_shop
  7. no_of_purchased

If I want to find out what impacts no_of_purchased, and I need to include both the numeric and the non-numeric data, which model and method should I use?

Since you want your features to be explainable, the simplest approach is plain linear regression, or regression on handcrafted features. This gives you a weight for each feature (positive or negative) that tells you how strongly, and in which direction, it is associated with the target. Before fitting a linear regression model, you'll need some pre-processing: convert the categorical features to their one-hot encoded form and, ideally, normalize the continuous values so that the weights are comparable across features.

If not linear regression, you could go with ensemble methods (e.g. random forest, XGBoost, or LightGBM; since no_of_purchased is numeric, use the regressor variants rather than the classifiers). They are very easy to use out of the box. Each has a built-in feature importance metric, and they use different criteria to calculate it. You could go through their docs and see which criterion matters more to you.
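For the ensemble route, here is a minimal sketch using scikit-learn's RandomForestRegressor (a regressor, on the assumption that no_of_purchased is a numeric count). The data is again synthetic with an invented relationship. It reads the built-in impurity-based `feature_importances_`, and, as an extra that goes beyond the built-in metric, also computes permutation importance on held-out data for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the shop table; all values are invented for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "shop_type": rng.choice(["franchise", "independent", "kiosk"], size=n),
    "total_item": rng.integers(50, 500, size=n),
    "is_in_business_district": rng.integers(0, 2, size=n),
    "is_creditcard_payment": rng.integers(0, 2, size=n),
    "total_staff_in_shop": rng.integers(1, 20, size=n),
})
# Assumed relationship, only so the example has a signal to find.
df["no_of_purchased"] = (
    0.5 * df["total_item"]
    + 40 * df["is_in_business_district"]
    + 30 * (df["shop_type"] == "franchise")
    + rng.normal(0, 10, size=n)
)

# Trees do not need scaling; the categorical column still has to become numeric.
X = pd.get_dummies(df.drop(columns="no_of_purchased"), columns=["shop_type"])
y = df["no_of_purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Built-in importance: mean decrease in impurity, normalized to sum to 1.
impurity_importance = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Permutation importance: drop in test score when one column is shuffled.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
permutation_scores = pd.Series(
    perm.importances_mean, index=X.columns
).sort_values(ascending=False)

print(impurity_importance)
print(permutation_scores)
```

Unlike regression weights, these scores carry no sign: they indicate how much the model relies on a column, not the direction of its effect.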

More likely than not, ensemble methods will outperform the linear regression model, so in my opinion they are your best bet.
