
Perform feature selection in Tensorflow

I am working on a Python project involving regression to predict some values. The input is a data set consisting of 70 features, which are a mix of categorical and ordinal variables. The dependent variable is continuous.

The inputs would be the data and the number of significant variables to keep.

I have a few questions, listed below.

1] Is there a way to perform feature selection using the forward selection technique in TensorFlow?

2] Are there alternatives to feature selection?

Problem

I have N features (N = 70, for example) and I want to select the top K features. (1) How could I do this in TensorFlow, and (2) what alternatives are there to feature selection?

Discussion

I will show one way to limit the number of features from N to at most K using a variant of L1 loss. As for alternatives to feature selection, there are many, depending on what you want to achieve. If you can go outside of TensorFlow, you could use a Decision Tree or a Random Forest and simply constrain the number of leaves so that at most K features are used. If you must use TensorFlow and you want an alternative to selecting the top K features that still regularizes your weights, you could use random dropout or L2 loss. Again, it really depends on what you want to achieve when looking for an alternative to selecting the top K features.
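
If leaving TensorFlow is acceptable, a common variant of the Random Forest route is to rank the features by importance and keep only the top K. The following is a minimal sketch, assuming scikit-learn is available; the random X and y are placeholders standing in for your real 70-feature data set.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for the real data set
X = np.random.rand(200, 70)
y = np.random.rand(200)

K = 10  # how many features to keep

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Rank features by impurity-based importance and keep the top K
top_k_idx = np.argsort(rf.feature_importances_)[::-1][:K]
X_selected = X[:, top_k_idx]
print("top", K, "feature indices:", top_k_idx)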

Solution for the top K of N features

Suppose our TensorFlow graph is defined as follows (this uses the TensorFlow 1.x API):

import tensorflow as tf

size = 4  # number of input features, N

x_in = tf.placeholder( shape=[None,size] , dtype=tf.float32 )
y_in = tf.placeholder( shape=[None] , dtype=tf.float32 )
l1_weight = tf.placeholder( shape=[] , dtype=tf.float32 )  # strength of the L1 penalty

# The weights start strictly positive; the ReLU clamps any weight that
# training pushes negative to exactly 0, and since the gradient of ReLU
# is 0 there, a zeroed feature stays switched off.
m = tf.Variable( tf.random_uniform( shape=[size,1] , minval=0.1 , maxval=0.9 ) , dtype=tf.float32 )
m = tf.nn.relu(m)
b = tf.Variable( [-10] , dtype=tf.float32 )

predict = tf.squeeze( tf.nn.xw_plus_b(x_in,m,b) )

# Mean squared error plus a feed-able L1 penalty on the weights
l1_loss = tf.reduce_sum(tf.abs(m))
loss = tf.reduce_mean( tf.square( y_in - predict ) ) + l1_loss * l1_weight

optimizer = tf.train.GradientDescentOptimizer(1e-4)
train = optimizer.minimize(loss)

# k counts how many feature weights are still non-zero
m_all_0 = tf.zeros( [size,1] , dtype=tf.float32 )
zerod_feature_count = tf.reduce_sum( tf.cast( tf.equal( m , m_all_0 ) , dtype=tf.float32 ) )
k = size - zerod_feature_count

Let's define some data and use it:

import numpy as np
data = np.array([[5.1,3.5,1.4,0.2,0],[4.9,3.0,1.4,0.2,0],[4.7,3.2,1.3,0.2,0],[4.6,3.1,1.5,0.2,0],[5.0,3.6,1.4,0.2,0],[5.4,3.9,1.7,0.4,0],[4.6,3.4,1.4,0.3,0],[5.0,3.4,1.5,0.2,0],[4.4,2.9,1.4,0.2,0],[4.9,3.1,1.5,0.1,0],[5.4,3.7,1.5,0.2,0],[4.8,3.4,1.6,0.2,0],[4.8,3.0,1.4,0.1,0],[4.3,3.0,1.1,0.1,0],[5.8,4.0,1.2,0.2,0],[5.7,4.4,1.5,0.4,0],[5.4,3.9,1.3,0.4,0],[5.1,3.5,1.4,0.3,0],[5.7,3.8,1.7,0.3,0],[5.1,3.8,1.5,0.3,0],[5.4,3.4,1.7,0.2,0],[5.1,3.7,1.5,0.4,0],[4.6,3.6,1.0,0.2,0],[5.1,3.3,1.7,0.5,0],[4.8,3.4,1.9,0.2,0],[5.0,3.0,1.6,0.2,0],[5.0,3.4,1.6,0.4,0],[5.2,3.5,1.5,0.2,0],[5.2,3.4,1.4,0.2,0],[4.7,3.2,1.6,0.2,0],[4.8,3.1,1.6,0.2,0],[5.4,3.4,1.5,0.4,0],[5.2,4.1,1.5,0.1,0],[5.5,4.2,1.4,0.2,0],[4.9,3.1,1.5,0.1,0],[5.0,3.2,1.2,0.2,0],[5.5,3.5,1.3,0.2,0],[4.9,3.1,1.5,0.1,0],[4.4,3.0,1.3,0.2,0],[5.1,3.4,1.5,0.2,0],[5.0,3.5,1.3,0.3,0],[4.5,2.3,1.3,0.3,0],[4.4,3.2,1.3,0.2,0],[5.0,3.5,1.6,0.6,0],[5.1,3.8,1.9,0.4,0],[4.8,3.0,1.4,0.3,0],[5.1,3.8,1.6,0.2,0],[4.6,3.2,1.4,0.2,0],[5.3,3.7,1.5,0.2,0],[5.0,3.3,1.4,0.2,0],[7.0,3.2,4.7,1.4,1],[6.4,3.2,4.5,1.5,1],[6.9,3.1,4.9,1.5,1],[5.5,2.3,4.0,1.3,1],[6.5,2.8,4.6,1.5,1],[5.7,2.8,4.5,1.3,1],[6.3,3.3,4.7,1.6,1],[4.9,2.4,3.3,1.0,1],[6.6,2.9,4.6,1.3,1],[5.2,2.7,3.9,1.4,1],[5.0,2.0,3.5,1.0,1],[5.9,3.0,4.2,1.5,1],[6.0,2.2,4.0,1.0,1],[6.1,2.9,4.7,1.4,1],[5.6,2.9,3.6,1.3,1],[6.7,3.1,4.4,1.4,1],[5.6,3.0,4.5,1.5,1],[5.8,2.7,4.1,1.0,1],[6.2,2.2,4.5,1.5,1],[5.6,2.5,3.9,1.1,1],[5.9,3.2,4.8,1.8,1],[6.1,2.8,4.0,1.3,1],[6.3,2.5,4.9,1.5,1],[6.1,2.8,4.7,1.2,1],[6.4,2.9,4.3,1.3,1],[6.6,3.0,4.4,1.4,1],[6.8,2.8,4.8,1.4,1],[6.7,3.0,5.0,1.7,1],[6.0,2.9,4.5,1.5,1],[5.7,2.6,3.5,1.0,1],[5.5,2.4,3.8,1.1,1],[5.5,2.4,3.7,1.0,1],[5.8,2.7,3.9,1.2,1],[6.0,2.7,5.1,1.6,1],[5.4,3.0,4.5,1.5,1],[6.0,3.4,4.5,1.6,1],[6.7,3.1,4.7,1.5,1],[6.3,2.3,4.4,1.3,1],[5.6,3.0,4.1,1.3,1],[5.5,2.5,4.0,1.3,1],[5.5,2.6,4.4,1.2,1],[6.1,3.0,4.6,1.4,1],[5.8,2.6,4.0,1.2,1],[5.0,2.3,3.3,1.0,1],[5.6,2.7,4.2,1.3,1],[5.7,3.0,4.2,1.2,1],[5.7,2.9,4.2,1.3,1],[6.2,2.9,4.3,1.3,1],[5.1,2.5,3.0,1.1,1],[5.7,2.8,4.1,1.3,1]])

x = data[:,0:4]  # the four feature columns
y = data[:,-1]   # the target is the last column

sess = tf.Session()
sess.run(tf.global_variables_initializer())

Let's define a reusable trial function that can test different weights for l1_loss:

def trial(weight=1):
    print( "initial m,loss", sess.run([m,loss],feed_dict={x_in:x, y_in:y, l1_weight:0}) )
    # First fit the data without any L1 penalty
    for _ in range(10000):
        sess.run(train,feed_dict={x_in:x, y_in:y, l1_weight:0})
    print( "after training m,loss", sess.run([m,loss],feed_dict={x_in:x, y_in:y, l1_weight:0}) )
    # Then switch the L1 penalty on and stop once at most 3 features remain
    for _ in range(10000):
        sess.run(train,feed_dict={x_in:x, y_in:y, l1_weight:weight})
        if sess.run(k) <= 3:
            break
    print( "after l1 loss m", sess.run([m,loss],feed_dict={x_in:x, y_in:y, l1_weight:weight}) )

Then we'll try it out:

print "The number of non-zero parameters is",sess.run(k)
print "Doing a training session"
trial()
print "The number of non-zero parameters is",sess.run(k)

And the results look good:

The number of non-zero parameters is 4.0
Doing a training session
initial m,loss [array([[0.75030506],
       [0.4959089 ],
       [0.646675  ],
       [0.44027993]], dtype=float32), 8.228347]
after training m,loss [array([[1.1898338 ],
       [1.0033115 ],
       [0.15164669],
       [0.16414128]], dtype=float32), 0.68957466]
after l1 loss m [array([[1.250356  ],
       [0.92532456],
       [0.10235767],
       [0.        ]], dtype=float32), 2.9621665]
The number of non-zero parameters is 3.0
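
As a follow-up, one way to read off which features survived is to evaluate m and take the indices of its non-zero entries. This is a small sketch reusing the session and graph defined above:

# Recover the indices of the surviving (non-zero) feature weights
m_values = sess.run(m).flatten()
selected = np.nonzero(m_values)[0]
print( "selected feature indices:", selected )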
