
Learning artificial neural network properties?

I have a data-set with around 50,000 properties per item (mostly values between 0 and 1, no discrete values at all).

The properties are not labelled, and are assumed to have no relation to each other. I know in advance that most properties are useless (99% of them).

My task is to use as few properties as possible in a neural network, such that it can differentiate between 5 item types.

In theory, I could just throw all 50K properties into the ANN and hope for the best, but it would take a huge amount of time to train, plus gigabytes of RAM, and I am not sure my server won't crash.

Is there a model that measures how much a single property contributes to classification?

If not, would the following be a good idea?

  • Go over all of my 50K parameters, and train 50K ANNs, each with <1, parameter>
  • Get the maximum-accuracy ANN and start again with 3 inputs: <1, previous-property, property>, and so on, until I reach an accuracy of 95%, and then stop

I see no reason it won't work, but training at least 10*50,000 ANNs is not ideal either. (A rough sketch of the loop I have in mind is below.)
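Something like this, just to make the greedy sweep concrete (scikit-learn, with random placeholder data; I use 200 columns instead of the real 50K here because the full sweep would be very slow, and all names are made up):

```python
# Rough sketch of the greedy per-feature ANN sweep described above.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 200))            # stand-in for the 60 x 50000 matrix
y = np.repeat(np.arange(5), 12)      # 5 classes, 12 examples each

def greedy_ann_selection(X, y, target_acc=0.95, max_features=10):
    remaining = list(range(X.shape[1]))
    chosen, best_acc = [], 0.0
    while remaining and best_acc < target_acc and len(chosen) < max_features:
        scores = []
        for f in remaining:
            # Train a tiny ANN on the already-chosen columns plus candidate f.
            clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500)
            acc = cross_val_score(clf, X[:, chosen + [f]], y, cv=3).mean()
            scores.append((acc, f))
        best_acc, best_f = max(scores)   # keep the best single addition
        chosen.append(best_f)
        remaining.remove(best_f)
    return chosen, best_acc

print(greedy_ann_selection(X, y))
```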

EDIT:

I have 12 examples per category, 60 items overall. (I am aware it is tiny, but I can't get more.)

Feature Selection

I would shy away from a neural network to solve this problem. If you are tied to the neural network idea, then it would be possible to plug your 50000 x 60 data matrix into the network, as this shouldn't take very much RAM at all. If you use an L1 regularizer and then analyze the weights of the network afterwards for all-zero entries, you can determine which features were not useful.
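A minimal sketch of that idea, assuming scikit-learn's L1-penalized logistic regression (which is effectively a one-layer network); the data below is a random placeholder for your real matrix and labels:

```python
# L1-penalized "one-layer network": zero weights mark useless features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((60, 50000))          # placeholder for the real data matrix
y = np.repeat(np.arange(5), 12)      # 5 classes, 12 examples each

clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
clf.fit(X, y)

# Columns whose weight is exactly zero for every class were judged useless.
useful = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
print(f"{useful.size} of {X.shape[1]} features kept by the L1 penalty")
```

Smaller values of `C` mean a stronger penalty and therefore fewer surviving features.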

There are numerous other feature selection approaches as well. For instance, the LASSO algorithm attempts to solve this problem in a very similar way to the above neural network approach.
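A sketch of a LASSO-based screen (note: LASSO is a regression method, so the 5 class labels are crudely integer-coded as a regression target here; the data is again a random placeholder):

```python
# LASSO screen: non-zero coefficients indicate potentially useful features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((60, 50000))                      # placeholder data
y = np.repeat(np.arange(5), 12).astype(float)    # class labels coded 0..4

lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # features with a non-zero weight
print(f"LASSO kept {kept.size} features; increase alpha to keep fewer")
```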

Another well-known algorithm is forward selection regression, where you perform a regression using only one property at a time. You then pick the property that best separates the classes, fix that property, then select again using two properties at a time (the best property from the last sweep, plus every other property, one at a time). You repeat this process until adding another property gives no better class separation. I would not be concerned with the time it takes to train this model if most features truly are useless. Using linear regression (as it has a closed-form solution) should take almost no time at all on a dataset of this size.
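Something along these lines, using scikit-learn's SequentialFeatureSelector as a stand-in for the manual sweep just described (here with a logistic model for the 5 classes rather than plain linear regression; the data and the stopping cap are placeholders):

```python
# Forward feature selection over the columns of X.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((60, 500))        # small stand-in; the real 50K-column sweep
y = np.repeat(np.arange(5), 12)  # is slower but can be parallelized (n_jobs)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,     # arbitrary cap; "auto" + tol also works
    direction="forward",
    cv=3,
)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```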

Feature Extraction

A much more principled approach would be some form of principal component analysis (PCA). This would show you how many collinear properties your dataset has, and would extract a small number of new properties to describe your data.
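A minimal PCA sketch with scikit-learn (placeholder data; with only 60 samples there can be at most 59 non-trivial components, so the extracted representation is tiny no matter how many raw properties you start from):

```python
# PCA-based feature extraction: project 50K raw properties onto a few
# new components and check how much variance they capture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 50000))                        # placeholder data

pca = PCA(n_components=30)                         # at most 59 with 60 rows
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                             # (60, 30)
print(pca.explained_variance_ratio_.cumsum()[-1])  # total variance captured
```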
