
Learning artificial neural network properties?

I have a data-set with around 50,000 properties per item (mostly values between 0 and 1, no discrete values at all).

The properties are not labelled, and are assumed to have no relation to each other. I know in advance that most properties are useless (99% of them).

My task is to use as few properties as possible in a neural network, such that it can differentiate between 5 item types.

In theory, I could just throw all 50K properties into the ANN and hope for the best, but it would take a huge amount of time to train, plus gigabytes of RAM, and I am not sure my server won't crash.

Is there a model that measures how much a single property contributes to classification?

If not, would the following be a good idea?

  • Go over all of my 50K parameters, and train 50K ANNs, each with <1, parameter>
  • Get the maximum-accuracy ANN and start again with 3 inputs: <1, previous-property, property>, and so on, until I reach an accuracy of 95%, and then stop

I see no reason it won't work, but training at least 10*50,000 ANNs is not ideal either. (A rough sketch of the loop I have in mind is below.)
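Something like this, just to make the greedy sweep concrete (scikit-learn, with random placeholder data; I use 200 columns instead of the real 50K here because the full sweep would be very slow, and all names are made up):

```python
# Rough sketch of the greedy per-feature ANN sweep described above.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 200))            # stand-in for the 60 x 50000 matrix
y = np.repeat(np.arange(5), 12)      # 5 classes, 12 examples each

def greedy_ann_selection(X, y, target_acc=0.95, max_features=10):
    remaining = list(range(X.shape[1]))
    chosen, best_acc = [], 0.0
    while remaining and best_acc < target_acc and len(chosen) < max_features:
        scores = []
        for f in remaining:
            # Train a tiny ANN on the already-chosen columns plus candidate f.
            clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500)
            acc = cross_val_score(clf, X[:, chosen + [f]], y, cv=3).mean()
            scores.append((acc, f))
        best_acc, best_f = max(scores)   # keep the best single addition
        chosen.append(best_f)
        remaining.remove(best_f)
    return chosen, best_acc

print(greedy_ann_selection(X, y))
```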

EDIT:

I have 12 examples per category, 60 items overall. (I am aware it is tiny, but I can't get more.)

Feature Selection

I would shy away from a neural network to solve this problem. If you are tied to the neural network idea, then it would be possible to plug your 50000 x 60 data matrix into the network, as this shouldn't take very much RAM at all. If you use an L1 regularizer and then analyze the weights of the network afterwards for all-zero entries, you can determine which features were not useful.
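A minimal sketch of that idea, assuming scikit-learn's L1-penalized logistic regression (which is effectively a one-layer network); the data below is a random placeholder for your real matrix and labels:

```python
# L1-penalized "one-layer network": zero weights mark useless features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((60, 50000))          # placeholder for the real data matrix
y = np.repeat(np.arange(5), 12)      # 5 classes, 12 examples each

clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
clf.fit(X, y)

# Columns whose weight is exactly zero for every class were judged useless.
useful = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
print(f"{useful.size} of {X.shape[1]} features kept by the L1 penalty")
```

Smaller values of `C` mean a stronger penalty and therefore fewer surviving features.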

There are numerous other feature selection approaches as well. For instance, the LASSO algorithm attempts to solve this problem in a very similar way to the above neural network approach.
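A sketch of a LASSO-based screen (note: LASSO is a regression method, so the 5 class labels are crudely integer-coded as a regression target here; the data is again a random placeholder):

```python
# LASSO screen: non-zero coefficients indicate potentially useful features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((60, 50000))                      # placeholder data
y = np.repeat(np.arange(5), 12).astype(float)    # class labels coded 0..4

lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
kept = np.flatnonzero(lasso.coef_)   # features with a non-zero weight
print(f"LASSO kept {kept.size} features; increase alpha to keep fewer")
```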

Another well-known algorithm is forward selection regression, where you perform a regression using only one property at a time. You then pick the property that best separates the classes, fix that property, then select again using two properties at a time (the best property from the last sweep, plus every other property, one at a time). You repeat this process until adding another property gives no better class separation. I would not be concerned with the time it takes to train this model if most features truly are useless. Using linear regression (as it has a closed-form solution) should take almost no time at all on a dataset of this size.
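Something along these lines, using scikit-learn's SequentialFeatureSelector as a stand-in for the manual sweep just described (here with a logistic model for the 5 classes rather than plain linear regression; the data and the stopping cap are placeholders):

```python
# Forward feature selection over the columns of X.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((60, 500))        # small stand-in; the real 50K-column sweep
y = np.repeat(np.arange(5), 12)  # is slower but can be parallelized (n_jobs)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,     # arbitrary cap; "auto" + tol also works
    direction="forward",
    cv=3,
)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```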

Feature Extraction

A much more principled approach would be some form of principal component analysis (PCA). This would show you how many collinear properties your dataset has, and would extract a small number of new properties to describe your data.
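A minimal PCA sketch with scikit-learn (placeholder data; with only 60 samples there can be at most 59 non-trivial components, so the extracted representation is tiny no matter how many raw properties you start from):

```python
# PCA-based feature extraction: project 50K raw properties onto a few
# new components and check how much variance they capture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 50000))                        # placeholder data

pca = PCA(n_components=30)                         # at most 59 with 60 rows
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                             # (60, 30)
print(pca.explained_variance_ratio_.cumsum()[-1])  # total variance captured
```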
