
Accord.Net - CacheSize on LibLinear

I'm attempting to classify some inputs (text classification: 10,000+ examples and 100,000+ features).

I've read that LibLinear is far faster and more memory-efficient for such tasks, so I've ported my LibSvm classifier to Accord.NET, like so:

        //SVM Settings
        var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
        {
            //Using LIBLINEAR's L2-loss SVC dual for each SVM
            Learner = (p) => new LinearDualCoordinateDescent<Linear, Sparse<double>>()
            {
                Loss = Loss.L2,
                Complexity = 1,
            }
        };

        var inputs = allTerms.Select(t => new Sparse<double>(t.Sentence.Select(s => s.Index).ToArray(), t.Sentence.Select(s => (double)s.Value).ToArray())).ToArray();

        var classes = allTerms.Select(t => t.Class).ToArray();

        //Train the model
        var model = teacher.Learn(inputs, classes);

At the point of .Learn() I get an instant OutOfMemoryException.

I've seen there's a CacheSize setting in the documentation; however, I cannot find where to lower this setting, as is shown in many examples.

One possible reason: I'm using the 'hash trick' instead of dictionary indices. Is Accord.Net attempting to allocate an array over the full hash space (probably close to int.MaxValue)? If so, is there any way to avoid this?
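For reference, a minimal sketch of what I mean by hashed indices (illustrative only, not my exact code; any stable string hash would behave the same way):

        //Illustrative only: the feature index is a hash of the token rather than a
        //dictionary position, so an index can land anywhere up to int.MaxValue.
        static int FeatureIndex(string token)
        {
            //Mask the sign bit to keep the index non-negative.
            return token.GetHashCode() & 0x7FFFFFFF;
        }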

Any help is most appreciated!

Allocating the hash space for 10,000+ documents with 100,000+ features will take at least 4 GB of memory, which may run into the AppDomain memory limit and the CLR object-size limit. Many projects are built with the 32-bit platform preference by default, which does not allow allocating objects larger than 2 GB. I managed to overcome this by removing the 32-bit platform preference (go to project properties -> Build and uncheck "Prefer 32-bit"). After that, to allow creation of objects taking more than 2 GB of memory, add this to your configuration file:

<runtime>
    <gcAllowVeryLargeObjects enabled="true" />
</runtime>

Be aware that if you add this setting but leave the 32-bit platform build preference enabled, you will still get an exception, as your process will not be able to allocate an array of that size.
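A quick way to double-check that the build is actually running as a 64-bit process (a small sketch using the standard Environment properties; adapt the output to your own logging):

    //Sanity check: gcAllowVeryLargeObjects only helps in a 64-bit process.
    Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
    Console.WriteLine("64-bit OS:      " + Environment.Is64BitOperatingSystem);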

This is how you tune the CacheSize:

    //SVM Settings
    var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
    {
        //SequentialMinimalOptimization exposes the CacheSize property
        Learner = (p) => new SequentialMinimalOptimization<Linear, Sparse<double>>()
        {
            CacheSize = 1000,
            Complexity = 1,
        }
    };

    var inputs = allTerms.Select(t => new Sparse<double>(t.Sentence.Select(s => s.Index).ToArray(), t.Sentence.Select(s => (double)s.Value).ToArray())).ToArray();

    var classes = allTerms.Select(t => t.Class).ToArray();

    //Train the model
    var model = teacher.Learn(inputs, classes);

This way of constructing an SVM does cope with the Sparse<double> data structure, but it is not using LibLinear. If you open the Accord.NET repository and look at the SVM solving algorithms with LibLinear support (LinearCoordinateDescent, LinearNewtonMethod), you will see no CacheSize property.
