Regression analyses in Accord.NET

I am currently working on a school project and have a rather unusual task. My job is to scrape data from a certain Facebook page and feed it into a learning model that takes one input, a List, and produces one output, an Int32.

First, let me briefly explain the steps I have already implemented:

  1. Scraped the data
  2. Stemmed it
  3. Removed capitalization, punctuation, emojis and spaces
  4. Merged words with the same root
  5. Counted the occurrences of each word and assigned the count to that word
  6. Performed a tf-idf calculation to extract the weight of every word in every post

Now I have a Dictionary<String,List<double[],int>> , which represents

postId:[wordWeights],amountOfLikes

for example:

23425234_35242352:[0.027,0.031,0.009,0.01233],89
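
A minimal sketch of steps 5 and 6 (counting and tf-idf weighting), assuming each post has already been cleaned and stemmed into a string[] of tokens. The class and method names are hypothetical, and the idf variant used here (natural log of N / df, no smoothing) is one common choice, not necessarily the one used above. The point that matters for regression is that every post gets a vector over the same vocabulary, so all weight vectors have the same length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    // Hypothetical helper: returns the shared vocabulary and one weight vector per post.
    public static (string[] Vocabulary, double[][] Weights) Compute(IList<string[]> posts)
    {
        // Fixed, sorted vocabulary so that column j means the same word in every post.
        string[] vocabulary = posts.SelectMany(p => p)
                                   .Distinct()
                                   .OrderBy(w => w, StringComparer.Ordinal)
                                   .ToArray();
        var index = new Dictionary<string, int>();
        for (int j = 0; j < vocabulary.Length; j++) index[vocabulary[j]] = j;

        // Document frequency: in how many posts each word appears at least once.
        var df = new int[vocabulary.Length];
        foreach (var post in posts)
            foreach (var word in post.Distinct())
                df[index[word]]++;

        var weights = new double[posts.Count][];
        for (int d = 0; d < posts.Count; d++)
        {
            var row = new double[vocabulary.Length];
            foreach (var word in posts[d]) row[index[word]] += 1.0; // raw counts

            for (int j = 0; j < row.Length; j++)
            {
                double tf = posts[d].Length == 0 ? 0.0 : row[j] / posts[d].Length;
                double idf = Math.Log((double)posts.Count / df[j]); // df[j] >= 1 by construction
                row[j] = tf * idf;
            }
            weights[d] = row;
        }
        return (vocabulary, weights);
    }
}
```

With this weighting, a word that occurs in every post gets a weight of 0, because its idf is log(1).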

I have to train my model on different posts and their like counts. For this purpose I chose the Accord.NET library for C#, and so far I have looked at its Simple Linear Regression class.

First, I saw that I can use OrdinaryLeastSquares and feed it inputs and outputs like this:

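// One post: three word weights as inputs, its like count padded with zeros
double[] inputs = { 0.123, 0.23, 0.09 };
double[] outputs = { 98, 0, 0 };
OrdinaryLeastSquares ols = new OrdinaryLeastSquares();
SimpleLinearRegression regression = ols.Learn(inputs, outputs);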

As you can see, the number of inputs in the array has to match the number of outputs, so I padded the outputs with zeros. As a result, I got obviously wrong output. I cannot come up with a proper way of feeding my data to the linear regression class. I know that padding the array with zeros is wrong, but so far it is the only solution I have come up with. I would appreciate it if anyone could tell me how I should use regression in this case and help me choose a proper algorithm. Cheers!

After browsing the regression algorithms in Accord.NET, I settled on FanChenLinSupportVectorRegression , which is part of the Accord.NET Machine Learning library. As far as I can tell, the name refers to Fan, Chen and Lin, the three authors of the optimization method it is based on, rather than to a single person.

The algorithm is based on support vector regression (SVR), the regression counterpart of support vector machines (SVM).

The class is generic, FanChenLinSupportVectorRegression<TKernel> , and its Kernel property gets or sets the kernel function used to create a kernel support vector machine. If this property is set, UseKernelEstimation will be set to false.

The Learn function takes as its first argument an array of arrays of doubles (in our case, the word weights of each post) and as its second argument an array of doubles holding the like counts.

IMPORTANT: each sub-array of weights MUST line up with its like count in the second argument: the first sub-array has its like count at index [0] of the likes array, the second sub-array at index [1], and so on.

Example:

using Accord.MachineLearning.VectorMachines.Learning;
using Accord.Statistics.Kernels;

// Suppose these are posts with their tf-idf weights (one row per post)
double[][] inputs =
{
  new[] { 3.0, 1.0 },
  new[] { 7.0, 1.0 },
  new[] { 3.0, 1.0 },
  new[] { 3.0, 2.0 },
  new[] { 6.0, 1.0 },
};
// Number of likes each corresponding post scored (outputs[i] belongs to inputs[i])
double[] outputs = { 2.0, 3.0, 4.0, 11.0, 6.0 };
// Create the learning algorithm with a Gaussian kernel
var model = new FanChenLinSupportVectorRegression<Gaussian>();
// Train on the tf-idf weights of each post and the corresponding like counts
var svm = model.Learn(inputs, outputs);
// Run a sample tf-idf input to get a predicted like count
double result = svm.Score(new double[] { 2.0, 6.0 });
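
To keep the two arrays index-aligned as required above, they can be built in a single pass over the per-post data. This is only a sketch: the TrainingData class and FromPosts method are hypothetical names, and the dictionary is assumed to map each postId to a tuple of (weights, likes), which is one way to express the structure described in the question. It uses plain C# and no Accord.NET calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrainingData
{
    // Hypothetical helper: flattens postId -> (weights, likes) into two arrays
    // where inputs[i] and outputs[i] describe the same post.
    public static (double[][] Inputs, double[] Outputs) FromPosts(
        IDictionary<string, (double[] Weights, int Likes)> posts)
    {
        // Materialize once so both arrays come from the same enumeration order.
        var rows = posts.Values.ToList();

        // Every post must have a weight vector of the same length.
        int length = rows.Count == 0 ? 0 : rows[0].Weights.Length;
        if (rows.Any(r => r.Weights.Length != length))
            throw new ArgumentException("All weight vectors must have the same length.");

        double[][] inputs = rows.Select(r => r.Weights).ToArray();
        double[] outputs = rows.Select(r => (double)r.Likes).ToArray();
        return (inputs, outputs);
    }
}
```

The returned Inputs and Outputs can then be passed to Learn(inputs, outputs) in place of the hard-coded arrays in the example.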

I have tested this model with the same input values swapped around, and the results were pretty good and accurate. The model works well on large inputs too, but it requires more training. Hope this helps someone in the future.
