How can I calculate a Spearman correlation coefficient with Spark? I am unable to reproduce an example from a statistics book

To train myself with Spark and classical statistical analysis, I'm trying to run some of the examples given in books (general statistics books, not dedicated to computing or Spark).

The example in the book computes the Spearman correlation coefficient of the marks that two judges gave to ten athletes:

| Judge 1 | 8.3 | 7.6 | 9.1 | 9.5 | 8.4 | 6.9 | 9.2 | 7.8 | 8.6 | 8.2 |
| Judge 2 | 7.9 | 7.4 | 9.1 | 9.3 | 8.4 | 7.5 | 9.0 | 7.2 | 8.2 | 8.1 |

Creating the intermediate matrix of ranks,

| Judge 1 | 5 | 2 | 8 | 10 | 6 | 1 | 9 | 3 | 7 | 4 |
| Judge 2 | 4 | 2 | 9 | 10 | 7 | 3 | 8 | 1 | 6 | 5 |

the book's example arrives at the result:

r = 0.915
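
The book's figure can be checked by hand before involving Spark. The following is a minimal plain-Java sketch, assuming there are no tied marks within a single judge's scores (which holds for this data). It ranks the marks and applies the classic formula r = 1 - 6 Σd² / (n(n² - 1)), where d is the rank difference for each athlete. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

class SpearmanByHand {
    // Rank of each value within its array (1 = smallest).
    // Assumption: no ties; tied values would need averaged ranks.
    static int[] ranks(double[] values) {
        int n = values.length;
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int rank = 1;
            for (int j = 0; j < n; j++) {
                if (values[j] < values[i]) {
                    rank++;
                }
            }
            result[i] = rank;
        }
        return result;
    }

    // Spearman coefficient via the rank-difference formula (valid without ties).
    static double spearman(double[] x, double[] y) {
        int[] rx = ranks(x);
        int[] ry = ranks(y);
        int n = x.length;
        double sumSquaredDiff = 0.0;
        for (int i = 0; i < n; i++) {
            double d = rx[i] - ry[i];
            sumSquaredDiff += d * d;
        }
        return 1.0 - 6.0 * sumSquaredDiff / (n * ((double) n * n - 1.0));
    }

    public static void main(String[] args) {
        double[] judge1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] judge2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        System.out.println(Arrays.toString(ranks(judge1))); // [5, 2, 8, 10, 6, 1, 9, 3, 7, 4]
        System.out.println(Arrays.toString(ranks(judge2))); // [4, 2, 9, 10, 7, 3, 8, 1, 6, 5]
        // Sum of squared rank differences is 14, so r = 1 - 84/990.
        System.out.println(String.format(Locale.ROOT, "r = %.3f", spearman(judge1, judge2))); // r = 0.915
    }
}
```

This reproduces both the rank table and the book's value, so any discrepancy in the Spark output has to come from how the data is fed to Spark rather than from the statistics.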

I tried to implement it with Spark as follows, based on the API documentation of Correlation:

List<Row> data = Arrays.asList(
   RowFactory.create(Vectors.dense(8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2)),
   RowFactory.create(Vectors.dense(7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1))
);

StructType schema = new StructType(new StructField[]{
   new StructField("features", new VectorUDT(), false, Metadata.empty()),
});

Dataset<Row> df = this.session.createDataFrame(data, schema);

Row r2 = Correlation.corr(df, "features", "spearman").head();
System.out.println("Spearman correlation matrix:\n" + r2.get(0).toString());

But it doesn't return a single coefficient. Instead, I get another matrix that looks odd to me:

Spearman correlation matrix:
1.0                  0.9999999999999998   NaN  ... (10 total)
0.9999999999999998   1.0                  NaN  ...
NaN                  NaN                  1.0  ...
0.9999999999999998   0.9999999999999998   NaN  ...
NaN                  NaN                  NaN  ...
-0.9999999999999998  -0.9999999999999998  NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...
0.9999999999999998   0.9999999999999998   NaN  ...

I am new to MLlib and not very strong in statistics. Clearly, I'm doing something wrong.

What am I seeing here instead of what I expected, and how can I get the result I want?

Part of the solution is embarrassing...
I had simply laid out the Vectors the wrong way round: each Row must hold one athlete's pair of marks (one component per judge), not one judge's ten marks. The corrected data:

List<Row> data = Arrays.asList(
   RowFactory.create(Vectors.dense(8.3, 7.9)),
   RowFactory.create(Vectors.dense(7.6, 7.4)),
   RowFactory.create(Vectors.dense(9.1, 9.1)),
   RowFactory.create(Vectors.dense(9.5, 9.3)),
   RowFactory.create(Vectors.dense(8.4, 8.4)),
   RowFactory.create(Vectors.dense(6.9, 7.5)),
   RowFactory.create(Vectors.dense(9.2, 9.0)),
   RowFactory.create(Vectors.dense(7.8, 7.2)),
   RowFactory.create(Vectors.dense(8.6, 8.2)),
   RowFactory.create(Vectors.dense(8.2, 8.1))
);

Correlation between the two judges' marks for the athletes:
1.0                 0.9151515151515153
0.9151515151515153  1.0
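
As for what the first 10x10 matrix was showing: with the original layout, each of the ten vector components was treated as a separate variable, observed only twice (once per Row). With two observations, a rank correlation can only be +1 (both variables move the same way), -1 (they move in opposite directions), or NaN (one of them does not change, as for the 9.1/9.1 and 8.4/8.4 pairs). The plain-Java sketch below reproduces the first column of that matrix under this reading. It illustrates the idea only and is not Spark's implementation; the class and method names are hypothetical.

```java
import java.util.Arrays;

class TwoRowCorrelation {
    // Rank correlation of two variables a and b that were each observed twice.
    // The ranks can only be (1,2), (2,1) or tied, so the result is +1, -1,
    // or NaN when either variable is constant (zero variance).
    static double corr(double a1, double a2, double b1, double b2) {
        double directionA = Math.signum(a2 - a1);
        double directionB = Math.signum(b2 - b1);
        if (directionA == 0.0 || directionB == 0.0) {
            return Double.NaN;
        }
        return directionA * directionB;
    }

    // Correlation of the first component with every component,
    // i.e. the first column of the 10x10 matrix.
    static double[] firstColumnAgainstAll(double[] row1, double[] row2) {
        double[] out = new double[row1.length];
        for (int j = 0; j < row1.length; j++) {
            out[j] = corr(row1[0], row2[0], row1[j], row2[j]);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two Rows exactly as they were passed to Spark in the first attempt.
        double[] row1 = {8.3, 7.6, 9.1, 9.5, 8.4, 6.9, 9.2, 7.8, 8.6, 8.2};
        double[] row2 = {7.9, 7.4, 9.1, 9.3, 8.4, 7.5, 9.0, 7.2, 8.2, 8.1};
        System.out.println(Arrays.toString(firstColumnAgainstAll(row1, row2)));
        // [1.0, 1.0, NaN, 1.0, NaN, -1.0, 1.0, 1.0, 1.0, 1.0]
    }
}
```

The pattern matches the first column of the odd matrix, up to the 0.9999999999999998 rounding in Spark's output. With the corrected layout there are two variables and ten observations, so the result is a 2x2 matrix whose off-diagonal entry is the coefficient from the book.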
