
Reshape Spark DataFrame from Long to Wide On Large Data Sets

I am trying to reshape my dataframe from long to wide using the Spark DataFrame API. The data set is a collection of questions and answers from a student questionnaire. It is a huge data set, and Q (question) and A (answer) each range roughly from 1 to 50,000. I would like to collect all the possible Q*A pairs and use them to build columns. If a student answered 1 to question 1, we assign the value 1 to column 1_1; otherwise, we give it a 0. The data set has been de-duplicated on S_ID, Q, A.

In R, I can simply use dcast from the reshape2 library, but I don't know how to do it using Spark. I have found a solution for pivoting in the link below, but it requires a fixed number of distinct Q*A pairs: http://rajasoftware.net/index.php/database/91446/scala-apache-spark-pivot-dataframes-pivot-spark-dataframe

I also tried concatenating Q and A with a user-defined function and then applying crosstab. However, I got the error below from the console, even though so far I have only tested my code on a sample data file:

The maximum limit of 1e6 pairs have been collected, which may not be all of the pairs.
Please try reducing the amount of distinct items in your columns.

Original Data:

S_ID, Q, A
1, 1, 1
1, 2, 2
1, 3, 3
2, 1, 1
2, 2, 3
2, 3, 4
2, 4, 5

=> After long-to-wide transformation:

S_ID, QA_1_1, QA_2_2, QA_3_3, QA_2_3, QA_3_4, QA_4_5
1, 1, 1, 1, 0, 0, 0
2, 1, 0, 0, 1, 1, 1

R code.  
library(dplyr); library(reshape2);  
df1 <- df %>% group_by(S_ID, Q, A) %>% filter(row_number()==1) %>% mutate(temp=1)  
df1 %>% dcast(S_ID ~ Q + A, value.var="temp", fill=0)  

Spark code.
val fnConcatenate = udf((x: String, y: String) => {"QA_"+ x +"_" + y})
val df1 = df.distinct.withColumn("QA", fnConcatenate($"Q", $"A"))
val df2 = df1.stat.crosstab("S_ID", "QA")
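
As a quick diagnostic (a sketch, since crosstab materializes one column per distinct value and caps at 1e6 pairs), one can count how many distinct QA pairs the data actually produces before pivoting:

// Sketch: if this count approaches 1e6, crosstab will hit its limit
val nDistinctPairs = df1.select("QA").distinct.count
println(s"distinct QA pairs: $nDistinctPairs")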

Any thoughts would be appreciated.

What you are trying to do here is faulty by design, for two reasons:

  1. You replace a sparse data set with a dense one. This is expensive both in memory requirements and in computation, and it is almost never a good idea when you have a large dataset.
  2. You limit the ability to process data locally. Simplifying things a little bit, Spark DataFrames are just wrappers around RDD[Row]. That means the larger the row, the fewer rows you can place on a single partition; as a consequence, operations like aggregations are much more expensive and require more network traffic.

Wide tables are useful when you have proper columnar storage, where you can implement things like efficient compression or aggregation. From a practical perspective, almost everything you can do with a wide table can be done with a long one using group / window functions.
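
For example, here is a minimal sketch of answering typical "wide" questions directly on the long table (df and the S_ID / Q / A columns as in the question; the two queries below are just illustrations):

import org.apache.spark.sql.functions.{col, countDistinct}

// How many distinct questions did each student answer?
val answeredPerStudent = df.groupBy("S_ID").agg(countDistinct("Q").alias("n_questions"))

// Which students gave answer 1 to question 1 (the would-be QA_1_1 column)?
val qa11 = df.filter(col("Q") === 1 && col("A") === 1).select("S_ID").distinct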

One thing you can try is to use a sparse vector to create a wide-like format:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.max
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.StringIndexer
import sqlContext.implicits._

df.registerTempTable("df")
val dfComb = sqlContext.sql("SELECT s_id, CONCAT(Q, '\t', A) AS qa FROM df")

val indexer = new StringIndexer()
  .setInputCol("qa")
  .setOutputCol("idx")
  .fit(dfComb)

val indexed = indexer.transform(dfComb)

val n = indexed.agg(max("idx")).first.getDouble(0).toInt + 1

val wideLikeDF = indexed
  .select($"s_id", $"idx")
  .rdd
  .map{case Row(s_id: String, idx: Double) => (s_id, idx.toInt)}
  .groupByKey // This assumes no duplicates
  .mapValues(vals => Vectors.sparse(n, vals.map((_, 1.0)).toArray))
  .toDF("id", "qaVec")

The cool part here is that you can easily convert it to an IndexedRowMatrix and, for example, compute an SVD:

import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val mat = new IndexedRowMatrix(wideLikeDF.map{
  // Here we assume that s_id can be mapped directly to Long
  // If not it has to be indexed
  case Row(id: String, qaVec: SparseVector) => IndexedRow(id.toLong, qaVec)
})

val svd = mat.computeSVD(3)

or to a RowMatrix to get column statistics or compute principal components:

val colStats = mat.toRowMatrix.computeColumnSummaryStatistics
val colSims = mat.toRowMatrix.columnSimilarities
val pc = mat.toRowMatrix.computePrincipalComponents(3)

Edit:

In Spark 1.6.0+ you can use the pivot function.
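
A minimal sketch of that route (same df as above; the column names follow the question, and note it still builds the dense wide table warned about earlier):

import org.apache.spark.sql.functions.{col, concat_ws, count, lit}

// Spark 1.6+: pivot the long table directly on the concatenated QA label
val wideDF = df
  .withColumn("QA", concat_ws("_", lit("QA"), col("Q"), col("A")))
  .groupBy("S_ID")
  .pivot("QA")
  .agg(count(lit(1)))
  .na.fill(0)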
