
How to aggregate a Spark data frame to get a sparse vector using Scala?

I have a data frame like the one below in Spark, and I want to group it by the id column and then for each line in the grouped data I need to create a sparse vector with elements from the weight column at indices specified by the index column. The length of the sparse vector is known, say 1000 for this example.

Dataframe df:

+-----+------+-----+
|   id|weight|index|
+-----+------+-----+
|11830|     1|    8|
|11113|     1|    3|
| 1081|     1|    3|
| 2654|     1|    3|
|10633|     1|    3|
|11830|     1|   28|
|11351|     1|   12|
| 2737|     1|   26|
|11113|     3|    2|
| 6590|     1|    2|
+-----+------+-----+

I have read this which is sort of similar to what I want to do, but for an RDD. Does anyone know of a good way to do this for a data frame in Spark using Scala?

My attempt so far is to first collect the weights and indices as lists like this:

import org.apache.spark.sql.functions.collect_list

val dfWithLists = df
    .groupBy("id")
    .agg(collect_list("weight") as "weights", collect_list("index") as "indices")

which looks like:

+-----+---------+----------+
|   id|  weights|   indices|
+-----+---------+----------+
|11830|   [1, 1]|   [8, 28]|
|11113|   [1, 3]|    [3, 2]|
| 1081|      [1]|       [3]|
| 2654|      [1]|       [3]|
|10633|      [1]|       [3]|
|11351|      [1]|      [12]|
| 2737|      [1]|      [26]|
| 6590|      [1]|       [2]|
+-----+---------+----------+

Then I define a udf and do something like this:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

def toSparseVector: ((Array[Int], Array[BigInt]) => Vector) = {(a1, a2) => Vectors.sparse(1000, a1, a2.map(x => x.toDouble))}
val udfToSparseVector = udf(toSparseVector)

val dfWithSparseVector = dfWithLists.withColumn("SparseVector", udfToSparseVector($"indices", $"weights"))

but this doesn't seem to work, and it feels like there should be an easier way to do it without needing to collect the weights and indices into lists first.

I'm pretty new to Spark, Dataframes and Scala, so any help is highly appreciated.

You have to collect them, as vectors must be local (single machine): https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector

For creating the sparse vectors you have two options, using unordered (index, value) pairs or specifying the indices and values arrays: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$
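For example, here is a minimal sketch of the pairs option applied to your collected lists (my assumption: Spark 2.x, where collect_list columns arrive in a UDF as Seq[Int] rather than Array[Int] or Array[BigInt], which is why the original UDF signature fails):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Sketch: accept the Seq columns produced by collect_list and build the
// vector from unordered (index, value) pairs, which Vectors.sparse sorts.
val toSparseVector = udf { (indices: Seq[Int], weights: Seq[Int]) =>
  Vectors.sparse(1000, indices.zip(weights.map(_.toDouble)))
}

val dfWithSparseVector =
  dfWithLists.withColumn("SparseVector", toSparseVector($"indices", $"weights"))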

If you can get the data into a different format (pivoted), you could also make use of the VectorAssembler: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
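For instance, a hypothetical pivot-based sketch (the names pivoted and assembled are mine; note that VectorAssembler comes from the newer spark.ml package and only covers the index values actually present in the data, not a fixed length of 1000):

import org.apache.spark.ml.feature.VectorAssembler

// One column per distinct index value, summing weights per (id, index)
val pivoted = df
  .groupBy("id")
  .pivot("index")
  .sum("weight")
  .na.fill(0)  // absent (id, index) combinations become 0

val assembler = new VectorAssembler()
  .setInputCols(pivoted.columns.filter(_ != "id"))
  .setOutputCol("features")

val assembled = assembler.transform(pivoted)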

With some small tweaks you can get your approach working:

:paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val df = Seq((11830,1,8), (11113, 1, 3), (1081, 1,3), (2654, 1, 3), (10633, 1, 3), (11830, 1, 28), (11351, 1, 12), (2737, 1, 26), (11113, 3, 2), (6590, 1, 2)).toDF("id", "weight", "index")

val dfWithFeat = df
  .rdd
  .map(r => (r.getInt(0), (r.getInt(2), r.getInt(1).toDouble)))   // (id, (index, weight))
  .groupByKey()                                                    // all (index, weight) pairs per id
  .map(r => LabeledPoint(r._1, Vectors.sparse(1000, r._2.toSeq))) // unordered pairs -> sparse vector
  .toDS

dfWithFeat.printSchema
dfWithFeat.show(10, false)


// Exiting paste mode, now interpreting.

root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)

+-------+-----------------------+
|label  |features               |
+-------+-----------------------+
|11113.0|(1000,[2,3],[3.0,1.0]) |
|2737.0 |(1000,[26],[1.0])      |
|10633.0|(1000,[3],[1.0])       |
|1081.0 |(1000,[3],[1.0])       |
|6590.0 |(1000,[2],[1.0])       |
|11830.0|(1000,[8,28],[1.0,1.0])|
|2654.0 |(1000,[3],[1.0])       |
|11351.0|(1000,[12],[1.0])      |
+-------+-----------------------+

dfWithFeat: org.apache.spark.sql.Dataset[org.apache.spark.mllib.regression.LabeledPoint] = [label: double, features: vector]
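If you would rather keep the integer id than coerce it into a LabeledPoint's Double label, a small variant of the same pipeline (my sketch, under the same Spark 2.x assumption) produces (id, features) rows instead:

// Same grouping, but keeping the id column alongside the vector
val dfWithVec = df
  .rdd
  .map(r => (r.getInt(0), (r.getInt(2), r.getInt(1).toDouble)))
  .groupByKey()
  .map { case (id, pairs) => (id, Vectors.sparse(1000, pairs.toSeq)) }
  .toDF("id", "features")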
