
How to shuffle a sparse vector in Spark using Scala

I have a sparse vector in Spark and I want to randomly shuffle (reorder) its contents. The vector is actually a tf-idf vector, and I want to reorder it so that in my new dataset the features appear in a different order. Is there any way to do this using Scala? This is my code for generating the tf-idf vectors:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(data).cache()
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData).cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData).cache()

Perhaps this is useful.

Load the test data

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)
df.printSchema()

    /**
      * +---------------------+
      * |features             |
      * +---------------------+
      * |(5,[1,3],[1.0,7.0])  |
      * |[2.0,0.0,3.0,4.0,5.0]|
      * |[4.0,0.0,0.0,6.0,7.0]|
      * +---------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      */

Shuffle the vector

import scala.collection.mutable

val shuffleVector = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray)
)

val p = df.withColumn("shuffled_vector", shuffleVector($"features"))
p.show(false)
p.printSchema()

    /**
      * +---------------------+---------------------+
      * |features             |shuffled_vector      |
      * +---------------------+---------------------+
      * |(5,[1,3],[1.0,7.0])  |[1.0,0.0,0.0,0.0,7.0]|
      * |[2.0,0.0,3.0,4.0,5.0]|[0.0,3.0,2.0,5.0,4.0]|
      * |[4.0,0.0,0.0,6.0,7.0]|[4.0,7.0,6.0,0.0,0.0]|
      * +---------------------+---------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      * |-- shuffled_vector: vector (nullable = true)
      */
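Note that `Random.shuffle` inside the UDF draws an independent random order for every row. If the goal is to give the whole dataset a new feature order (so that column i still means the same feature in every document), one fixed permutation should be generated once and reused for all rows. A minimal plain-Scala sketch of that per-row logic, stripped of Spark (the names `perm` and `permute` are illustrative, not a Spark API):

```scala
import scala.util.Random

// One fixed permutation of the feature indices, generated once and reused for
// every row, so all rows end up with the same new feature order.
val numFeatures = 5
val rng = new Random(42)                 // fixed seed: reproducible reordering
val perm: Array[Int] = rng.shuffle((0 until numFeatures).toList).toArray

// Apply the same permutation to a densified vector: out(i) = in(perm(i)).
def permute(values: Array[Double]): Array[Double] =
  perm.map(i => values(i))

val rows = Seq(
  Array(0.0, 1.0, 0.0, 7.0, 0.0),        // the sparse row (5,[1,3],[1.0,7.0]) densified
  Array(2.0, 0.0, 3.0, 4.0, 5.0)
)
val permuted = rows.map(permute)
permuted.foreach(r => println(r.mkString(",")))
```

Inside the Spark UDF, `vector.toArray` would take the place of each row array, with `perm` captured from the driver.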

You can also use the above UDF to create a custom `Transformer` and put it in a pipeline.

Please make sure to use `import org.apache.spark.ml.linalg._`.
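One way to wrap the shuffle logic as a pipeline stage is to extend Spark ML's `UnaryTransformer`. A minimal sketch, assuming a Spark runtime on the classpath (the class name `VectorShuffler` and its uid prefix are illustrative, not part of any Spark API):

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector, Vectors}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.DataType

// A vector-to-vector Transformer that randomly reorders the values of each
// input vector, so it can sit in a Pipeline after the IDF stage.
class VectorShuffler(override val uid: String)
    extends UnaryTransformer[Vector, Vector, VectorShuffler] {

  def this() = this(Identifiable.randomUID("vectorShuffler"))

  // Same logic as the UDF above: densify, shuffle the values, rebuild a vector.
  override protected def createTransformFunc: Vector => Vector =
    v => Vectors.dense(scala.util.Random.shuffle(v.toArray.toSeq).toArray)

  override protected def outputDataType: DataType = SQLDataTypes.VectorType
}
```

It would then be used like any other stage, e.g. `new VectorShuffler().setInputCol("features").setOutputCol("shuffled_vector")` inside a `Pipeline`.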

Update-1: convert the shuffled vector to a sparse vector

val shuffleVectorToSparse = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray).toSparse
)

val p1 = df.withColumn("shuffled_vector", shuffleVectorToSparse($"features"))
p1.show(false)
p1.printSchema()

    /**
      * +---------------------+-------------------------------+
      * |features             |shuffled_vector                |
      * +---------------------+-------------------------------+
      * |(5,[1,3],[1.0,7.0])  |(5,[0,3],[1.0,7.0])            |
      * |[2.0,0.0,3.0,4.0,5.0]|(5,[1,2,3,4],[5.0,3.0,2.0,4.0])|
      * |[4.0,0.0,0.0,6.0,7.0]|(5,[1,3,4],[7.0,4.0,6.0])      |
      * +---------------------+-------------------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      * |-- shuffled_vector: vector (nullable = true)
      */
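For context, `.toSparse` keeps only the non-zero entries of the dense result, which is why the shuffled sparse rows above store fewer values than they have slots. A plain-Scala sketch of that conversion (the helper `toSparsePairs` is illustrative, not a Spark API):

```scala
// Plain-Scala illustration of what Vector.toSparse does: keep only the
// non-zero entries of a dense value array as (index, value) pairs.
def toSparsePairs(values: Array[Double]): Seq[(Int, Double)] =
  values.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }.toSeq

// The densified first row after one possible shuffle: the three zeros are
// dropped, leaving only two stored entries.
println(toSparsePairs(Array(1.0, 0.0, 0.0, 0.0, 7.0)))
```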
