How to shuffle a sparse vector in Spark using Scala
I have a sparse vector in Spark and I want to randomly shuffle (reorder) its contents. This vector is actually a tf-idf vector, and I want to reorder it so that the features in my new dataset are in a different order. Is there any way to do this using Scala? This is my code for generating the tf-idf vectors:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(data).cache()
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData).cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData).cache()
Perhaps this is useful:
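import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import scala.collection.mutable
import spark.implicits._ // enables the $"column" syntax

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)
df.printSchema()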
/**
* +---------------------+
* |features |
* +---------------------+
* |(5,[1,3],[1.0,7.0]) |
* |[2.0,0.0,3.0,4.0,5.0]|
* |[4.0,0.0,0.0,6.0,7.0]|
* +---------------------+
*
* root
* |-- features: vector (nullable = true)
*/
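// Densify the vector, shuffle its entries at random, and return a dense vector.
// Each row is shuffled independently, so rows do not share one permutation.
val shuffleVector = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray)
)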
val p = df.withColumn("shuffled_vector", shuffleVector($"features"))
p.show(false)
p.printSchema()
/**
* +---------------------+---------------------+
* |features |shuffled_vector |
* +---------------------+---------------------+
* |(5,[1,3],[1.0,7.0]) |[1.0,0.0,0.0,0.0,7.0]|
* |[2.0,0.0,3.0,4.0,5.0]|[0.0,3.0,2.0,5.0,4.0]|
* |[4.0,0.0,0.0,6.0,7.0]|[4.0,7.0,6.0,0.0,0.0]|
* +---------------------+---------------------+
*
* root
* |-- features: vector (nullable = true)
* |-- shuffled_vector: vector (nullable = true)
*/
You can also use the above udf to create a Transformer and put it in a pipeline.
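As a rough illustration of the Transformer idea, here is a minimal sketch that wraps the same per-row shuffle in a custom Transformer. The class name VectorShuffler and its two constructor arguments are made up for this example and are not part of Spark. It also assumes Spark 2.3 or later for asNondeterministic().

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper class: shuffles the entries of each vector in inputCol
// and writes the result to outputCol as a sparse vector.
class VectorShuffler(override val uid: String, inputCol: String, outputCol: String)
    extends Transformer {

  def this(inputCol: String, outputCol: String) =
    this(Identifiable.randomUID("vectorShuffler"), inputCol, outputCol)

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Same logic as the udf above; marked nondeterministic because it uses Random.
    val shuffle = udf { (v: Vector) =>
      val shuffled: Vector =
        Vectors.dense(scala.util.Random.shuffle(v.toArray.toSeq).toArray).toSparse
      shuffled
    }.asNondeterministic()
    dataset.withColumn(outputCol, shuffle(col(inputCol)))
  }

  // The output schema is the input schema plus one vector column.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField(outputCol, VectorType, nullable = true))

  override def copy(extra: ParamMap): VectorShuffler =
    new VectorShuffler(uid, inputCol, outputCol)
}
```

An instance such as new VectorShuffler("features", "shuffled_vector") could then be appended to the stages of a Pipeline after the IDF stage. This sketch skips Params and persistence, so it is only a starting point.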
Please make sure to use import org.apache.spark.ml.linalg._ rather than org.apache.spark.mllib.linalg._, since the spark.ml feature transformers produce ml.linalg vectors.
// Same shuffle as before, but the result is converted back to a sparse vector.
val shuffleVectorToSparse = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(mutable.WrappedArray.make[Double](vector.toArray)).toArray).toSparse
)
val p1 = df.withColumn("shuffled_vector", shuffleVectorToSparse($"features"))
p1.show(false)
p1.printSchema()
/**
* +---------------------+-------------------------------+
* |features |shuffled_vector |
* +---------------------+-------------------------------+
* |(5,[1,3],[1.0,7.0]) |(5,[0,3],[1.0,7.0]) |
* |[2.0,0.0,3.0,4.0,5.0]|(5,[1,2,3,4],[5.0,3.0,2.0,4.0])|
* |[4.0,0.0,0.0,6.0,7.0]|(5,[1,3,4],[7.0,4.0,6.0]) |
* +---------------------+-------------------------------+
*
* root
* |-- features: vector (nullable = true)
* |-- shuffled_vector: vector (nullable = true)
*/
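Note that the udfs above draw a new random order for every row and densify each vector first. If the goal is instead to apply the same feature reordering to every row (one possible reading of the question), a minimal sketch could look like the following. The object name FeaturePermutation, its two methods, and the seed are assumptions made for illustration; they are not Spark APIs. Only the stored entries are moved, so a large tf-idf vocabulary never has to be expanded into a dense array.

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}
import scala.util.Random

object FeaturePermutation {
  // Draw one permutation of 0 until size; perm(i) is the new position of feature i.
  def randomPermutation(size: Int, seed: Long): Array[Int] =
    new Random(seed).shuffle((0 until size).toVector).toArray

  // Move each stored entry to its new index. Zeros are never materialised.
  def permute(v: Vector, perm: Array[Int]): Vector = {
    require(v.size == perm.length, "permutation length must match vector size")
    val sv: SparseVector = v.toSparse
    // Sort by the new index so the sparse indices stay in increasing order.
    val moved = sv.indices.map(i => perm(i)).zip(sv.values).sortBy(_._1)
    Vectors.sparse(sv.size, moved.map(_._1), moved.map(_._2))
  }
}

// Possible usage with the question's variables (assumes cvModel and rescaledData exist):
//   import org.apache.spark.sql.functions.udf
//   val perm = FeaturePermutation.randomPermutation(cvModel.vocabulary.length, seed = 42L)
//   val permuteUdf = udf((v: Vector) => FeaturePermutation.permute(v, perm))
//   val permuted = rescaledData.withColumn("permuted_features", permuteUdf($"features"))
```

Because the permutation is drawn once with a fixed seed and captured by the udf, every row is reordered the same way and the result is reproducible across runs.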