How to generate large word count file in Spark?
I want to generate a 10-million-line word count file for a performance test (each line contains the same sentence), but I have no idea how to code it. Can you give me example code that saves the file directly to HDFS?
You can try something like this. Generate one column with values from 1 to 100k and another with values from 1 to 100, then explode both of them with explode(column). You can't generate a single column with 10 million values because the Kryo buffer will throw an error. I don't know if this is the best-performing way to do it, but it's the fastest way I can think of right now.
import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._

// UDF that builds the array [1, 2, ..., s]
val generateList = udf((s: Int) => {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  for (i <- 1 to s) {
    buf += i
  }
  buf
})

val someDF = Seq(
  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")

// Attach a 100k-element array and a 100-element array to the single row
val someDfWithMilColumn = someDF
  .withColumn("genColumn1", generateList(lit(100000)))
  .withColumn("genColumn2", generateList(lit(100)))

// Explode the 100k array: 1 row becomes 100,000 rows
val someDfWithMilColumn100k = someDfWithMilColumn
  .withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")

// Explode the 100-element array: 100,000 rows become 10,000,000
val someDfWithMilColumn10mil = someDfWithMilColumn100k
  .withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")

someDfWithMilColumn10mil.write.parquet(path)
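As a side note, if you are on Spark 2.4 or later, the built-in sequence function can replace the UDF entirely. A minimal sketch of the same 100k x 100 split (the column names a and b are arbitrary):

import org.apache.spark.sql.functions.{explode, lit, sequence}
import spark.implicits._

// sequence(lit(1), lit(n)) builds the array [1, ..., n] without a UDF
val tenMilDF = Seq("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
  .toDF("sentence")
  .withColumn("a", explode(sequence(lit(1), lit(100000)))) // 1 row -> 100,000 rows
  .withColumn("b", explode(sequence(lit(1), lit(100))))    // 100,000 -> 10,000,000 rows
  .select("sentence")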
You can do it by cross-joining the two DFs as below; the explanation is inline in the code.
import org.apache.spark.sql.{SaveMode, SparkSession}

object GenerateTenMils {
  def main(args: Array[String]): Unit = {
    // Build the SparkSession (the original used a project-specific helper here)
    val spark = SparkSession.builder().appName("GenerateTenMils").getOrCreate()
    spark.conf.set("spark.sql.crossJoin.enabled", "true") // Enable cross join
    import spark.implicits._

    // Create a DF with your sentence
    val df = List("each line has the same sentence").toDF

    // Create another Dataset with 10,000,000 records
    spark.range(10000000)
      .join(df)      // Cross-join the dataframes: every id pairs with the sentence
      .coalesce(1)   // Output to a single file
      .drop("id")    // Drop the extra column, leaving only the sentence
      .write
      .mode(SaveMode.Overwrite)
      .text("src/main/resources/tenMils") // Write as a text file
  }
}
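Since the question asks to save the file in HDFS directly, the same writer accepts an HDFS URI instead of a local path. The NameNode host, port, and target directory below are placeholders for your cluster, not values from the original code:

// The HDFS URI is hypothetical; substitute your cluster's host, port, and path
spark.range(10000000)
  .join(df)
  .drop("id")
  .write
  .mode(SaveMode.Overwrite)
  .text("hdfs://namenode:8020/user/perftest/tenMils")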
You could follow this approach: tail recursion to generate the list of objects and the DataFrames, and union to build the big DataFrame.
import scala.annotation.tailrec
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession
  .builder()
  .appName("TenMillionsRows")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4") // A more reasonable number of partitions for this data
  .config("spark.app.id", "TenMillionsRows")   // To silence the Metrics warning
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

/**
 * Returns a list of num copies of sentence.
 * @param sentence the sentence to repeat
 * @param num      the number of copies
 * @return the repeated sentences
 */
def getList(sentence: String, num: Int): List[String] = {
  @tailrec
  def loop(st: String, n: Int, acc: List[String]): List[String] = {
    n match {
      case 0 => acc
      case _ => loop(st, n - 1, st :: acc)
    }
  }
  loop(sentence, num, List())
}

/**
 * Returns a DataFrame that is the union of num copies of lst.
 * @param lst the rows to repeat
 * @param num the number of copies to union
 * @return the combined DataFrame
 */
def getDataFrame(lst: List[String], num: Int): DataFrame = {
  @tailrec
  def loop(ls: List[String], n: Int, acc: DataFrame): DataFrame = {
    n match {
      case 0 => acc
      case _ => loop(ls, n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
    }
  }
  // Seed with one copy of the list and union num - 1 more,
  // so the result has exactly lst.size * num rows
  loop(lst, num - 1, sc.parallelize(lst).toDF("sentence"))
}

val sentence = "hope for the best but prepare for the worst"
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence, 100)
println(dfs.count())
// output: 10000000
dfs.write.orc("path_to_hdfs") // write the dataframe to ORC files
// you can also save the file as parquet, text, json, ...
// with dataframe.write
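If you want to sanity-check the result, you can read the files back and count the rows; a quick sketch, reusing the same placeholder path:

// Read the ORC files back and confirm the row count
val check = spark.read.orc("path_to_hdfs")
println(check.count()) // expect 10000000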
Hope this helps.