
How to generate large word count file in Spark?

I want to generate a 10-million-line word count file for performance testing (every line contains the same sentence), but I don't know how to code it.

Could you give me some sample code that saves the file directly to HDFS?

You could try something like this.

Generate one column with values from 1 to 100,000 and another with values from 1 to 100, then explode both with explode(column). You can't generate a single column with 10 million values in one go, because the Kryo buffer throws an error.

I don't know whether this is the best approach performance-wise, but it's the quickest one I can think of right now.

import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._

// UDF that returns the sequence 1..s; it exists only so rows can be fanned out with explode
val generateList = udf((s: Int) => {
    val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
    for(i <- 1 to s) {
        buf += i
    }
    buf
})

val someDF = Seq(
  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")

// Attach a 100,000-element array and a 100-element array to the single row
val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
   .withColumn("genColumn2", generateList(lit(100)))
// Explode the 100k array: 1 row becomes 100,000 rows
val someDfWithMilColumn100k  = someDfWithMilColumn
   .withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")
// Explode the 100-element array: 100,000 rows become 10,000,000 rows
val someDfWithMilColumn10mil = someDfWithMilColumn100k
   .withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")

someDfWithMilColumn10mil.write.parquet(path) // `path` is your target output path
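
If you need a plain text file on HDFS (as the question asks) rather than Parquet, here is a minimal sketch of the same final write; the HDFS path is a hypothetical placeholder:

// After the drops only the "sentence" column remains, so the DataFrame can be written as text directly.
// "hdfs:///tmp/ten_million_sentences" is a hypothetical path; substitute your own.
someDfWithMilColumn10mil
  .write
  .mode("overwrite")
  .text("hdfs:///tmp/ten_million_sentences")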

You can do this by joining the two DataFrames below; explanations are inline in the code.

import org.apache.spark.sql.SaveMode

object GenerateTenMils {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // the answer author's own helper that builds the SparkSession
    spark.conf.set("spark.sql.crossJoin.enabled","true") // Enable cross join
    import spark.implicits._

    //Create a DF with your sentence
    val df = List("each line has the same sentence").toDF

    //Create another Dataset with 10000000 records
    spark.range(10000000)
      .join(df)    // Cross Join the dataframes
      .coalesce(1)  // Output to a single file
      .drop("id")       // Drop the extra column
      .write
      .mode(SaveMode.Overwrite)
      .text("src/main/resources/tenMils") // Write as text file
  }

}
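
To sanity-check the output and exercise the word-count workload it is meant for, here is a minimal read-back sketch; reusing the path above, an existing SparkSession named `spark`, and splitting on whitespace are assumptions:

import org.apache.spark.sql.functions.{col, explode, split}

// Read the generated lines back (the text source exposes a single "value" column) and count each word.
// "src/main/resources/tenMils" matches the path written above; point it at HDFS if you wrote there.
val lines = spark.read.text("src/main/resources/tenMils")
val wordCounts = lines
  .select(explode(split(col("value"), "\\s+")).as("word")) // one row per word
  .groupBy("word")
  .count()
wordCounts.show(false)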

You can follow this approach.

Use tail recursion to generate the list of sentences and the DataFrames, and union them to build the large DataFrame.

import scala.annotation.tailrec
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession
  .builder()
  .appName("TenMillionsRows")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
  .config("spark.app.id", "TenMillionsRows")   // To silence Metrics warning
  .getOrCreate()

val sc = spark.sparkContext

import spark.implicits._

val sentence = "hope for the best but prepare for the worst"

/**
  * Returns a List with the sentence repeated num times
  * @param sentence
  * @param num
  * @return
  */
def getList(sentence: String, num: Int): List[String] = {
  @tailrec
  def loop(st: String, n: Int, acc: List[String]): List[String] = {
    n match {
      case 0 => acc
      case _ => loop(st, n - 1, st :: acc)
    }
  }
  loop(sentence, num, List())
}

/**
  * Returns a DataFrame that is the union of num DataFrames built from lst
  * @param lst
  * @param num
  * @return
  */
def getDataFrame(lst: List[String], num: Int): DataFrame = {
  @tailrec
  def loop(ls: List[String], n: Int, acc: DataFrame): DataFrame = {
    n match {
      case 0 => acc
      case _ => loop(lst, n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
    }
  }
  // The accumulator is seeded with a one-row DataFrame, which is why the final count is 10,000,001
  loop(lst, num, sc.parallelize(List(sentence)).toDF("sentence"))
}

val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence, 100)

println(dfs.count())
// output: 10000001 (100,000 x 100 rows plus the one-row seed)
dfs.write.orc("path_to_hdfs") // write the DataFrame to an ORC file
// you can also save it as parquet, text, json, ... with dataframe.write
Hope this helps.
