How to convert CSV to Parquet files in HDFS

Question

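I'm new to Big Data, so Hadoop and HDFS are still somewhat confusing to me, which is why I'm asking for help. I currently have 4 CSV files stored in an HDFS cluster, and I need to use Python to make 4 copies of them in Parquet format, but I don't know how. I hope you can help me with this problem, which shouldn't be hard.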

Answer 1

I've written the example in Scala, but doing it in Python is almost the same.

I've also added some comments and explanations.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object ReadCsv {
  val spark = SparkSession
    .builder()
    .appName("ReadCsv")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
    .config("spark.app.id","ReadCsv") // To silence Metrics warning
    .getOrCreate()

  val sqlContext = spark.sqlContext

  def main(args: Array[String]): Unit = {

    Logger.getRootLogger.setLevel(Level.ERROR)

    try {

      val df = sqlContext
        .read
        .csv("/path/directory_to_csv_files/") // Reads every .csv file in the directory into one DataFrame
        .cache()

      df.repartition(4) // Four partitions, so four part files in total (not one per input CSV)
          .write
          .parquet("/path/directory_to_parquet_files/") // Output is Snappy-compressed by default (part-*.snappy.parquet)
      // To write uncompressed Parquet instead, set this before the write:
      // sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

      // Keep the application alive so the Spark web UI can be inspected at http://localhost:4040/
      println("Press Enter in the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
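
Since the question asks for Python, here is a rough PySpark sketch of the same idea. It differs from the Scala code in one respect: it converts each CSV separately, so each input gets its own Parquet output instead of all rows being merged into one dataset. The names `parquet_target` and `convert_csv_to_parquet`, the `header=True` default, and the paths in the usage note are my own assumptions for illustration; adjust them to your files and cluster.

```python
import posixpath


def parquet_target(csv_path, out_dir):
    """Map an input such as /in/sales.csv to <out_dir>/sales.parquet."""
    # posixpath is used because HDFS paths always use forward slashes.
    stem = posixpath.splitext(posixpath.basename(csv_path))[0]
    return posixpath.join(out_dir, stem + ".parquet")


def convert_csv_to_parquet(csv_paths, out_dir, header=True, compression="snappy"):
    """Read each CSV with Spark and write it back out as Parquet."""
    # Imported here so the path helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CsvToParquet").getOrCreate()
    try:
        for csv_path in csv_paths:
            # header=True assumes the first row holds column names (an assumption).
            df = spark.read.csv(csv_path, header=header, inferSchema=True)
            # Spark writes a directory of part files at the target path.
            df.write.parquet(
                parquet_target(csv_path, out_dir),
                mode="overwrite",
                compression=compression,
            )
    finally:
        spark.stop()
```

You would call it with something like `convert_csv_to_parquet(["/data/in/a.csv", "/data/in/b.csv"], "/data/out")`, where the paths are placeholders. Note that each output, for example `/data/out/a.parquet`, is a directory containing one or more part files rather than a single file; that is how Spark normally writes Parquet.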
