Save a pair RDD to a Parquet file in Scala
I have an RDD[Map[String, String]] and need to convert it to a DataFrame so that I can save the data to a Parquet file in which the map keys are the column names.
For example:
val inputRdf = spark.sparkContext.parallelize(List(
  Map("city" -> "", "ip" -> "42.106.1.102", "source" -> "PlayStore", "createdDate" -> "2020-04-21"),
  Map("city" -> "delhi", "ip" -> "42.1.15.102", "source" -> "PlayStore", "createdDate" -> "2020-04-21"),
  Map("city" -> "", "ip" -> "42.06.15.102", "source" -> "PlayStore", "createdDate" -> "2020-04-22")))
Output:
City | ip
Delhi| 1.234
Here is some guidance for solving your problem:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object MapToDfParquet {
  val spark = SparkSession
    .builder()
    .appName("MapToDfParquet")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // a more reasonable number of shuffle partitions for this small data set
    .config("spark.app.id", "MapToDfParquet") // silences the Metrics warning
    .getOrCreate()
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)
    try {
      import spark.implicits._
      val data = Seq(
        Map("city" -> "delhi", "ip" -> "42.1.15.102", "source" -> "PlayStore", "createdDate" -> "2020-04-21"),
        Map("city" -> "", "ip" -> "42.06.15.102", "source" -> "PlayStore", "createdDate" -> "2020-04-22"))
      // Look each column up by key instead of joining the values into a
      // comma-separated string and splitting it again: that round trip depends
      // on the Map's iteration order and breaks on values that contain a comma
      // or on a trailing empty value.
      val df = sc.parallelize(data)
        .map(m => (m.getOrElse("city", ""), m.getOrElse("ip", ""), m.getOrElse("source", ""), m.getOrElse("createdDate", "")))
        .toDF("city", "ip", "source", "createdDate")
      df.show(truncate = false)
      // By default Spark writes Parquet with snappy compression;
      // here we change that behavior and write uncompressed Parquet.
      sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
      df
        .write
        .parquet("hdfs://quickstart.cloudera/user/cloudera/parquet")
      // Keep the application alive so the Spark web console can be viewed: http://localhost:4040/
      println("Press Enter in the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      spark.stop() // also stops the underlying SparkContext
      println("SparkSession stopped")
    }
  }
}
Expected output:
+-----+------------+---------+-----------+
|city |ip          |source   |createdDate|
+-----+------------+---------+-----------+
|delhi|42.1.15.102 |PlayStore|2020-04-21 |
|     |42.06.15.102|PlayStore|2020-04-22 |
+-----+------------+---------+-----------+
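The code above hard-codes the four column names. If the keys are not known in advance, something along the following lines might work instead: it derives the columns from the maps themselves and builds the DataFrame with an explicit schema. This is only a sketch under a few assumptions: every value is a string, an extra pass over the RDD to collect the keys is acceptable, a key missing from a map should become null, and alphabetical column order is fine. MapRddToParquet and toDataFrame are hypothetical names chosen for illustration, not part of Spark.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper object, not part of the original answer.
object MapRddToParquet {

  // Builds a DataFrame whose columns are the union of all map keys.
  def toDataFrame(spark: SparkSession, rdd: RDD[Map[String, String]]): DataFrame = {
    // One extra pass over the data to discover every key that occurs;
    // sorting gives a stable (alphabetical) column order.
    val columns: Array[String] = rdd.flatMap(_.keys).distinct().collect().sorted

    // Map values are strings, so every column is a nullable string column.
    val schema = StructType(columns.map(name => StructField(name, StringType, nullable = true)))

    // Look each value up by key; a key missing from a map becomes null.
    val rows: RDD[Row] = rdd.map(m => Row.fromSeq(columns.toSeq.map(c => m.get(c).orNull)))

    spark.createDataFrame(rows, schema)
  }
}
```

With the inputRdf from the question, MapRddToParquet.toDataFrame(spark, inputRdf) would give a DataFrame with the columns city, createdDate, ip and source, which can then be written with .write.parquet(...) exactly as in the answer above.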