
Splitting dataframe into multiple dataframes in Scala Spark

I have the JSON file below (details) in Hadoop. I am able to read this file from HDFS using the SQLContext read.json method. I then want to split the file into multiple files depending on the date, adding the date to each file name (the file can contain any number of dates).

Input file name: details

{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}
{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}

Expected output files:

File name 1: details_20190927

{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}

File name 2: details_20190926

{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}

The paths won't be exactly as you have specified them, but you can write the records to different files like this:

import org.apache.spark.sql.functions._
import spark.implicits._

val parsed = spark.read.json("details.json")
// Derive a yyyyMMdd string from the timestamp so records can be grouped by day
val withPartitionValue = parsed.withColumn("PartitionValue", date_format(col("Date"), "yyyyMMdd"))
// Repartition by that value so each date's records land in a single output file
val repartitioned = withPartitionValue.repartition(col("PartitionValue"))
// Writes one subfolder per date, e.g. .../PartitionValue=20190927/part-....json
repartitioned.write.partitionBy("PartitionValue").json("/my/output/folder")
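If you need the literal details_YYYYMMDD names from the question rather than Spark's PartitionValue=... folders, one option is to split the DataFrame yourself. Below is a minimal sketch, assuming the number of distinct dates is small enough to collect to the driver; it reuses withPartitionValue and the output folder from the snippet above, both of which are placeholders:

import org.apache.spark.sql.functions.col

// Collect the distinct dates to the driver (fine for a handful of dates),
// then write one filtered DataFrame per date under its own folder.
val dates = withPartitionValue
  .select("PartitionValue")
  .distinct()
  .as[String]          // requires spark.implicits._, imported above
  .collect()

dates.foreach { d =>
  withPartitionValue
    .filter(col("PartitionValue") === d)
    .drop("PartitionValue")   // keep only the original columns
    .coalesce(1)              // optional: a single part file per date
    .write
    .json(s"/my/output/folder/details_$d")
}

Note that Spark still writes each details_YYYYMMDD path as a directory containing a part-*.json file; if you need a single plain file with exactly that name, you would have to rename the part file afterwards, e.g. with the Hadoop FileSystem API.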

