
Splitting dataframe into multiple dataframes in Scala Spark

I have the JSON file below (details) in Hadoop. I am able to read this file from HDFS using the SQLContext read.json method. I then want to split the file into multiple files depending on the date, adding the date to each file name (the file can contain any number of dates).

Input file name: details

{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}
{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}

Expected output files:

File name 1: details_20190927

{"Name": "Pam", "Address": "", "Gender":"F", "Date": "2019-09-27 06:47:57"}
{"Name": "David", "Address": "", "Gender":"M", "Date": "2019-09-27 10:47:56"}

File name 2: details_20190926

{"Name": "Mike", "Address": "", "Gender":"M", "Date": "2019-09-26 08:48:57"}

The paths won't be exactly as you have specified them, but you can write the records to different files like this:

import org.apache.spark.sql.functions._
import spark.implicits._

val parsed = spark.read.json("details.json")
// Derive a yyyyMMdd string from the timestamp so records can be grouped by day
val withPartitionValue = parsed.withColumn("PartitionValue", date_format(col("Date"), "yyyyMMdd"))
// Repartition by that value so each date's records land in a single output file
val repartitioned = withPartitionValue.repartition(col("PartitionValue"))
// Writes one subfolder per date, e.g. .../PartitionValue=20190927/part-....json
repartitioned.write.partitionBy("PartitionValue").json("/my/output/folder")
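If you need the literal details_YYYYMMDD names from the question rather than Spark's PartitionValue=... folders, one option is to split the DataFrame yourself. Below is a minimal sketch, assuming the number of distinct dates is small enough to collect to the driver; it reuses withPartitionValue and the output folder from the snippet above, both of which are placeholders:

import org.apache.spark.sql.functions.col

// Collect the distinct dates to the driver (fine for a handful of dates),
// then write one filtered DataFrame per date under its own folder.
val dates = withPartitionValue
  .select("PartitionValue")
  .distinct()
  .as[String]          // requires spark.implicits._, imported above
  .collect()

dates.foreach { d =>
  withPartitionValue
    .filter(col("PartitionValue") === d)
    .drop("PartitionValue")   // keep only the original columns
    .coalesce(1)              // optional: a single part file per date
    .write
    .json(s"/my/output/folder/details_$d")
}

Note that Spark still writes each details_YYYYMMDD path as a directory containing a part-*.json file; if you need a single plain file with exactly that name, you would have to rename the part file afterwards, e.g. with the Hadoop FileSystem API.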

