Writing each row in a spark dataframe to a separate json

I have a fairly large dataframe (a million rows), and the requirement is to store each row in a separate JSON file.

For this data frame

 root
 |-- uniqueID: string 
 |-- moreData: array 

The output should be stored as below for every row:

s3://.../folder[i]/<uniqueID>.json

where i is the first letter of the uniqueID

I have looked at other questions and solutions, but they don't satisfy my requirements. I am trying to do this in a more time-optimized way, and from what I have read so far, re-partitioning is not a good option.

I tried writing the df with the maxRecordsPerFile option, but I can't seem to control the naming of the files.

df.write.mode("overwrite")
.option("maxRecordsPerFile", 1)
.json(outputPath)

I am fairly new to Spark; any help is much appreciated.

I don't think there is really an optimized (if we take that to mean "much faster than any other") method of doing this. It's fundamentally an inefficient operation, and one that I can't really see a good use case for. But, assuming you really have thought this through and decided this is the best way to solve the problem at hand, I would suggest you reconsider using the repartition method on the dataframe; it can take a column to be used as the partitioning expression. The only thing it won't do is split files across directories the way you want.
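For illustration only, here is a minimal sketch of that column-based overload (outputPath is a placeholder for your destination, and col comes from org.apache.spark.sql.functions). Note that on its own this will not give you one file per row, nor split the output across directories by first letter:

 import org.apache.spark.sql.functions.col

 // hash-partitions rows by uniqueID: rows with the same value always land in the same
 // partition, but with the default number of shuffle partitions this does NOT
 // produce one output file per row
 df.repartition(col("uniqueID"))
   .write.mode("overwrite")
   .json(outputPath)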

I suppose something like this might work:

import java.io.File
import scala.reflect.io.Directory
import org.apache.spark.sql.functions.col
import spark.implicits._  // `spark` is the active SparkSession (available by default in spark-shell)

// dummy data; the last column plays the role of uniqueID
val df = Seq(("A", "B", "XC"), ("D", "E", "YF"), ("G", "H", "ZI"), ("J", "K", "ZL"), ("M", "N", "XO")).toDF("FOO", "BAR", "uniqueID")

// List of all possible prefixes for the index column. If you need to generate this
// from the data, replace this with a query against the input dataframe to do that.
val prefixes = List("X", "Y", "Z")

// replace with your path
val basePath = "/.../data"

prefixes.foreach { p =>
  val data = df.filter(col("uniqueID").startsWith(p))
  val outDir = new Directory(new File(s"$basePath/$p"))
  // repartition so there is exactly one record per partition, hence one output file per record
  data.repartition(data.count.toInt)
      .write.format("json").save(outDir.path)
}

The above doesn't quite meet the requirement since you can't control the output file name [1]. We can use a shell script to fix the file names afterward. This assumes you are running in an environment with bash and jq available.

#!/usr/bin/env bash

# replace with the path that contains the directories to process
cd /.../data

for sub_data_dir in ./*; do
  cd "${sub_data_dir}"
  rm -f _SUCCESS
  for f in ./part-*.json; do
    # each part file contains a single record; use its uniqueID as the file name
    uuid="$(jq -r '.uniqueID' "${f}")"
    mv "${f}" "${uuid}.json"
  done
  cd ..
done
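If the output lives on HDFS or S3 (via an s3a:// path) rather than a local filesystem, the bash script above won't apply directly. A rough, illustrative sketch of the same renaming step using the Hadoop FileSystem API is below; it reuses the basePath and prefixes values from the Scala snippet above, and it reads each single-record file back just to recover its uniqueID, so it is not fast:

 import org.apache.hadoop.fs.{FileSystem, Path}

 val hadoopConf = spark.sparkContext.hadoopConfiguration
 prefixes.foreach { p =>
   val dir = new Path(s"$basePath/$p")
   val fs  = dir.getFileSystem(hadoopConf)
   fs.listStatus(dir)
     .map(_.getPath)
     .filter(_.getName.startsWith("part-"))
     .foreach { partFile =>
       // each part file holds exactly one record, so read it back to get its uniqueID
       val id = spark.read.json(partFile.toString).first().getAs[String]("uniqueID")
       fs.rename(partFile, new Path(dir, s"$id.json"))
     }
 }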

[1]: Spark doesn't give you an option to control individual file names when using Dataframe.write, because that isn't how it is meant to be used. The intended usage is on a multi-node Hadoop cluster where data may be distributed arbitrarily between the nodes. The write operation is coordinated among all nodes and targets a path on the shared HDFS. In that case it makes no sense to talk about individual files, because the operation is performed at the dataframe level, and so you can only control the naming of the directory where the output files will be written (as the argument to the save method).
