How to convert Spark Dataframe to JSON using json4s, in Scala?

I'm trying to convert a dataframe to a JSON string, but the output is just {}. Not sure what I'm doing wrong?

This is just a test, but the full Dataframe schema I need to use has 800+ columns, so if possible I don't want to have to specify each field explicitly in the code. The code runs in a locked-down corporate environment, so I can't write or read files on the system; it has to be string output only.

import org.json4s.jackson.Serialization.write
import org.json4s.DefaultFormats

implicit val formats = DefaultFormats

val test = spark.sql("SELECT field1, field2, field3 FROM myTable LIMIT 2");

println("Output:");
write(test);


Output:
res12: String = {}

To add insult to injury, I could use the built-in toJSON function, but our corporate environment has spark.sql.jsonGenerator.ignoreNullFields set to true and it can't be changed, while the output has to include null fields - hoping json4s can oblige :)

Thanks

Not sure what I'm doing wrong?

That's because spark.sql(...) returns a DataFrame, and all instance variables of DataFrame are private, so your parser will basically just ignore them. You can try this:

case class PrivateStuff(private val thing: String)

write(PrivateStuff("something"))
// outputs {}
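
For contrast, the same write call on a case class with a public field (assuming the implicit DefaultFormats from the question is still in scope) does include the field:

case class PublicStuff(thing: String) // public field, so json4s can see it

write(PublicStuff("something"))
// outputs {"thing":"something"}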

So you can't just convert a whole DataFrame to JSON. What you can do instead is collect your data (which returns Array[Row]), convert each row into a Scala object (since converting Row objects to JSON directly is probably not what you want), and then use the write function:

import org.apache.spark.sql.Row

case class YourModel(x1: String, ...)
object YourModel {
  def fromRow(row: Row): Option[YourModel] = ??? // conversion logic here
}

val myData: Array[YourModel] = spark.sql("SELECT ...")
  .collect()
  .map(YourModel.fromRow)
  .collect { case Some(value) => value }

write(myData)
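
For illustration, here is how fromRow might look for a small, hypothetical two-column schema (the Person model and its field names are assumptions made for this sketch, not part of the original question):

import org.apache.spark.sql.Row
import scala.util.Try

// Hypothetical two-column model; the real 800+ column schema would have far more fields.
case class Person(name: String, age: Long)

object Person {
  // getAs throws if a column is missing or holds an unexpected type,
  // so wrap the conversion and drop rows that don't fit the model.
  def fromRow(row: Row): Option[Person] =
    Try(Person(row.getAs[String]("name"), row.getAs[Long]("age"))).toOption
}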

Update


After your explanation about the size of the rows, it doesn't make sense to create case classes. You can use the json method of the Row class instead (and it doesn't care about spark.sql.jsonGenerator.ignoreNullFields):

import spark.implicits._ // Encoder[String] for map (auto-imported in spark-shell)

val test = spark.sql("SELECT field1, field2, field3 FROM myTable LIMIT 2")

val jsonDF = test.map(_.json) // Dataset[String], one JSON object per row

This is a Dataset of JSON strings (one per row); you can collect them, save them to files, show them - basically anything you can do with a dataframe.
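
If a single JSON string is needed in the end, one option (a sketch, reusing the jsonDF from above) is to collect the per-row strings on the driver and join them into one JSON array:

// Collect the row-level JSON strings and wrap them in a JSON array string.
// Fine for a LIMIT 2 query; be careful with collect() on large results.
val jsonArray: String = jsonDF.collect().mkString("[", ",", "]")
println(jsonArray)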
