
Scala - read JSON file as a single String with Spark

I have JSON files describing a table structure. I want to read each file from S3 as a single String so that I can then apply the fromJson() method of apache.spark.sql.types.DataType:

DataType.fromJson(jsonString).asInstanceOf[StructType] 

But so far I have only managed to read the files into a DataFrame:

 val testJsonData = sqlContext.read.option("multiline", "true").json("/s3Bucket/metrics/metric1.json")

But I don't need df.schema; instead, I need to parse the contents of a JSON string into a StructType.

The contents of a JSON file:

{
  "type" : "struct",
  "fields" : [ {
    "name" : "metric_name",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "metric_time",
    "type" : "long",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "metric_value",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }]
}
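
Parsing that content is not the problem: if I inline it as a plain String, DataType.fromJson handles it fine. A minimal local sketch of what I'm after (the only missing piece is getting the file from S3 into such a String):

import org.apache.spark.sql.types.{DataType, StructType}

// The JSON above, inlined as a plain String for a local check
val jsonString =
  """{"type":"struct","fields":[
    |  {"name":"metric_name","type":"string","nullable":true,"metadata":{}},
    |  {"name":"metric_time","type":"long","nullable":true,"metadata":{}},
    |  {"name":"metric_value","type":"string","nullable":true,"metadata":{}}
    |]}""".stripMargin

val schema = DataType.fromJson(jsonString).asInstanceOf[StructType]
println(schema.treeString)
// root
//  |-- metric_name: string (nullable = true)
//  |-- metric_time: long (nullable = true)
//  |-- metric_value: string (nullable = true)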

It looks like what you want to use is sc.wholeTextFiles (sc is a SparkContext in this case).

This results in an RDD[(String, String)], where ._1 is the file name and ._2 is the entire file content. Maybe you can try:

// Read every file under the prefix as (filePath, fileContent) pairs
val files = sc.wholeTextFiles("/s3Bucket/metrics/", 16)
// Parse each file's content into a StructType
val schemas = files.map { case (_, content) => DataType.fromJson(content).asInstanceOf[StructType] }

Which, in theory, gives you an RDD[StructType]. Unfortunately, I'm not finding a similar function in the pure Spark SQL API, but this may work.
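
If you then want to apply one of those schemas, you can collect them back to the driver (schema files are tiny) and pass one to a reader. A quick sketch; the data path here is made up for illustration:

// Bring the parsed schemas back to the driver
val parsedSchemas = schemas.collect()

// Read the actual metric data with an explicit schema (hypothetical path)
val metricsDf = sqlContext.read.schema(parsedSchemas.head).json("/s3Bucket/data/metric1/")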
