
Reading JSON files into Spark Dataset and adding columns from a separate Map

Spark 2.1 and Scala 2.11 here. I have a large Map[String, Date] that has 10K key/value pairs in it. I also have 10K JSON files living on a file system that is accessible to Spark:

mnt/
    some/
        path/
            data00001.json
            data00002.json
            data00003.json
            ...
            data10000.json

Each KV pair in the map corresponds to its respective JSON file (hence the 1st map KV pair corresponds to data00001.json, etc.).
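For concreteness, here is a minimal sketch of the map's shape (the key names and dates below are made up for illustration; they are not my actual data):

import java.util.Date

// Illustrative only: three entries standing in for the real 10K-entry map.
val fileDates: Map[String, Date] = Map(
  "key00001" -> new Date(1483228800000L), // pairs with data00001.json
  "key00002" -> new Date(1483315200000L), // pairs with data00002.json
  "key00003" -> new Date(1483401600000L)  // pairs with data00003.json
)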

I want to read all these JSON files into one large Spark Dataset and, while I'm at it, add two new columns to this dataset (that don't exist in the JSON files). Each map key will be the value for the first new column, and each key's value will be the value for the second new column:

val objectSummaries = getScalaList()
val dataFiles = objectSummaries.filter { _.getKey.endsWith("data.json") }
val dataDirectories = dataFiles.map(dataFile => {
  val keyComponents = dataFile.getKey.split("/")
  val parent = if (keyComponents.length > 1) keyComponents(keyComponents.length - 2) else "/"
  (parent, dataFile.getLastModified)
})

// TODO: How to take each KV pair from dataDirectories above and store them as the values for the
// two new columns?
val allDataDataset = spark.read.json("mnt/some/path/*.json")
  .withColumn("new_col_1", dataDirectories._1)
  .withColumn("new_col_2", dataDirectories._2)

I've confirmed that Spark will honor the wildcard (mnt/some/path/*.json) and read all the JSON files into a single Dataset when I remove the withColumn methods and do an allData.show(). So I'm all good there.

What I'm struggling with is: how do I add the two new columns and correctly populate them from the key/value map elements?

If I understood correctly, you want to correlate each KV pair from the map with a dataframe read from its JSON file.

I'll try to simplify the problem to only 3 files and 3 key/value pairs, all in order.

val kvs = Map("a" -> 1, "b" -> 2, "c" -> 3)
val files = List("data0001.json", "data0002.json", "data0003.json")

Define a case class to make handling the files, keys, and values easier:

case class FileWithKV(fileName: String, key: String, value: Int)

Zip the files with the KVs:

val filesWithKVs = files.zip(kvs)
  .map(p => FileWithKV(p._1, p._2._1, p._2._2))

It will look like this:

filesWithKVs: List[FileWithKV] = List(FileWithKV(data0001.json,a,1), FileWithKV(data0002.json,b,2), FileWithKV(data0003.json,c,3))
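One caveat about the ordering assumption: the iteration order of a plain Scala Map is not guaranteed in general (very small maps like the one above do happen to iterate in insertion order), so if the pairing is positional it is safer to sort both sides first. A sketch:

// Sort the KV pairs and the file names so the positional zip is deterministic.
val orderedKvs = kvs.toSeq.sortBy(_._1)
val filesWithKVsSorted = files.sorted.zip(orderedKvs)
  .map { case (file, (k, v)) => FileWithKV(file, k, v) }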

We then start with an initial dataframe built from the head of our collection, and fold left over the rest to construct the entire dataframe that will hold all the files, with the two new columns dynamically generated from each KV:

import org.apache.spark.sql.functions.lit

val head = filesWithKVs.head
val initialDf = spark
  .read.json(head.fileName)
  .withColumn("new_col_1", lit(head.key))
  .withColumn("new_col_2", lit(head.value))

Now the folding part:

val dfAll = filesWithKVs.tail.foldLeft(initialDf)((df, fileWithKV) => {
  val newDf = spark
    .read.json(fileWithKV.fileName)
    .withColumn("new_col_1", lit(fileWithKV.key))
    .withColumn("new_col_2", lit(fileWithKV.value))
  // union the dataframes to capture file by file, key value with key value
  df.union(newDf)
})

Assuming each of the 3 JSON files contains a column named bar with the value foo, the dataframe will look like this:

+---+----------+----------+
|bar|new_col_1 |new_col_2 |
+---+----------+----------+
|foo|         a|         1|
|foo|         b|         2|
|foo|         c|         3|
+---+----------+----------+
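A note on this design choice: folding 10K unions produces a very deep logical plan, which can strain the driver. An alternative sketch (my addition, not part of the approach above; the column names are illustrative) reads everything with the wildcard in one pass, tags each row with its source file via input_file_name(), and joins against a small lookup dataframe built from filesWithKVs:

import org.apache.spark.sql.functions.{broadcast, input_file_name, regexp_extract}
import spark.implicits._

// One row per file, carrying the key/value destined for that file's rows.
val lookupDf = filesWithKVs.toDF("file_name", "new_col_1", "new_col_2")

val allData = spark.read.json("mnt/some/path/*.json")
  // keep only the file-name portion of the full path each row came from
  .withColumn("file_name", regexp_extract(input_file_name(), "[^/]+$", 0))
  .join(broadcast(lookupDf), Seq("file_name"))
  .drop("file_name")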

I think you should create your own datasource for this. This new datasource would know about your particular folder structure and content structure.
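A minimal sketch of what that could look like with the Spark 2.1-era DataSource V1 API, assuming the file-to-KV mapping is passed in directly (all names here are hypothetical, and the relation reads eagerly for simplicity, so treat it as an outline rather than a production implementation):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.functions.{input_file_name, udf}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation that knows the folder layout and attaches the two
// extra columns itself; fileKeys maps a file name to its (key, value) pair.
class KeyedJsonRelation(
    override val sqlContext: SQLContext,
    path: String,
    fileKeys: Map[String, (String, Int)]) extends BaseRelation with TableScan {

  // Built once so that schema and buildScan stay consistent.
  private lazy val df: DataFrame = {
    val keyFor = udf((f: String) => fileKeys.get(f.split("/").last).map(_._1).orNull)
    val valueFor = udf((f: String) => fileKeys.get(f.split("/").last).map(_._2))
    sqlContext.read.json(path)
      .withColumn("new_col_1", keyFor(input_file_name()))
      .withColumn("new_col_2", valueFor(input_file_name()))
  }

  override def schema: StructType = df.schema
  override def buildScan(): RDD[Row] = df.rdd
}

It can then be surfaced as an ordinary dataframe with sqlContext.baseRelationToDataFrame(new KeyedJsonRelation(sqlContext, "mnt/some/path/*.json", fileKeys)).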
