How to load part of a JSON file into a DataFrame?

Question

I have a file with contents like this:

a {"field1":{"field2":"val","field3":"val"...}}
b {"field1":{"field2":"val","field3":"val"...}}
...

and I was able to load the file into a table like this:

╔════╦═══════════════════════════════════════════════╗
║ ID ║  JSON                                         ║
╠════╬═══════════════════════════════════════════════╣
║  a ║ {"field1":{"field2":"val","field3":"val"...}} ║
║  b ║ {"field1":{"field2":"val","field3":"val"...}} ║
╚════╩═══════════════════════════════════════════════╝

How can I turn it into something like this?

╔════╦═════════╦════════╦════════╦════════╗
║ ID ║ field2  ║ field3 ║ ...    ║ ...    ║
╠════╬═════════╬════════╬════════╬════════╣
║  a ║ val     ║ val    ║ ...    ║ ...    ║
║  b ║ val     ║ val    ║ ...    ║ ...    ║
╚════╩═════════╩════════╩════════╩════════╝

Since each line is only partly JSON, I cannot use read.json on the file directly. I also saw the post "convert lines of json in RDD to dataframe in apache Spark", but my JSON string is nested and very long, so I do not want to list out all the fields. I also tried:

#solr_data is the data frame made from the file, and json is the column with the json string, session is a SparkSession
json_table = solr_data.select(solr_data["json"]).rdd.map(lambda x:session.read.json(x))

That did not work. I can neither show() nor collect() the result, and createDataFrame() did not work on it either.

Answer 1

Use select("JSON.field1.*") to "unpack" the nested JSON into columns.
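
The star expansion above applies to a struct column, so if JSON still holds plain strings (as in the question's table) it probably has to be parsed first. Below is a minimal sketch of one way to do that, assuming PySpark 2.1 or later. The helper names split_line and flatten_field1 are made up for this illustration and appear in neither the question nor the answer. The schema is inferred from the strings themselves with spark.read.json, so no field has to be listed by hand, at the cost of one extra pass over the data.

```python
def split_line(line):
    """Split one raw line of the form 'ID {json...}' into (ID, json_string)."""
    row_id, _, payload = line.partition(" ")
    return row_id, payload.strip()


def flatten_field1(spark, lines):
    """Return a DataFrame with ID plus one column per key under field1.

    `spark` is an existing SparkSession; `lines` is a list of raw lines.
    """
    # PySpark is imported here so split_line stays usable without it.
    from pyspark.sql import functions as F

    # Two string columns, matching the table shown in the question.
    df = spark.createDataFrame([split_line(l) for l in lines], ["ID", "JSON"])

    # Let Spark infer the nested schema from the JSON strings.
    schema = spark.read.json(df.rdd.map(lambda row: row["JSON"])).schema

    # Turn the string column into a struct, then expand field1 into columns.
    parsed = df.withColumn("JSON", F.from_json(F.col("JSON"), schema))
    return parsed.select("ID", "JSON.field1.*")
```

With an active session, something like flatten_field1(session, open(path).read().splitlines()).show() should print ID followed by one column per key under field1, where path is a placeholder for the input file. For large files the lines would be read with Spark rather than collected into a Python list.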

Answer 1
0 · Accepted · 2017-07-20 16:40:54
