How to load part of a JSON file into a DataFrame?
I have a file with contents like this:
a {"field1":{"field2":"val","field3":"val"...}}
b {"field1":{"field2":"val","field3":"val"...}}
...
I was able to load the file into a table like this:
╔════╦═══════════════════════════════════════════════╗
║ ID ║ JSON                                          ║
╠════╬═══════════════════════════════════════════════╣
║ a  ║ {"field1":{"field2":"val","field3":"val"...}} ║
║ b  ║ {"field1":{"field2":"val","field3":"val"...}} ║
╚════╩═══════════════════════════════════════════════╝
How can I turn it into something like this?
╔════╦════════╦════════╦═════╦═════╗
║ ID ║ field2 ║ field3 ║ ... ║ ... ║
╠════╬════════╬════════╬═════╬═════╣
║ a  ║ val    ║ val    ║ ... ║ ... ║
║ b  ║ val    ║ val    ║ ... ║ ... ║
╚════╩════════╩════════╩═════╩═════╝
Since only part of each line is JSON, I cannot use read.json on the file directly.
I also saw the post "convert lines of json in RDD to dataframe in apache Spark", but my JSON string is nested and very long, so I do not want to list out all the fields. I also tried:
#solr_data is the data frame made from the file, and json is the column with the json string, session is a SparkSession
json_table = solr_data.select(solr_data["json"]).rdd.map(lambda x:session.read.json(x))
That did not work. I can call neither show() nor collect() on the result, and createDataFrame() did not work on it either.
Use select("JSON.field1.*") to "destructure" the sub-JSON into columns.
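The star expansion in select("JSON.field1.*") only applies when the JSON column is a struct. If it is still a plain string, as in the question, it has to be parsed first. Below is a minimal sketch, assuming PySpark, an existing SparkSession passed in as `spark`, and lines made of an ID, one space, then a JSON object. The helper names `split_line` and `flatten_field1` are hypothetical, and inferring the schema by running `spark.read.json` over the JSON strings is just one possible way to avoid listing every field by hand. As a side note, the `rdd.map(lambda x: session.read.json(x))` attempt most likely fails because a SparkSession can only be used on the driver, not inside a transformation.

```python
def split_line(line):
    """Split a raw line such as 'a {"field1": {...}}' into (ID, JSON string)."""
    row_id, json_str = line.strip().split(" ", 1)
    return row_id, json_str.strip()


def flatten_field1(spark, lines):
    """Return a DataFrame with ID plus one column per key under field1.

    `spark` is assumed to be an existing SparkSession and `lines` an
    iterable of raw text lines in the format shown in the question.
    """
    # Imported here so split_line stays usable without Spark installed.
    from pyspark.sql import functions as F

    df = spark.createDataFrame([split_line(l) for l in lines], ["ID", "JSON"])
    # Infer the nested schema from the JSON strings (runs on the driver).
    schema = spark.read.json(df.rdd.map(lambda r: r["JSON"])).schema
    # Turn the string column into a struct column using that schema.
    parsed = df.withColumn("JSON", F.from_json(F.col("JSON"), schema))
    # With a struct column, the star expansion yields one column per field.
    return parsed.select("ID", "JSON.field1.*")
```

Calling `flatten_field1(spark, lines).show()` should then print the ID column followed by field2, field3, and so on. The schema inference makes an extra pass over the data, so for very large files it may be worth inferring the schema from a sample instead.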