How to create a Spark-SQL dataframe from JSON file where data and schema are both listed

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("PySpark").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

file = sqlContext.read.json(json_file_path)
file.show()

Outputs:

+--------------------+--------------------+
|                data|              schema|
+--------------------+--------------------+
|[[The battery is ...|[[[index, integer...|
+--------------------+--------------------+

How do I extract the data using a schema I created myself? My schema code is:

from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType
schema = StructType([
    StructField('index', IntegerType(), True),
    StructField('content', StringType(), True),
    StructField('label', IntegerType(), True),
    StructField('label_1', StringType(), True ),
    StructField('label_2', StringType(), True ),
    StructField('label_3', IntegerType(), True ),
    StructField('label_4', IntegerType(), True )])

I have tried:

from pyspark.sql.functions import from_json

file.withColumn("data", from_json("data", schema))\
    .show()

But I receive the following error:

 cannot resolve 'from_json(`data`)' due to data type mismatch: argument 1 requires string type, however, '`data`' is of array<struct<content:string,index:bigint,label:bigint,label_1:string,label_2:string,label_3:double,label_4:timestamp>> type.;;

The read method has already inferred the schema behind the scenes.

Try running file.printSchema(); it should show more or less the schema that you want.

The way to unpack the data is to run:

import org.apache.spark.sql.functions.{col, explode}
file = file.select(explode(col("data")).as("exploded_data"))

If you want, you can take it a step further with:

file.select(file.col("exploded_data.*"))

This will flatten out the schema.

Disclaimer: this is Scala code; Python might need small adjustments.
