How to create a Spark-SQL DataFrame from a JSON file where the data and schema are both listed
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("PySpark").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
file = sqlContext.read.json(json_file_path)
file.show()
Outputs:
+--------------------+--------------------+
| data| schema|
+--------------------+--------------------+
|[[The battery is ...|[[[index, integer...|
+--------------------+--------------------+
How do I extract the data using my own created schema? My schema code is:
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType
schema = StructType([
StructField('index', IntegerType(), True),
StructField('content', StringType(), True),
StructField('label', IntegerType(), True),
StructField('label_1', StringType(), True ),
StructField('label_2', StringType(), True ),
StructField('label_3', IntegerType(), True ),
StructField('label_4', IntegerType(), True )])
I have tried:
from pyspark.sql.functions import from_json

file.withColumn("data", from_json("data", schema))\
.show()
But I receive the following error:
cannot resolve 'from_json(`data`)' due to data type mismatch: argument 1 requires string type, however, '`data`' is of array<struct<content:string,index:bigint,label:bigint,label_1:string,label_2:string,label_3:double,label_4:timestamp>> type.;;
The read method already recognized the schema. Try running file.printSchema() and it should show more or less the schema that you want.
The way to unpack the data column is to run:
file = file.select(explode(col("data")).as("exploded_data"))
If you want, you can take it to the next level with:
file.select(file.col("exploded_data.*"))
This will flatten out the schema.
Disclaimer: this is Scala code; Python might need tiny adjustments.