Pass schema from hdfs file while creating Spark DataFrame
I am trying to read a schema stored in a text file on HDFS and use it while creating a DataFrame.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on
jsonDF = spark.read.schema(schema).json('/path/test.json')
Since the schema is too big to define inside the code, can anyone suggest the best way to do this?
I tried the ways below, but they don't work.
schema = sc.wholeTextFiles("hdfs://path/sample.schema")
schema = spark.read.text('/path/sample.schema')
I haven't tested it with HDFS, but I assume it is similar to reading from a local file. The idea is to store the schema in a file as a dict and then parse it to create the desired schema. I took inspiration from here. Currently it lacks support for nullable, and I have not tested it with deeper levels of nested structs.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import json

spark = SparkSession.builder.appName('myPython').getOrCreate()

# Read the schema file: a JSON dict mapping column names to type names
with open("/path/schema_file", "r") as f:
    jdata = json.loads(f.read())

def get_type(v):
    if v == "StringType":
        return StringType()
    if v == "TimestampType":
        return TimestampType()
    if v == "IntegerType":
        return IntegerType()

def generate_schema(jdata, derived_schema):
    for k, v in sorted(jdata.items()):
        if isinstance(v, str):
            # Leaf field: map the type name string to a Spark type
            derived_schema.add(StructField(k, get_type(v), True))
        else:
            # Nested dict: build a nested StructType recursively
            added_schema = generate_schema(v, StructType([]))
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema

derived_schema = generate_schema(jdata, StructType([]))

from datetime import datetime
data = [("first", "the", datetime.utcnow(), ("as", 1))]
input_df = spark.createDataFrame(data, derived_schema)
input_df.printSchema()
With the file being:
{
    "col1" : "StringType",
    "col2" : "StringType",
    "col3" : "TimestampType",
    "col4" : {
        "col5" : "StringType",
        "col6" : "IntegerType"
    }
}
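The recursion above can be exercised without a Spark session. This pure-Python sketch (the helper names and the type-name mapping are my own, not Spark's) walks the same dict shape and records which type each leaf would map to, recursing into nested dicts exactly like generate_schema does:

```python
import json

# Hypothetical stand-ins for Spark's type objects, so the recursion
# can be demonstrated without a running Spark session.
TYPE_NAMES = {"StringType": "string",
              "TimestampType": "timestamp",
              "IntegerType": "integer"}

def build_schema(node):
    """Recursively turn the schema dict into a nested field list."""
    fields = []
    for name, value in sorted(node.items()):
        if isinstance(value, str):
            # Leaf field: map the type name to a simple label
            fields.append({"name": name, "type": TYPE_NAMES[value]})
        else:
            # Nested struct: recurse into the sub-dict
            fields.append({"name": name, "type": build_schema(value)})
    return fields

schema_text = '''{
  "col1": "StringType",
  "col3": "TimestampType",
  "col4": {"col5": "StringType", "col6": "IntegerType"}
}'''
schema = build_schema(json.loads(schema_text))
```

Swapping the label strings for real StringType()/TimestampType() instances and the dicts for StructField/StructType gives back the Spark version.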
I figured out how to do this.
1. Define the schema of the json file
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
2. Print the json output
print(schema.json())
3. Copy-paste the above output into the file sample.schema
4. In the code, recreate the schema as below
import json

schema_file = 'path/sample.schema'
schema_json = spark.read.text(schema_file).first()[0]
schema = StructType.fromJson(json.loads(schema_json))
5. Create a DataFrame using the above schema
jsonDF = spark.read.schema(schema).json('/path/test.json')
6. Insert the data from the DataFrame into a Hive table
jsonDF.write.mode("append").insertInto("hivetable")
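For reference, the text that schema.json() prints in step 2 uses Spark's standard JSON schema layout (type/fields/name/nullable/metadata). A quick pure-Python check, with the JSON written by hand here as my assumption of that layout, shows that the single-line file survives a json round trip with its nesting intact:

```python
import json

# A schema serialized the way StructType.json() emits it (the exact
# layout below is an assumption based on Spark's JSON representation).
schema_json = ('{"type":"struct","fields":['
               '{"name":"col1","type":"string","nullable":true,"metadata":{}},'
               '{"name":"col4","type":{"type":"struct","fields":['
               '{"name":"col5","type":"string","nullable":true,"metadata":{}}'
               ']},"nullable":true,"metadata":{}}]}')

parsed = json.loads(schema_json)
# StructType.fromJson(parsed) would rebuild the schema object; here we
# just verify the text round-trips through json without losing structure.
assert json.loads(json.dumps(parsed)) == parsed
names = [f["name"] for f in parsed["fields"]]
```

This is why writing the schema file as a single line (as printed by schema.json()) works well with spark.read.text(...).first()[0] in step 4.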
Referred to the article - https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/