
Pass schema from hdfs file while creating Spark DataFrame

I am trying to read a schema stored in a text file in HDFS and use it while creating a DataFrame.

schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on

jsonDF = spark.read.schema(schema).json('/path/test.json')

Since the schema is too big, I don't want to define it inside the code. Can anyone please suggest the best way to do this?

I tried the approaches below, but they don't work.

schema = sc.wholeTextFiles("hdfs://path/sample.schema")
schema = spark.read.text('/path/sample.schema')

I haven't tested it with HDFS, but I assume it is similar to reading from a local file. The idea is to store the schema in a file as a dict and then parse it to create the desired schema. I took inspiration from here. Currently it lacks support for nullable, and I have not tested it with deeper levels of nested structs.

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import json

spark = SparkSession.builder.appName('myPython').getOrCreate()

# Read the schema file: a JSON dict mapping column names to type names
with open("/path/schema_file", "r") as f:
    dictString = f.read()

derived_schema = StructType([])

jdata = json.loads(dictString)


def get_type(v):
    # Map a type name from the schema file to the corresponding Spark type
    if v == "StringType":
        return StringType()
    if v == "TimestampType":
        return TimestampType()
    if v == "IntegerType":
        return IntegerType()
    raise ValueError("Unsupported type: %s" % v)


def generate_schema(jdata, derived_schema):
    # Walk the dict: string values map to leaf types,
    # nested dicts become nested StructTypes
    for k, v in sorted(jdata.items()):
        if isinstance(v, str):
            derived_schema.add(StructField(k, get_type(v), True))
        else:
            added_schema = StructType([])
            added_schema = generate_schema(v, added_schema)
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema


derived_schema = generate_schema(jdata, derived_schema)

from datetime import datetime

# Sample row matching the derived schema; the list maps to the nested struct col4
data = [("first", "the", datetime.utcnow(), ["as", 1])]

input_df = spark.createDataFrame(data, derived_schema)

input_df.printSchema()

With the file being:

{
  "col1" : "StringType",
  "col2" : "StringType",
  "col3" : "TimestampType",
  "col4" : {
    "col5" : "StringType",
    "col6" : "IntegerType"
  }
}
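
If the schema file lives on HDFS rather than the local filesystem, a minimal sketch (untested, with a placeholder path) is to pull the whole file down as one string via wholeTextFiles and keep the parsing identical:

# Sketch: read the schema file from HDFS instead of the local filesystem.
# wholeTextFiles returns an RDD of (path, content) pairs, so first()[1]
# is the entire file content as a single string.
dictString = spark.sparkContext.wholeTextFiles("hdfs://path/schema_file").first()[1]
jdata = json.loads(dictString)
derived_schema = generate_schema(jdata, StructType([]))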

I figured out how to do this.

1. Define the schema of the json file

json_schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on

2. Print the json output

print(json_schema.json())
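
For reference, json() serializes the schema to Spark's single-line JSON representation, which looks roughly like this (abridged):

{"fields":[{"metadata":{},"name":"col1","nullable":true,"type":"string"}, ... ],"type":"struct"}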

3. Copy-paste the above output to a file sample.schema

4. In the code, recreate the schema as below

schema_file = 'path/sample.schema'
# The schema file is one line of JSON, so the first row holds the whole schema string
schema_json = spark.read.text(schema_file).first()[0]
schema = StructType.fromJson(json.loads(schema_json))

5. Create a DF using the above schema

jsonDF = spark.read.schema(schema).json('/path/test.json')

6. Insert the data from the DF into the Hive table

jsonDF.write.mode("append").insertInto("hivetable")
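
Putting steps 4-6 together, a minimal end-to-end sketch (paths and the table name are the placeholders used above):

import json
from pyspark.sql.types import StructType

# Step 4: the schema file is a single line of JSON, so the first row holds the full string
schema_json = spark.read.text('path/sample.schema').first()[0]
schema = StructType.fromJson(json.loads(schema_json))

# Step 5: apply the schema while reading the raw data
jsonDF = spark.read.schema(schema).json('/path/test.json')

# Step 6: insertInto resolves columns by position, so the column order must match the Hive table
jsonDF.write.mode("append").insertInto("hivetable")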

Referred to the article - https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
