
How to create a DataFrame with a user-defined schema in Spark

I want to create a DataFrame with a specified schema in Python. Here is the process I have followed so far.

  1. I have a Sample.parm file, where I have defined the schema like below: Account_type,string,True

  2. I have written a Python script, sample.py, that reads the sample.parm file, generates the schema from it, and then builds a DataFrame based on that user-defined schema.


import csv
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               DecimalType, DateType, ByteType, BooleanType,
                               ShortType, LongType, DoubleType, FloatType,
                               TimestampType)

def schema():
    # build a StructType from the field name, type name and nullable flag in sample.parm
    with open('<path>/sample.parm','r') as parm_file:
        reader=csv.reader(parm_file,delimiter=",")
        filteredSchema = []
        for fieldName in reader:
            if fieldName[1].lower() == "decimal":
               filteredSchema.append([fieldName[0], DecimalType(),fieldName[2]])
            elif fieldName[1].lower() == "string":
               filteredSchema.append([fieldName[0], StringType(),fieldName[2]])
            elif fieldName[1].lower() == "integer":
               filteredSchema.append([fieldName[0], IntegerType(),fieldName[2]])
            elif fieldName[1].lower() == "date":
               filteredSchema.append([fieldName[0], DateType(),fieldName[2]])
            elif fieldName[1].lower() == "byte":
               filteredSchema.append([fieldName[0], ByteType(),fieldName[2]])
            elif fieldName[1].lower() == "boolean":
               filteredSchema.append([fieldName[0], BooleanType(),fieldName[2]])
            elif fieldName[1].lower() == "short":
               filteredSchema.append([fieldName[0], ShortType(),fieldName[2]])
            elif fieldName[1].lower() == "long":
               filteredSchema.append([fieldName[0], LongType(),fieldName[2]])
            elif fieldName[1].lower() == "double":
               filteredSchema.append([fieldName[0], DoubleType(),fieldName[2]])
            elif fieldName[1].lower() == "float":
               filteredSchema.append([fieldName[0], FloatType(),fieldName[2]])
            elif fieldName[1].lower() == "timestamp":
               filteredSchema.append([fieldName[0], TimestampType(),fieldName[2]])
    struct_schema = [StructField(line[0], line[1], line[2]) for line in filteredSchema]
    schema = StructType(struct_schema)
    return schema

def create_dataframe(path):
    # read the tab-separated file, applying the user-defined schema
    val = spark.read.schema(schema()).csv(path, sep='\t')
    print(val.take(1))
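
For context, the two functions above would presumably be driven from a SparkSession; a rough sketch of such a driver is below (the session setup and the file path are assumptions, not part of the original post):

from pyspark.sql import SparkSession

# Hypothetical driver for the snippet above
spark = SparkSession.builder.appName("user-defined-schema").getOrCreate()
create_dataframe('<path>/sample_data.tsv')  # placeholder path to a tab-separated file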

But I am getting an error like: pyspark.sql.utils.IllegalArgumentException: u'Failed to convert the JSON string \'{"metadata":{},"name":"account_type","nullable":"True","type":"string"}\' to a field.'

Can anyone please help me figure this out? I appreciate your help.

I think the JSON build is not correct: the metadata is empty, and "type" and "fields" are missing. Please try the following JSON for your schema.

{"type":"struct","fields":[{"name":"account_type","type":"string","nullable":true,"metadata":{"name":"account_type","scale":0}}]}
