PySpark - Dynamically creating a schema from a JSON file

Question

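I am using Spark in a Databricks notebook to ingest some data from an API call.

I start by reading all the data from the API response into a dataframe called df. However, I only need a few of the columns from the API response, not all of them, and I have stored the required columns and their data types in a JSON file: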

    {
        "structure": [
            {
                "column_name": "column1",
                "column_type": "StringType()"
            },
            {
                "column_name": "column2",
                "column_type": "IntegerType()"
            },
            {
                "column_name": "column3",
                "column_type": "DateType()"
            },
            {
                "column_name": "column4",
                "column_type": "StringType()"
            }
        ]
    }

I then build the schema with the following code:

    with open("/dbfs/mnt/datalake/Dims/shema_json","r") as read_handle:
        file_contents = json.load(read_handle)

    struct_fields = []
    for column in file_contents.get("structure"):
        struct_fields.append(f'StructField("{column.get("column_name")}",{column.get("column_type")},True)')
    new_schema = StructType(struct_fields)

Finally, I want to use this code to create a dataframe that contains only the required columns, with the correct data types:

    df_staging = spark.createDataFrame(df.rdd,schema = new_schema)

However, when I do this I get an error saying 'str' object has no attribute 'name'.

Answer 1

To get a subset of columns from a dataframe, you can use a simple select combined with a cast:

    import importlib

    cols=[f"cast({c['column_name']} as {getattr(importlib.import_module('pyspark.sql.types'), c['column_type'].replace('()',''))().simpleString()})" for c in file_contents['structure']]

    df.selectExpr(*cols).show()
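
For context, the error in the question comes from passing strings to StructType: it expects StructField objects and reads each one's name attribute, which a str does not have. As a variation on the snippet above, the cast expressions can also be built from an explicit lookup table instead of importlib. This is only a sketch under assumptions: SQL_TYPE_NAMES and build_cast_expressions are hypothetical names, not part of PySpark, and the table covers only the three types that appear in the question's file.

```python
import json

# Assumed mapping from the type strings used in the JSON file to Spark SQL
# type names. Hypothetical helper table; extend it for other types.
SQL_TYPE_NAMES = {
    "StringType()": "string",
    "IntegerType()": "int",
    "DateType()": "date",
}


def build_cast_expressions(structure):
    """Return one 'cast(<col> as <type>) as <col>' string per entry."""
    expressions = []
    for column in structure:
        name = column["column_name"]
        type_key = column["column_type"]
        if type_key not in SQL_TYPE_NAMES:
            # Fail loudly instead of silently skipping an unknown type.
            raise ValueError(f"Unsupported column_type: {type_key}")
        sql_type = SQL_TYPE_NAMES[type_key]
        expressions.append(f"cast({name} as {sql_type}) as {name}")
    return expressions


# Same shape as the JSON file in the question, inlined so the sketch runs alone.
file_contents = json.loads("""
{"structure": [
    {"column_name": "column1", "column_type": "StringType()"},
    {"column_name": "column3", "column_type": "DateType()"}
]}
""")

cast_exprs = build_cast_expressions(file_contents["structure"])
print(cast_exprs)
```

With the dataframe from the question, the result would then be used as df.selectExpr(*cast_exprs). The "as <name>" alias is there to keep the original column names; without it the resulting columns are typically named after the cast expression itself.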
