I'm using Spark in Databricks notebooks to ingest data from an API call.
I start by reading the full API response into a DataFrame called df. However, I only need a few of the columns from the response, not all of them.
I store the required columns and their data types in a JSON file:
{
"structure": [
{
"column_name": "column1",
"column_type": "StringType()"
},
{
"column_name": "column2",
"column_type": "IntegerType()"
},
{
"column_name": "column3",
"column_type": "DateType()"
},
{
"column_name": "column4",
"column_type": "StringType()"
}
]
}
Then I build the schema using the following code:
import json

with open("/dbfs/mnt/datalake/Dims/shema_json", "r") as read_handle:
    file_contents = json.load(read_handle)

struct_fields = []
for column in file_contents.get("structure"):
    struct_fields.append(f'StructField("{column.get("column_name")}",{column.get("column_type")},True)')
new_schema = StructType(struct_fields)
Finally, I want to create a DataFrame containing only the required columns, with the correct data types:
df_staging = spark.createDataFrame(df.rdd, schema=new_schema)
But when I do this, I get an error saying 'str' object has no attribute 'name'.
To get a subset of columns from a DataFrame you can use a simple select combined with cast:
import importlib

cols = [
    f"cast({c['column_name']} as "
    f"{getattr(importlib.import_module('pyspark.sql.types'), c['column_type'].replace('()', ''))().simpleString()})"
    for c in file_contents['structure']
]
df.selectExpr(*cols).show()
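The importlib lookup can be avoided by mapping the JSON type strings directly to Spark SQL type names. A sketch of that variant (SQL_TYPE and the inlined file_contents are assumptions for illustration); the trailing alias keeps the original column name, which a bare cast(column1 as string) would otherwise rename:

```python
# Inlined here for illustration; in the original code this comes
# from json.load() on the schema file.
file_contents = {
    "structure": [
        {"column_name": "column1", "column_type": "StringType()"},
        {"column_name": "column2", "column_type": "IntegerType()"},
    ]
}

# Assumed mapping from the JSON type strings to Spark SQL type names.
SQL_TYPE = {
    "StringType()": "string",
    "IntegerType()": "int",
    "DateType()": "date",
}

# Build "cast(... as ...) as ..." expressions; the alias preserves
# the original column name after the cast.
cols = [
    f"cast({c['column_name']} as {SQL_TYPE[c['column_type']]}) as {c['column_name']}"
    for c in file_contents["structure"]
]
# df.selectExpr(*cols)
```

This both selects only the required columns and casts them in a single pass, without rebuilding the DataFrame from df.rdd.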