
Is there a function in pyspark dataframe that is similar to pandas.io.json.json_normalize

I would like to perform an operation similar to pandas.io.json.json_normalize in a PySpark DataFrame. Is there an equivalent function in Spark?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
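For reference, a minimal sketch of what json_normalize does in pandas (using the modern pandas.json_normalize alias; the sample data here is made up):

import pandas as pd

data = [
    {"id": 1, "info": {"a": False, "b": {"c": 1}}},
    {"id": 2, "info": {"a": True,  "b": {"c": 2}}},
]

# nested dicts become dotted column names: id, info.a, info.b.c
df = pd.json_normalize(data)
print(df.columns.tolist())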

Spark has a similar function, explode(), but it is not entirely identical.

Here is how explode works at a very high level.

>>> from pyspark.sql.functions import explode, col
>>> data = [([1, 2],)]  # one row holding an array column
>>> df = spark.createDataFrame(data, ['A'])
>>> df.show()
+------+
|     A|
+------+
|[1, 2]|
+------+

>>> df.select(explode(col('A')).alias('normalized')).show()
+----------+
|normalized|
+----------+
|         1|
|         2|
+----------+

On the other hand, you can convert the Spark DataFrame to a pandas DataFrame and back:

  • Call spark_df.toPandas(), apply json_normalize(), and then convert the result back to a Spark DataFrame.

  • To convert back to a Spark DataFrame, use spark.createDataFrame(pandas_df).

Please note that this back-and-forth solution is not ideal: calling toPandas() collects all records of the DataFrame to the driver (effectively a .collect()), which can cause out-of-memory errors on larger datasets.
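For concreteness, a minimal sketch of that round trip (spark_df and spark are assumed to already exist; this is untested illustration, not the library's own API for this):

import pandas as pd

# spark_df is a hypothetical Spark DataFrame with nested records;
# toPandas() pulls everything to the driver, so this only suits small data
pandas_df = spark_df.toPandas()

# flatten with pandas, then hand the result back to Spark
flat_pdf = pd.json_normalize(pandas_df.to_dict(orient="records"))
flat_sdf = spark.createDataFrame(flat_pdf)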

This related question provides more insight on using toPandas(): DF.topandas() throwing error in pyspark

Hope this helps and good luck!

There is no direct counterpart of json_normalize in PySpark, but Spark offers different options. If you have nested objects in a DataFrame like this:

one
|_a
|_..
two
|_b
|_..

you can select the child columns in Spark as follows:

import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("stackoverflow demo").getOrCreate()
columns = ['id', 'one', 'two']
vals = [
     (1, {"a": False}, {"b": True}),
     (2, {"a": True}, {"b": False})
]
df = spark.createDataFrame(vals, columns)
df.select("one.a", "two.b").show()
+-----+-----+
|    a|    b|
+-----+-----+
|false| true|
| true|false|
+-----+-----+

If you build a flattened list of all nested columns using a recursive "flatten" function (as in this answer), you get a flat column structure:

columns = flatten(df.schema)
df.select(columns)
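The recursive flatten helper itself is not reproduced here; a minimal sketch of what such a function can look like for StructType columns (the linked answer's version may differ in its details):

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # recursively collect dotted column names for all leaf fields
    columns = []
    for field in schema.fields:
        name = f"{prefix}.{field.name}" if prefix else field.name
        if isinstance(field.dataType, StructType):
            # descend into nested struct fields
            columns.extend(flatten(field.dataType, prefix=name))
        else:
            columns.append(name)
    return columns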

Pandas json_normalize() is really great and works perfectly in my Jupyter notebook, but I have trouble getting it to run with Kafka Structured Streaming. Should this solution also work with Spark Structured Streaming, or is that not possible?
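For what it's worth, toPandas() cannot be called on a streaming DataFrame directly, so the pandas round trip does not apply as-is; one common workaround (a rough sketch under the assumption of a micro-batch pipeline, untested) is foreachBatch, which hands each micro-batch to a function as a regular DataFrame:

import pandas as pd

def normalize_batch(batch_df, batch_id):
    # each micro-batch is an ordinary (non-streaming) DataFrame,
    # so the pandas round trip from above applies to it
    pdf = batch_df.toPandas()
    flat = pd.json_normalize(pdf.to_dict(orient="records"))
    print(flat.head())  # placeholder: write `flat` to the real sink here

# stream_df is a hypothetical streaming DataFrame read from Kafka
query = stream_df.writeStream.foreachBatch(normalize_batch).start()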
