
Is there a function in pyspark dataframe that is similar to pandas.io.json.json_normalize

I would like to perform an operation similar to pandas.io.json.json_normalize in a PySpark DataFrame. Is there an equivalent function in Spark?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
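For reference, a minimal sketch of what json_normalize does in pandas (using the modern pandas.json_normalize alias; the sample data here is made up):

import pandas as pd

data = [
    {"id": 1, "info": {"a": False, "b": {"c": 1}}},
    {"id": 2, "info": {"a": True,  "b": {"c": 2}}},
]

# nested dicts become dotted column names: id, info.a, info.b.c
df = pd.json_normalize(data)
print(df.columns.tolist())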

Spark has a similar function, explode(), but it is not entirely identical.

Here is how explode works at a very high level.

>>> from pyspark.sql.functions import explode, col
>>> data = [([1, 2],)]  # one row holding an array column
>>> df = spark.createDataFrame(data, ['A'])
>>> df.show()
+------+
|     A|
+------+
|[1, 2]|
+------+

>>> df.select(explode(col('A')).alias('normalized')).show()
+----------+
|normalized|
+----------+
|         1|
|         2|
+----------+

On the other hand, you can convert the Spark DataFrame to a pandas DataFrame and back:

  • Call spark_df.toPandas(), apply json_normalize(), and then convert the result back to a Spark DataFrame.

  • To convert back to a Spark DataFrame, use spark.createDataFrame(pandas_df).

Please note that this back-and-forth solution is not ideal: calling toPandas() collects all records of the DataFrame to the driver (effectively a .collect()), which can cause out-of-memory errors on larger datasets.
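For concreteness, a minimal sketch of that round trip (spark_df and spark are assumed to already exist; this is untested illustration, not the library's own API for this):

import pandas as pd

# spark_df is a hypothetical Spark DataFrame with nested records;
# toPandas() pulls everything to the driver, so this only suits small data
pandas_df = spark_df.toPandas()

# flatten with pandas, then hand the result back to Spark
flat_pdf = pd.json_normalize(pandas_df.to_dict(orient="records"))
flat_sdf = spark.createDataFrame(flat_pdf)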

This related question provides more insight on using toPandas(): DF.topandas() throwing error in pyspark

Hope this helps and good luck!

There is no direct counterpart of json_normalize in PySpark, but Spark offers different options. If you have nested objects in a DataFrame like this:

one
|_a
|_..
two
|_b
|_..

you can select the child columns in Spark as follows:

import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("stackoverflow demo").getOrCreate()
columns = ['id', 'one', 'two']
vals = [
     (1, {"a": False}, {"b": True}),
     (2, {"a": True}, {"b": False})
]
df = spark.createDataFrame(vals, columns)
df.select("one.a", "two.b").show()
+-----+-----+
|    a|    b|
+-----+-----+
|false| true|
| true|false|
+-----+-----+

If you build a flattened list of all nested columns using a recursive "flatten" function (as in this answer), you get a flat column structure:

columns = flatten(df.schema)
df.select(columns)
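The recursive flatten helper itself is not reproduced here; a minimal sketch of what such a function can look like for StructType columns (the linked answer's version may differ in its details):

from pyspark.sql.types import StructType

def flatten(schema, prefix=None):
    # recursively collect dotted column names for all leaf fields
    columns = []
    for field in schema.fields:
        name = f"{prefix}.{field.name}" if prefix else field.name
        if isinstance(field.dataType, StructType):
            # descend into nested struct fields
            columns.extend(flatten(field.dataType, prefix=name))
        else:
            columns.append(name)
    return columns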

Pandas json_normalize() is really great and works perfectly in my Jupyter notebook, but I have trouble getting it to run with Kafka Structured Streaming. Should this solution also work with Spark Structured Streaming, or is that not possible?
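For what it's worth, toPandas() cannot be called on a streaming DataFrame directly, so the pandas round trip does not apply as-is; one common workaround (a rough sketch under the assumption of a micro-batch pipeline, untested) is foreachBatch, which hands each micro-batch to a function as a regular DataFrame:

import pandas as pd

def normalize_batch(batch_df, batch_id):
    # each micro-batch is an ordinary (non-streaming) DataFrame,
    # so the pandas round trip from above applies to it
    pdf = batch_df.toPandas()
    flat = pd.json_normalize(pdf.to_dict(orient="records"))
    print(flat.head())  # placeholder: write `flat` to the real sink here

# stream_df is a hypothetical streaming DataFrame read from Kafka
query = stream_df.writeStream.foreachBatch(normalize_batch).start()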
