
Is there a function in pyspark dataframe that is similar to pandas.io.json.json_normalize

I would like to perform an operation similar to pandas.io.json.json_normalize on a pyspark dataframe. Is there an equivalent function in Spark?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html

Spark has a similar function, explode(), but it is not entirely identical.

Here is how explode works at a very high level:

>>> from pyspark.sql.functions import explode, col
>>> data = {'A': [1, 2]}
>>> df = spark.createDataFrame([data])   # wrap the dict in a list so Spark builds a single-row DataFrame
>>> df.show()
+------+
|     A|
+------+
|[1, 2]|
+------+

>>> df.select(explode(col('A')).alias('normalized')).show()
+----------+
|normalized|
+----------+
|         1|
|         2|
+----------+

On the other hand, you could convert the Spark DataFrame to a pandas DataFrame using:

  • spark_df.toPandas() --> leverage json_normalize() and then revert back to a Spark DataFrame.

  • To revert back to a Spark DataFrame you would use spark.createDataFrame(pandas_df).

Please note that this back-and-forth solution is not ideal: calling toPandas() collects all records of the DataFrame (effectively a .collect()) onto the driver, which can lead to memory errors when working with larger datasets.
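As a rough illustration, here is a minimal sketch of that round trip. It assumes an existing spark session (as in the snippets above); the records variable and its field names are made up for the example:

    import pandas as pd
    from pandas import json_normalize  # in older pandas: pandas.io.json.json_normalize

    # Hypothetical nested records, e.g. parsed from a JSON source.
    records = [
        {"id": 1, "info": {"name": "a", "score": 10}},
        {"id": 2, "info": {"name": "b", "score": 20}},
    ]

    # Flatten the nested structure with pandas; sep="_" avoids dots in column
    # names, which are awkward to reference in Spark.
    pandas_df = json_normalize(records, sep="_")   # columns: id, info_name, info_score

    # Convert the flattened pandas DataFrame back into a Spark DataFrame.
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.show()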

The link below provides more insight on using toPandas(): DF.topandas() throwing error in pyspark

Hope this helps and good luck!

There is no direct counterpart of json_normalize in PySpark, but Spark offers different options. If you have nested objects in a DataFrame like this

one
|_a
|_..
two
|_b
|_..

you can select the child columns in Spark as follows:

import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("stackoverflow demo").getOrCreate()
columns = ['id', 'one', 'two']
vals = [
     (1, {"a": False}, {"b": True}),
     (2, {"a": True}, {"b": False})
]
df = spark.createDataFrame(vals, columns)
df.select("one.a", "two.b").show()
+-----+-----+
|    a|    b|
+-----+-----+
|false| true|
| true|false|
+-----+-----+

If you build a flattened list of all nested columns using a recursive "flatten" function from this answer, you get a flat column structure:

columns = flatten(df.schema)
df.select(columns)
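The linked helper is not reproduced here; as a rough idea, a minimal sketch of such a recursive flatten function might look like the following. It handles StructType nesting only, and leaves arrays and maps (such as the map columns inferred from the dicts above) as-is:

    from pyspark.sql.types import StructType

    def flatten(schema, prefix=None):
        """Recursively collect fully qualified column names from a nested schema.

        Minimal sketch: flattens StructType fields only; other types are kept as-is.
        """
        columns = []
        for field in schema.fields:
            name = f"{prefix}.{field.name}" if prefix else field.name
            if isinstance(field.dataType, StructType):
                columns += flatten(field.dataType, prefix=name)
            else:
                columns.append(name)
        return columns

    # Usage with the DataFrame above:
    # columns = flatten(df.schema)
    # df.select(columns).show()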

Pandas json_normalize() is really great and it works perfectly in my Jupyter Notebook. But I have trouble getting it to run with Kafka Structured Streaming. Should this solution also work with Spark Streaming, or is that not possible?
