I have a column (array of strings) in a PySpark dataframe. How do I break up the array and create a separate row for every string item in it?

I have a PySpark dataframe:

df = spark.createDataFrame([
    ("u1", ['u1_row1', 'u1_row2', 'u1_row3']),
    ("u2", ['u2_row1', 'u2_row2']),
    ("u3", ['u3_row1']),
    ],
    ['user_id', 'col_1'])

df.printSchema()  # printSchema() prints by itself and returns None
df.show()

It looks like this:

+-------+--------------------+
|user_id|               col_1|
+-------+--------------------+
|     u1|[u1_row1, u1_row2...|
|     u2|  [u2_row1, u2_row2]|
|     u3|           [u3_row1]|
+-------+--------------------+

Now I want to explode the arrays so that every string item in an array gets its own row. It should look like this:

+-------+---------------------------+
|user_id|               col_1_values|
+-------+---------------------------+
|     u1|                    u1_row1|
|     u1|                    u1_row2|
|     u1|                    u1_row3|
|     u2|                    u2_row1|
|     u2|                    u2_row2|
|     u3|                    u3_row1|
+-------+---------------------------+

How can I achieve this?

import pyspark.sql.functions as F
df = df.withColumn('col_1_items', F.explode('col_1'))  # one row per array element

This worked for me.

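Alternatively, use explode_outer to keep rows whose arrays are empty: they come through with a null item instead of being dropped. For example: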

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("u1", ['u1_row1', 'u1_row2', 'u1_row3']),
    ("u2", ['u2_row1', 'u2_row2']),
    ("u3", []),
    ],
    ['user_id', 'col_1'])
df.show(truncate=False)
+-------+---------------------------+
|user_id|col_1                      |
+-------+---------------------------+
|u1     |[u1_row1, u1_row2, u1_row3]|
|u2     |[u2_row1, u2_row2]         |
|u3     |[]                         |
+-------+---------------------------+

explode = df.withColumn('col_1_items', F.explode('col_1'))
explode.show()
+-------+--------------------+-----------+
|user_id|               col_1|col_1_items|
+-------+--------------------+-----------+
|     u1|[u1_row1, u1_row2...|    u1_row1|
|     u1|[u1_row1, u1_row2...|    u1_row2|
|     u1|[u1_row1, u1_row2...|    u1_row3|
|     u2|  [u2_row1, u2_row2]|    u2_row1|
|     u2|  [u2_row1, u2_row2]|    u2_row2|
+-------+--------------------+-----------+

explode_outer = df.withColumn('col_1_items', F.explode_outer('col_1'))
explode_outer.show()
+-------+--------------------+-----------+
|user_id|               col_1|col_1_items|
+-------+--------------------+-----------+
|     u1|[u1_row1, u1_row2...|    u1_row1|
|     u1|[u1_row1, u1_row2...|    u1_row2|
|     u1|[u1_row1, u1_row2...|    u1_row3|
|     u2|  [u2_row1, u2_row2]|    u2_row1|
|     u2|  [u2_row1, u2_row2]|    u2_row2|
|     u3|                  []|       null|
+-------+--------------------+-----------+
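
The difference between the two outputs above can also be modelled without a Spark session. The following is only an illustrative sketch in plain Python, not PySpark code: the helper names `explode_rows` and `explode_outer_rows` are hypothetical, and unlike the `withColumn` calls above they return only the `user_id` and the item, not the original array column. It assumes the documented behaviour that `explode` skips null or empty arrays, while `explode_outer` emits a single row with a null item for them.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# A row is (user_id, array of strings); the array may be empty or None.
Row = Tuple[str, Optional[List[str]]]


def explode_rows(rows: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Model of explode: one output row per item; empty/None arrays yield nothing."""
    for user_id, items in rows:
        for item in items or []:
            yield user_id, item


def explode_outer_rows(rows: Iterable[Row]) -> Iterator[Tuple[str, Optional[str]]]:
    """Model of explode_outer: empty/None arrays yield one row with a None item."""
    for user_id, items in rows:
        if not items:
            yield user_id, None
        else:
            for item in items:
                yield user_id, item


rows = [
    ("u1", ["u1_row1", "u1_row2", "u1_row3"]),
    ("u2", ["u2_row1", "u2_row2"]),
    ("u3", []),
]

for user_id, item in explode_outer_rows(rows):
    print(user_id, item)
```

With this input, `explode_rows` produces five rows and drops `u3` entirely, whereas `explode_outer_rows` produces six rows, the last being `u3` paired with `None`.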
