I have a column (an array of strings) in a PySpark DataFrame. How do I break up the array and create a separate row for each string item in it?
I have a PySpark DataFrame:
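# Assumes an active SparkSession named `spark` (as in the pyspark shell)
df = spark.createDataFrame([
    ("u1", ['u1_row1', 'u1_row2', 'u1_row3']),
    ("u2", ['u2_row1', 'u2_row2']),
    ("u3", ['u3_row1']),
],
['user_id', 'col_1'])
df.printSchema()  # prints the schema itself and returns None, so no print() is needed
df.show()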
It looks like this:
+-------+--------------------+
|user_id|               col_1|
+-------+--------------------+
|     u1|[u1_row1, u1_row2...|
|     u2|  [u2_row1, u2_row2]|
|     u3|           [u3_row1]|
+-------+--------------------+
Now I want to split the arrays so that each string item in an array gets its own row. The result should look like this:
+-------+---------------------------+
|user_id|               col_1_values|
+-------+---------------------------+
|     u1|                    u1_row1|
|     u1|                    u1_row2|
|     u1|                    u1_row3|
|     u2|                    u2_row1|
|     u2|                    u2_row2|
|     u3|                    u3_row1|
+-------+---------------------------+
How can I achieve this?
import pyspark.sql.functions as F
df = df.withColumn('col_1_items', F.explode('col_1'))
This worked for me.
Alternatively, use explode_outer, which keeps rows that have empty arrays. For example:
import pyspark.sql.functions as F
df = spark.createDataFrame([
("u1", ['u1_row1', 'u1_row2', 'u1_row3']),
("u2", ['u2_row1', 'u2_row2']),
("u3", []),
],
['user_id', 'col_1'])
df.show()
+-------+---------------------------+
|user_id|col_1 |
+-------+---------------------------+
|u1 |[u1_row1, u1_row2, u1_row3]|
|u2 |[u2_row1, u2_row2] |
|u3 |[] |
+-------+---------------------------+
# explode drops the u3 row because its array is empty
explode = df.withColumn('col_1_items', F.explode('col_1'))
explode.show()
+-------+--------------------+-----------+
|user_id|               col_1|col_1_items|
+-------+--------------------+-----------+
|     u1|[u1_row1, u1_row2...|    u1_row1|
|     u1|[u1_row1, u1_row2...|    u1_row2|
|     u1|[u1_row1, u1_row2...|    u1_row3|
|     u2|  [u2_row1, u2_row2]|    u2_row1|
|     u2|  [u2_row1, u2_row2]|    u2_row2|
+-------+--------------------+-----------+
# explode_outer keeps the u3 row and fills col_1_items with null
explode_outer = df.withColumn('col_1_items', F.explode_outer('col_1'))
explode_outer.show()
+-------+--------------------+-----------+
|user_id|               col_1|col_1_items|
+-------+--------------------+-----------+
|     u1|[u1_row1, u1_row2...|    u1_row1|
|     u1|[u1_row1, u1_row2...|    u1_row2|
|     u1|[u1_row1, u1_row2...|    u1_row3|
|     u2|  [u2_row1, u2_row2]|    u2_row1|
|     u2|  [u2_row1, u2_row2]|    u2_row2|
|     u3|                  []|       null|
+-------+--------------------+-----------+