Splitting a string in a PySpark DataFrame

Question

I want to split a column in a PySpark DataFrame. The column (string type) looks like this:

[{"quantity":25,"type":"coins","balance":35}]
[{"balance":40,"type":"coins","quantity":25}]
[{"quantity":2,"type":"column_breaker","balance":2},{"quantity":2,"type":"row_breaker","balance":2},{"quantity":2,"type":"single_block_breaker","balance":2},{"quantity":1,"type":"rainbow","balance":1},{"quantity":135,"type":"coins","balance":140}]

So some rows have a single set of "quantity, type, balance", while others have several such entries. I tried to treat the column as JSON and split it:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField('balance', StringType(), True),
    StructField('type', StringType(), True),
    StructField('quantity', StringType(), True),
])

temp = (merger
        .withColumn("data", from_json("items", schema))
        .select("items", col('data.*')))
display(temp)

But this only splits the rows that contain a single set. I would like output like this:

balance|quantity|type
   35  |   25   |coins
   40  |   25   |coins
.......

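That is, a row with one set should yield one observation, and a row with several sets should yield several observations stacked vertically.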

Also, after splitting into multiple rows, how can I identify each observation? Say I have another variable, ID; how do I assign the ID?

Answer 1

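If a row contains several JSON objects, one trick is to replace the commas between objects with newlines, split on the newlines, and then use the explode function to turn each piece into its own row. So for a DataFrame like this: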

>>> df.show()
+-----------------+
|            items|
+-----------------+
|         {"a": 1}|
|{"a": 2},{"a": 3}|
+-----------------+

This code will do the job:

>>> from pyspark.sql.types import ArrayType, StringType
>>> from pyspark.sql.functions import udf, explode
>>> split_jsons = lambda jsons: jsons.replace('},{', '}\n{').split('\n')
>>> df.withColumn('one_json_per_row', udf(split_jsons, ArrayType(StringType()))('items')) \
...    .select(explode('one_json_per_row').alias('item')).show()
+--------+
|    item|
+--------+
|{"a": 1}|
|{"a": 2}|
|{"a": 3}|
+--------+

After that you can apply your regular code to each row.
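One caveat with the replace-and-split trick is that it splits on every `},{`, including one that happens to sit inside a string value, and it misses objects separated by `}, {`. The sketch below is not part of the answer; it shows a plain-Python alternative that lets the `json` module find the object boundaries. The names `split_json_objects` and `split_naive` are hypothetical, and the input is assumed to be either a JSON array or bare comma-separated objects.

```python
import json


def split_json_objects(items):
    """Split a string holding one or more JSON objects into one JSON string per object."""
    text = items.strip()
    # Assumption: the input is either a JSON array "[{...},{...}]" or bare
    # comma-separated objects "{...},{...}"; wrap the latter so it parses as an array.
    if not text.startswith("["):
        text = "[" + text + "]"
    # json.loads understands quoting, so a "},{" inside a string value is not split.
    return [json.dumps(obj) for obj in json.loads(text)]


def split_naive(items):
    """The replace-and-split trick from the answer, kept here for comparison."""
    return items.replace("},{", "}\n{").split("\n")


print(split_json_objects('{"a": 2},{"a": 3}'))  # ['{"a": 2}', '{"a": 3}']
print(split_naive('{"a": "x},{y"}'))            # ['{"a": "x}', '{y"}']
print(split_json_objects('{"a": "x},{y"}'))     # ['{"a": "x},{y"}']
```

Because each object is re-serialized with `json.dumps`, whitespace and non-ASCII escaping may differ from the input text. A function like this could be wrapped in `udf(..., ArrayType(StringType()))` in the same way as the lambda above.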

Answer 2

You can use the json library together with rdd.flatMap() to parse the JSON array string and explode it into multiple rows:

import json

from pyspark.sql import Row
from pyspark.sql.types import StringType, StructField, StructType

data = [("[{\"quantity\":25,\"type\":\"coins\",\"balance\":35}]",),
         ("[{\"balance\":40,\"type\":\"coins\",\"quantity\":25}]",),
    ("[{\"quantity\":2,\"type\":\"column_breaker\",\"balance\":2},{\"quantity\":2,\"type\":\"row_breaker\",\"balance\":2},{\"quantity\":2,\"type\":\"single_block_breaker\",\"balance\":2},{\"quantity\":1,\"type\":\"rainbow\",\"balance\":1},{\"quantity\":135,\"type\":\"coins\",\"balance\":140}]",)]

schema = StructType([StructField("items", StringType(), True)])
df = spark.createDataFrame(data,schema)

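def transformRow(row):
    # row[0] is the JSON array string; parse it into a list of dicts
    jsonObj = json.loads(row[0])
    # build one Row per JSON object so that flatMap emits one output row for each
    rows = [Row(**item) for item in jsonObj]
    return rows

# flatMap flattens the per-input lists into a single RDD of Rows
df.rdd.flatMap(transformRow).toDF().show()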

Output

+-------+--------+--------------------+
|balance|quantity|                type|
+-------+--------+--------------------+
|     35|      25|               coins|
|     40|      25|               coins|
|      2|       2|      column_breaker|
|      2|       2|         row_breaker|
|      2|       2|single_block_breaker|
|      1|       1|             rainbow|
|    140|     135|               coins|
+-------+--------+--------------------+
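The question also asks how to keep an ID on each observation after the split. A minimal sketch of one way to do that with the same flatMap idea follows; it is not from the answer. The function `expand_row`, the `id` and `position` fields, and the sample id value 7 are all hypothetical, and the core logic is plain Python so it can be tried without a Spark session.

```python
import json


def expand_row(record_id, items_json):
    """Return one dict per JSON object in items_json, each tagged with record_id."""
    expanded = []
    # position records where the entry sat inside its original array
    for position, item in enumerate(json.loads(items_json)):
        # keys from the JSON object come last, so they win on a name clash
        expanded.append({"id": record_id, "position": position, **item})
    return expanded


rows = expand_row(
    7,  # hypothetical ID value taken from the DataFrame's ID column
    '[{"quantity":2,"type":"row_breaker","balance":2},'
    '{"quantity":135,"type":"coins","balance":140}]',
)
for row in rows:
    print(row)

# With Spark available, and assuming the DataFrame has columns `id` and `items`,
# the function could be plugged into flatMap along these lines:
#   df.rdd.flatMap(lambda r: [Row(**d) for d in expand_row(r["id"], r["items"])]).toDF()
```

Every output row carries the ID of the input row it came from, and `id` together with `position` identifies a single observation. Spark's built-in `from_json` with an `ArrayType` of structs followed by `explode` may be an alternative that avoids the RDD round trip; that variant is not shown here.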


Answer 1
0 2018-12-03 07:49:22

Answer 2
0 2018-12-04 05:28:33
