Get counts of distinct individual row values in a PySpark DataFrame

Question

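I am trying to split the comma-separated values in a string column into individual values and count how often each individual value occurs.

My data is formatted as follows: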

+--------------------+
|                tags|
+--------------------+
|cult, horror, got...|
|            violence|
|            romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
|            violence|
|revenge, neo noir...|
+--------------------+

I would like the result to look like this:

+--------------------+-----+
|                tags|count|
+--------------------+-----+
|cult                |    4|
|horror              |   10|
|goth                |    4|
|violence            |   30|
...

The code I tried, which did not work, is:

data.select('tags').groupby('tags').count().show(10)

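I also tried the countDistinct function, which did not work either.

I think I need a function that splits the values on commas and then lists them, but I am not sure how to do that.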

Answer 1

You can split the string with split() and then use explode(). Finally, group by the value and count:

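from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session, or create one when running as a standalone script.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data=[
    ["cult,horror"],
    ["cult,comedy"],
    ["romantic,comedy"],
    ["thriler,horror,comedy"],
], schema=["tags"])

# Split each string into an array of tags, then emit one row per array element.
df = df \
  .withColumn("tags", F.split("tags", pattern=",")) \
  .withColumn("tags", F.explode("tags"))

# Count how many rows carry each tag.
df = df.groupBy("tags").count()

df.show(truncate=False)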

[Out]:
+--------+-----+
|tags    |count|
+--------+-----+
|romantic|1    |
|thriler |1    |
|horror  |2    |
|cult    |2    |
|comedy  |3    |
+--------+-----+

Solution 1
0 Accepted 2022-11-17 05:29:19