Find the positions of all occurrences of a column value in a DataFrame
df = spark.createDataFrame([("A", "X-X-------------------------------X--X---XX-X--X-------")],["id", "value"])
From the above DataFrame, find the positions of all occurrences of X in the value column.

Expected output:

id | value |
---|---|
A | [1, 3, 35, 38, 42, 43, 45, 48] |
You can use a custom UDF to achieve this, for example:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.ArrayType(T.IntegerType()))
def udf_val_indexes(str_val):
    # Collect the 1-based positions of every "X" in the string
    indexes = []
    for index, val in enumerate(str_val):
        if val == "X":
            indexes.append(index + 1)
    return indexes
df.withColumn("value",udf_val_indexes(F.col("value"))).show(truncate=False)
+---+------------------------------+
|id |value |
+---+------------------------------+
|A |[1, 3, 35, 38, 42, 43, 45, 48]|
+---+------------------------------+
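If you want to avoid the serialization overhead of a Python UDF, the same result can be computed with built-in higher-order functions. A minimal sketch, assuming Spark 3.1+ (where `pyspark.sql.functions.transform` and `filter` accept Python lambdas):

from pyspark.sql import functions as F

# Split into single characters, map each "X" to its 1-based index
# (null otherwise), then drop the nulls. Assumes Spark 3.1+.
chars = F.split(F.col("value"), "")
positions = F.filter(
    F.transform(chars, lambda c, i: F.when(c == "X", i + 1)),
    lambda x: x.isNotNull(),
)
df.withColumn("value", positions).show(truncate=False)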
Alternatively, split the string on the character you are searching for and, with the help of posexplode, take a running sum of the segment lengths to recover the indexes, then group them back into a single row as shown below.
Note: the order by clause is what keeps the indexes in order.
from pyspark.sql import functions as F
from pyspark.sql import Window

(
    df.withColumn("val_split", F.split("value", "X"))
    .select(
        F.col("id"),
        F.posexplode("val_split"),
    )
    # The segment after the last "X" has no "X" following it, so drop it
    .withColumn("row_pos_to_exclude", F.max("pos").over(Window.partitionBy("id")))
    .filter(F.col("pos") != F.col("row_pos_to_exclude"))
    # Each segment length + 1 is the distance to the next "X"
    .withColumn("val_split_len", F.length("col") + 1)
    # Running sum of those distances yields the 1-based positions
    .withColumn(
        "val_split_len",
        F.sum("val_split_len").over(
            Window.partitionBy("id")
            .orderBy("pos")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow)
        ),
    )
    # Collect the positions in order of "pos"
    .withColumn(
        "value",
        F.collect_list("val_split_len").over(
            Window.partitionBy("id").orderBy("pos")
        ),
    )
    # Keep the longest (i.e. complete) list per id
    .groupBy("id")
    .agg(F.max("value").alias("value"))
).show(truncate=False)
+---+------------------------------+
|id |value |
+---+------------------------------+
|A |[1, 3, 35, 38, 42, 43, 45, 48]|
+---+------------------------------+
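To see why the running sum works, here is the same reasoning in plain Python (a standalone illustration, not part of the Spark answer): each segment between consecutive "X"s contributes its length plus one, and the cumulative totals are exactly the 1-based positions.

s = "X-X-------------------------------X--X---XX-X--X-------"
segments = s.split("X")[:-1]   # drop the trailing segment after the last "X"
positions, running = [], 0
for seg in segments:
    running += len(seg) + 1    # length of the gap plus the "X" itself
    positions.append(running)
print(positions)               # [1, 3, 35, 38, 42, 43, 45, 48]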
Let me know if this works for you.