Find the positions of all occurrences of a column value in a DataFrame
df = spark.createDataFrame([("A", "X-X-------------------------------X--X---XX-X--X-------")],["id", "value"])
From the above DataFrame, find the positions of all occurrences of X in the value column.

Expected output:

id | value |
---|---|
A | [1, 3, 35, 38, 42, 43, 45, 48] |
You can use a custom UDF to achieve this, for example:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.ArrayType(T.IntegerType()))
def udf_val_indexes(str_val):
    # Collect the 1-based positions of every "X" in the string
    indexes = []
    for index, val in enumerate(str_val):
        if val == "X":
            indexes.append(index + 1)
    return indexes
df.withColumn("value",udf_val_indexes(F.col("value"))).show(truncate=False)
+---+------------------------------+
|id |value |
+---+------------------------------+
|A |[1, 3, 35, 38, 42, 43, 45, 48]|
+---+------------------------------+
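If you want to avoid the serialization overhead of a Python UDF, the same result can be computed with built-in higher-order functions. A minimal sketch, assuming Spark 3.1+ (where `pyspark.sql.functions.transform` and `filter` accept Python lambdas):

from pyspark.sql import functions as F

# Split into single characters, map each "X" to its 1-based index
# (null otherwise), then drop the nulls. Assumes Spark 3.1+.
chars = F.split(F.col("value"), "")
positions = F.filter(
    F.transform(chars, lambda c, i: F.when(c == "X", i + 1)),
    lambda x: x.isNotNull(),
)
df.withColumn("value", positions).show(truncate=False)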
Alternatively, split the string on the character you are searching for and, with the help of posexplode, take a running sum of the segment lengths to recover the indexes, then group them back into a single row as shown below.
Note: the order by clause is what keeps the indexes in order.
from pyspark.sql import functions as F
from pyspark.sql import Window

(
    df.withColumn("val_split", F.split("value", "X"))
    .select(
        F.col("id"),
        F.posexplode("val_split"),
    )
    # The segment after the last "X" has no "X" following it, so drop it
    .withColumn("row_pos_to_exclude", F.max("pos").over(Window.partitionBy("id")))
    .filter(F.col("pos") != F.col("row_pos_to_exclude"))
    # Each segment length + 1 is the distance to the next "X"
    .withColumn("val_split_len", F.length("col") + 1)
    # Running sum of those distances yields the 1-based positions
    .withColumn(
        "val_split_len",
        F.sum("val_split_len").over(
            Window.partitionBy("id")
            .orderBy("pos")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow)
        ),
    )
    # Collect the positions in order of "pos"
    .withColumn(
        "value",
        F.collect_list("val_split_len").over(
            Window.partitionBy("id").orderBy("pos")
        ),
    )
    # Keep the longest (i.e. complete) list per id
    .groupBy("id")
    .agg(F.max("value").alias("value"))
).show(truncate=False)
+---+------------------------------+
|id |value |
+---+------------------------------+
|A |[1, 3, 35, 38, 42, 43, 45, 48]|
+---+------------------------------+
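To see why the running sum works, here is the same reasoning in plain Python (a standalone illustration, not part of the Spark answer): each segment between consecutive "X"s contributes its length plus one, and the cumulative totals are exactly the 1-based positions.

s = "X-X-------------------------------X--X---XX-X--X-------"
segments = s.split("X")[:-1]   # drop the trailing segment after the last "X"
positions, running = [], 0
for seg in segments:
    running += len(seg) + 1    # length of the gap plus the "X" itself
    positions.append(running)
print(positions)               # [1, 3, 35, 38, 42, 43, 45, 48]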
Let me know if this works for you.