How to create a new array of substrings from a string array column in a Spark dataframe

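I have a Spark dataframe. One of its columns is an array type holding text strings of varying lengths. I am looking for a way to add a new column that is an array of the unique leftmost 8 characters of those strings.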

df.printSchema()

root
(...)
 |-- arr_agent: array (nullable = true)
 |    |-- element: string (containsNull = true)

Sample data from the column "arr_agent":

["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF"]

What I need to have in the new column:

["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF"]

I have tried defining a udf that does this for me.

from pyspark.sql import functions as F
from pyspark.sql import types as T

def make_list_of_unique_prefixes(text_array, prefix_length=8):
    out_arr = set(t[0:prefix_length] for t in text_array)
    return(out_arr)

make_list_of_unique_prefixes_udf = F.udf(lambda x,y=8: make_list_of_unique_prefixes(x,y), T.ArrayType(T.StringType()))

But then calling:

df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") ))) 

throws the error AnalysisException: grouping expressions sequence is empty.

Any hints would be greatly appreciated. Thanks

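You can solve this with the higher-order functions available in Spark 2.4+: use transform with substring to take the prefix of each element, then apply array_distinct to keep only the unique values: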

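from pyspark.sql import functions as F

n = 8  # prefix length
# transform() takes the first n characters of each element (SQL substring is 1-based);
# array_distinct() then drops the duplicates
out = df.withColumn("New", F.expr(f"array_distinct(transform(arr_agent, x -> substring(x, 1, {n})))"))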

out.show(truncate=False)

+-----------------------------------------------------+----------------------------------------+
|arr_agent                                            |New                                     |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A]                              |[NRCANL2A]                              |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A]                    |[UTRONL2U, BKRBNL2A]                    |
|[NRCANL2A]                                           |[NRCANL2A]                              |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A]                    |[UTRONL2U, BKRBNL2A]                    |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF]                 |[MQBFDEFF]                              |
+-----------------------------------------------------+----------------------------------------+
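As for the original error: collect_set is an aggregate function, so using it inside withColumn without a groupBy or a window is most likely what triggers "grouping expressions sequence is empty". If you would rather keep the UDF approach, a minimal sketch is to drop collect_set and have the function return a list instead of a set. The helper name add_prefix_column and its source/target parameters are hypothetical, introduced here only for illustration; the built-in functions above should normally be faster than a Python UDF.

```python
def make_list_of_unique_prefixes(text_array, prefix_length=8):
    """Return the distinct prefixes of the given strings, in first-seen order."""
    if text_array is None:  # the array column is nullable
        return None
    # Assumption: null elements are skipped (the SQL version would keep them as null)
    prefixes = (t[:prefix_length] for t in text_array if t is not None)
    # dict.fromkeys de-duplicates while keeping insertion order (Python 3.7+)
    return list(dict.fromkeys(prefixes))


def add_prefix_column(df, source="arr_agent", target="arr_prefix8s", prefix_length=8):
    """Hypothetical helper: add `target` to `df` with a row-wise UDF, no aggregation."""
    # Imported here so the pure-Python helper above works without a Spark installation
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    prefixes_udf = F.udf(
        lambda arr: make_list_of_unique_prefixes(arr, prefix_length),
        T.ArrayType(T.StringType()),
    )
    # The UDF already works on one array per row, so no collect_set is needed
    return df.withColumn(target, prefixes_udf(F.col(source)))
```

Calling add_prefix_column(df) on the dataframe from the question should then add the "arr_prefix8s" column directly.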
