Convert array to string in pyspark
Here is my actual code, which works fine:
df_train_taxrate = (
    df_train.groupby(
        'Company_code_BUKRS',
        'Vendor_Customer_Code_WT_ACCO',
        'Expense_GL_HKONT',
        'PAN_J_1IPANNO',
        'HSN_SAC_HSN_SAC'
    ).agg(
        f.collect_set('Section_WT_QSCOD').alias('Unique_Sectio_Code'),
        f.collect_set('WHT_rate_QSATZ').alias('Unique_Wtax_rate')
    )
)
The problem is that 'Section_WT_QSCOD' and 'WHT_rate_QSATZ' are arrays, and when I try to convert those arrays to strings, I get an error.
My code:
df_train_taxrate = df_train.groupby(
    'Company_code_BUKRS',
    'Vendor_Customer_Code_WT_ACCO',
    'Expense_GL_HKONT',
    'PAN_J_1IPANNO',
    'HSN_SAC_HSN_SAC'
).agg(
    f.collect_set('Section_WT_QSCOD').withColumn(
        'Section_WT_QSCOD',
        concat_ws(',', 'Unique_Sectio_Code')
    ),
    f.collect_set('WHT_rate_QSATZ').withColumn(
        'WHT_rate_QSATZ',
        concat_ws(',', 'Unique_W_tax_rate')
    )
)
Error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
You need to use array_join instead. The TypeError happens because withColumn is a DataFrame method, not a Column method: f.collect_set(...) returns a Column, and attribute access on a Column produces another Column (a nested-field reference), which is then not callable.
Sample data:
import pyspark.sql.functions as F
data = [
    ('a', 'x1'),
    ('a', 'x2'),
    ('a', 'x3'),
    ('b', 'y1'),
    ('b', 'y2')
]
df = spark.createDataFrame(data, ['id', 'val'])
Solution:
result = (
    df
    .groupby('id')
    .agg(
        F.collect_set(F.col('val')).alias('arr_of_vals')
    )
    .withColumn(
        'arr_to_string',
        F.array_join(F.col('arr_of_vals'), ',')
    )
)
result
DataFrame[id: string, arr_of_vals: array<string>, arr_to_string: string]
result.show(truncate=False)
+---+------------+-------------+
|id |arr_of_vals |arr_to_string|
+---+------------+-------------+
|b |[y2, y1] |y2,y1 |
|a |[x1, x3, x2]|x1,x3,x2 |
+---+------------+-------------+