
Concat all columns with pyspark

I need to write a .txt file from a dataframe. I read that, to do that, I have to have a dataframe with a single column.

I'm trying to do it like this:

dataframe = dataframe.select(concat(*dataframe.columns).alias("Data"))

But it doesn't work; I think the unpacking of the columns causes some problems. And I don't want to pass all the column names explicitly. Does anyone have an idea? Thank you.
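For context, a minimal sketch of the end goal, assuming the plan is to write the single string column with DataFrameWriter.text (the path below is illustrative): the text source accepts exactly one string column, and Spark writes a directory of part files rather than a single .txt file.

# Sketch: writing a one-column dataframe as plain text.
# Assumes `dataframe` already holds a single string column (e.g. "Data").
dataframe.write.text("/tmp/dataframe_txt")

# Equivalent via the generic writer API:
# dataframe.write.format("text").save("/tmp/dataframe_txt")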

This is the output after updating the code thanks to @Jonathan Lam:

from pyspark.sql.functions import col, concat

dataframe.show(truncate=False)
print(*[col(column) for column in dataframe.columns])
dataframe = dataframe.select(concat(*[col(column) for column in dataframe.columns]).alias("Data"))
dataframe.show(truncate=False)

[screenshot of the output]

Finally I found the problem: when concat meets a null value, the whole concatenated value becomes null. So I have to find a way to change that.
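One way to handle that, as a minimal sketch assuming nulls should simply be treated as empty strings: concat_ws ignores null inputs, and coalesce can substitute an empty string per column (the alias "Data" is kept from the code above).

from pyspark.sql.functions import coalesce, col, concat, concat_ws, lit

# Option 1: concat_ws with an empty separator simply skips null columns.
dataframe = dataframe.select(
    concat_ws("", *[col(c) for c in dataframe.columns]).alias("Data")
)

# Option 2: replace nulls with "" per column, then concat as before.
# dataframe = dataframe.select(
#     concat(*[coalesce(col(c), lit("")) for c in dataframe.columns]).alias("Data")
# )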

I think your approach should be correct; just to test:

df = spark.createDataFrame(
    [("1", "2", "3")],
    schema=['col1', 'col2', 'col3']
)

df.show(3, False)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |2   |3   |
+----+----+----+

But I am using the pyspark api col:

from pyspark.sql import functions as func

df.select(
    func.concat(*[func.col(col) for col in df.columns]).alias('concat')
).show(10, False)
+------+
|concat|
+------+
|123   |
+------+
