Merge multiple columns into one column in pyspark dataframe using python
I need to merge multiple columns of a dataframe into a single column, with a list (or tuple) as the value of that column, using pyspark in Python.
Input dataframe:
+-------+-------+-------+-------+-------+
| name |mark1 |mark2 |mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim | 20 | 30 | 40 | "C" |
+-------+-------+-------+-------+-------+
| Bill | 30 | 35 | 45 | "A" |
+-------+-------+-------+-------+-------+
| Kim | 25 | 36 | 42 | "B" |
+-------+-------+-------+-------+-------+
The output dataframe should be:
+-------+-----------------+
| name |marks |
+-------+-----------------+
| Jim | [20,30,40,"C"] |
+-------+-----------------+
| Bill | [30,35,45,"A"] |
+-------+-----------------+
| Kim | [25,36,42,"B"] |
+-------+-----------------+
Have a look at this documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
If this is still relevant: the string values can be encoded as float substitutes with a StringIndexer, since VectorAssembler only accepts numeric inputs.
The columns can also be merged with Spark's array function:
import pyspark.sql.functions as f
columns = [f.col("mark1"), ...]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You might have to change the type of the entries in order for the merge to be successful.
You can also do it like this:
from pyspark.sql.functions import *
df.select('name',
          concat(
              col("mark1"), lit(","),
              col("mark2"), lit(","),
              col("mark3"), lit(","),
              col("Grade")
          ).alias('marks'))
The lit function can be used to add the [] brackets if they are needed:
from pyspark.sql.functions import *
df.select('name',
          concat(lit("["),
                 col("mark1"), lit(","),
                 col("mark2"), lit(","),
                 col("mark3"), lit(","),
                 col("Grade"), lit("]")
          ).alias('marks'))