Pyspark: concat columns whose names are given in another column
I have 2 dataframes:
df1 =
+------------+
|questions   |
+------------+
|[Q1, Q2]    |
|[Q4, Q6, Q7]|
|...         |
+------------+
df2 =
+---+---+---+---+---+---+---+---+
| Q1| Q2| Q3| Q4| Q6| Q7|...|Q25|
+---+---+---+---+---+---+---+---+
|  1|  0|  1|  0|  0|  1|...|  1|
+---+---+---+---+---+---+---+---+
I'd like to add to the first dataframe a new column containing the values of all the columns listed in df1.questions.
Expected result:
df1 =
+------------+---------+
|questions   |values   |
+------------+---------+
|[Q1, Q2]    |[1, 0]   |
|[Q4, Q6, Q7]|[0, 0, 1]|
|...         |...      |
+------------+---------+
When I do
cols_to_link = ['Q1', 'Q2']
df2 = df2.select([col for col in cols_to_link])
df2 = df2.withColumn('value', F.concat_ws(", ", *df2.columns))
the additional column is what I want, but I can't do it by mixing the two dataframes.
It also works when I stay within df2:
df2 = df2.select([col for col in df1.select('questions').collect()[0][0]])
df2 = df2.withColumn('value', F.concat_ws(", ", *df2.columns))
But not when I want to start from df1:
df1= df1\
.withColumn('value', F.concat_ws(", ", *df2.select([col for col in df1.select('questions').collect()])))
Where am I wrong?
From my example dataframes,
# df1
+------------+
| questions|
+------------+
| [Q1, Q2]|
|[Q4, Q6, Q7]|
+------------+
# df2
+---+---+---+---+---+---+
| Q1| Q2| Q3| Q4| Q6| Q7|
+---+---+---+---+---+---+
| 1| 0| 1| 0| 0| 1|
+---+---+---+---+---+---+
I have created a vertical dataframe and joined it. In general, you cannot refer to the columns of another dataframe.
import pyspark.sql.functions as f

# Unpivot df2: one (id, values) row per question column.
cols = df2.columns
df = df2.rdd.flatMap(lambda row: [[cols[i], row[i]] for i in range(0, len(row))]).toDF(['id', 'values'])
df.show()
+---+------+
| id|values|
+---+------+
| Q1| 1|
| Q2| 0|
| Q3| 1|
| Q4| 0|
| Q6| 0|
| Q7| 1|
+---+------+
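If you prefer to stay in the DataFrame API, the same vertical dataframe can be built with a SQL stack() expression instead of going through the RDD. A minimal sketch, assuming df2 holds the single row of question columns shown above:

import pyspark.sql.functions as f

# Unpivot df2 with stack(): one (id, values) row per question column.
cols = df2.columns
stack_expr = "stack({}, {}) as (id, values)".format(
    len(cols),
    ", ".join("'{0}', {0}".format(c) for c in cols)
)
df = df2.select(f.expr(stack_expr))
df.show()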
df1.join(df, f.expr('array_contains(questions, id)'), 'left') \
.groupBy('questions').agg(f.collect_list('values').alias('values')) \
.show()
+------------+---------+
| questions| values|
+------------+---------+
| [Q1, Q2]| [1, 0]|
|[Q4, Q6, Q7]|[0, 0, 1]|
+------------+---------+
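Since df2 has a single row in the example, another option (a sketch, not the approach above) is to turn that row into a literal map and look each question up with transform, which also keeps the values in the same order as the questions array. This assumes Spark 3.1+ for the Python lambda form of transform:

import pyspark.sql.functions as f

# Collect the single df2 row and build a literal question -> value map.
row = df2.first().asDict()
mapping = f.create_map(*[f.lit(x) for c in df2.columns for x in (c, row[c])])

# Look each question up in the map; order follows the questions array.
df1.withColumn('values', f.transform('questions', lambda q: f.element_at(mapping, q))).show()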
Creating the dataframe:
a = spark.createDataFrame([
    ("1", "0", "0", "A"),
    ("1", "0", "2", "B"),
    ("1", "1", "2", "C"),
    ("1", "1", "3", "H"),
    ("1", "2", "2", "D"),
    ("1", "2", "2", "E")
], ["val1", "val2", "val3", "val4"])
Create an array column, explode it, and get counts:
from pyspark.sql.functions import array, col, explode

df_a = a.withColumn('arr_val', array(col('val1'), col('val2'), col('val3')))
df_b = df_a.withColumn('repeats', explode(col('arr_val'))).\
    groupby(['val1', 'val2', 'val3', 'repeats']).count().\
    filter(col('count') > 1)
df_a
+----+----+----+----+---------+
|val1|val2|val3|val4|arr_val |
+----+----+----+----+---------+
|1 |0 |0 |A |[1, 0, 0]|
|1 |0 |2 |B |[1, 0, 2]|
|1 |1 |2 |C |[1, 1, 2]|
|1 |1 |3 |H |[1, 1, 3]|
|1 |2 |2 |D |[1, 2, 2]|
|1 |2 |2 |E |[1, 2, 2]|
+----+----+----+----+---------+
df_b
+----+----+----+-------+-----+
|val1|val2|val3|repeats|count|
+----+----+----+-------+-----+
| 1| 0| 0| 0| 2|
| 1| 2| 2| 2| 2|
| 1| 1| 3| 1| 2|
| 1| 1| 2| 1| 2|
+----+----+----+-------+-----+
I do feel this is unoptimized.
It would be nice if someone could optimize this using something like expr('filter(arr_val, x -> Count(x))').
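Spark's higher-order functions (2.4+) can avoid the explode/groupBy round-trip entirely. A rough sketch of that idea: keep, per row, the distinct values of arr_val that occur more than once (the shape differs from df_b above, one array per row instead of one row per repeated value):

from pyspark.sql.functions import expr, size

# For each element x of arr_val, count how often it appears in the array
# and keep only the values that repeat; drop rows with no repeats.
df_c = df_a.withColumn(
    'repeats',
    expr("array_distinct(filter(arr_val, x -> size(filter(arr_val, y -> y = x)) > 1))")
).filter(size('repeats') > 0)
df_c.show(truncate=False)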