Extract a column value from a spark dataframe and add it to another dataframe
I have a Spark dataframe called "df_array" that always returns a single array as output, like below:

arr_value
[M,J,K]

I want to extract its value and add it as a column to another dataframe. Below is the code I was executing:
val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))
but my code always fails with "org.apache.spark.sql.AnalysisException: resolved attribute(s)".

Can someone help me with this?
The operation needed here is join.

You'll need to have a common column in both dataframes, which will be used as the "key". After the join you can select which columns to include in the new dataframe.
More details can be found here: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
The following performs a full outer join between df1 and df2.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
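The same idea in Scala, applied to this question: a minimal sketch assuming both dataframes share a key column (here a hypothetical "id" column, which the original df_array may not have - you would need to add one, e.g. with a literal):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// hypothetical sample data with a shared "id" key
val old_df = Seq((1, "a"), (2, "b")).toDF("id", "old_col")
val df_array = Seq((1, Array("M", "J", "K"))).toDF("id", "arr_value")

// equi-join on the shared key, then select the columns for the new dataframe
val new_df = old_df
  .join(df_array, Seq("id"), "left_outer")
  .select($"id", $"old_col", $"arr_value".as("new_array_value"))

new_df.show()
```

Rows of old_df without a matching key get a null new_array_value because of the left_outer join; use "inner" to drop them instead.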
If you know that df_array has only one record, you can collect it to the driver using first() and then use it as an array of literal values to create a column in any DataFrame:
import org.apache.spark.sql.functions._
import scala.collection.mutable

// first - collect that single array to the driver (assuming an array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)
// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))
new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// | 1| a| [M, J, K]|
// | 2| b| [M, J, K]|
// +--------+--------+---------------+
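If you'd rather avoid collecting to the driver, a cross join achieves the same result when df_array has exactly one row, since every row of old_df is paired with that single array row (a sketch under that one-row assumption):

```scala
import org.apache.spark.sql.functions.col

// df_array must contain exactly one row, or the result multiplies
val new_df = old_df.crossJoin(
  df_array.select(col("arr_value").as("new_array_value"))
)
```

crossJoin is available from Spark 2.1 onward; with a single-row right side the output has the same row count as old_df.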