Extract a column value from a spark dataframe and add it to another dataframe
I have a Spark dataframe called "df_array" that always returns a single array as output, like below:

arr_value
[M,J,K]

I want to extract its value and add it as a column to another dataframe. Below is the code I was executing:
val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))
but my code always fails with "org.apache.spark.sql.AnalysisException: resolved attribute(s)".

Can someone help me with this?
The operation needed here is join.

You'll need to have a common column in both dataframes, which will be used as the "key". After the join you can select which columns to include in the new dataframe.
More details can be found here: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
The following performs a full outer join between df1 and df2.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
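The same idea in Scala, applied to this question: a minimal sketch assuming both dataframes share a key column (here a hypothetical "id" column, which the original df_array may not have - you would need to add one, e.g. with a literal):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// hypothetical sample data with a shared "id" key
val old_df = Seq((1, "a"), (2, "b")).toDF("id", "old_col")
val df_array = Seq((1, Array("M", "J", "K"))).toDF("id", "arr_value")

// equi-join on the shared key, then select the columns for the new dataframe
val new_df = old_df
  .join(df_array, Seq("id"), "left_outer")
  .select($"id", $"old_col", $"arr_value".as("new_array_value"))

new_df.show()
```

Rows of old_df without a matching key get a null new_array_value because of the left_outer join; use "inner" to drop them instead.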
If you know that df_array has only one record, you can collect it to the driver using first() and then use it as an array of literal values to create a column in any DataFrame:
import org.apache.spark.sql.functions._
import scala.collection.mutable

// first - collect that single array to the driver (assuming an array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)
// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))
new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// | 1| a| [M, J, K]|
// | 2| b| [M, J, K]|
// +--------+--------+---------------+
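If you'd rather avoid collecting to the driver, a cross join achieves the same result when df_array has exactly one row, since every row of old_df is paired with that single array row (a sketch under that one-row assumption):

```scala
import org.apache.spark.sql.functions.col

// df_array must contain exactly one row, or the result multiplies
val new_df = old_df.crossJoin(
  df_array.select(col("arr_value").as("new_array_value"))
)
```

crossJoin is available from Spark 2.1 onward; with a single-row right side the output has the same row count as old_df.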