
Extract a column value from a spark dataframe and add it to another dataframe

I have a Spark DataFrame called "df_array" that always returns a single array as output, like below.

arr_value
[M,J,K]

I want to extract its value and add it to another DataFrame. Below is the code I was executing:

val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))

But my code always fails with "org.apache.spark.sql.AnalysisException: resolved attribute(s)".

Can someone help me with this?

The operation needed here is a join. withColumn can only reference columns of the DataFrame it is called on, which is why passing a column that belongs to df_array fails.

You'll need a common column in both DataFrames, which will be used as the "key".

After the join you can select which columns to include in the new DataFrame.

More details can be found here: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

join(other, on=None, how=None)

Joins with another DataFrame, using the given join expression.
Parameters: 

    other – Right side of the join
    on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
    how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.

The following performs a full outer join between df and df2.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
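
If the two DataFrames share no key column, a cross join is one way to apply the join idea here, since df_array holds a single row. The following is only a minimal sketch, assuming Spark 2.1 or later (where crossJoin is available) and an array column named "arr_value" as shown in the question. The sample data and the names CrossJoinSketch and attachArray are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object CrossJoinSketch {
  // Hypothetical helper: appends the array held in dfArray to every row of oldDf.
  // Assumes dfArray has exactly one row and a column named "arr_value".
  def attachArray(oldDf: DataFrame, dfArray: DataFrame): DataFrame =
    oldDf.crossJoin(
      // broadcast() is only a hint that the one-row side is small
      broadcast(dfArray.select(dfArray("arr_value").as("new_array_value")))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cross-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for old_df and df_array (assumed, not from the question)
    val oldDf = Seq((1, "a"), (2, "b")).toDF("old_col1", "old_col2")
    val dfArray = Seq(Tuple1(Seq("M", "J", "K"))).toDF("arr_value")

    attachArray(oldDf, dfArray).show()
    spark.stop()
  }
}
```

Note that if df_array ever contains more than one row, the cross join multiplies the rows of the other DataFrame accordingly, so this only fits the single-row case described in the question.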

If you know that df_array has only one record, you can collect it to the driver using first() and then use it as an array of literal values to create a column in any DataFrame:

import org.apache.spark.sql.functions._
import scala.collection.mutable

// first - collect that single array to driver (assuming array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)

// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*)) 

new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// |       1|       a|      [M, J, K]|
// |       2|       b|      [M, J, K]|
// +--------+--------+---------------+
