I have a dataframe in the format below:
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Another dataframe contains a coefficient for each column of the first dataframe, for example:
+------+------+---------+------+
| CE_sp|CE_sp2|CE_colour|CE_sp3|
+------+------+---------+------+
| 0.94| 0.31| 0.11| 0.72|
+------+------+---------+------+
Now I want to add a column to the first dataframe, calculated by summing each row's values weighted by the coefficients from the second dataframe.
For example:
+---+---+------+---+-----+
| sp|sp2|colour|sp3|Score|
+---+---+------+---+-----+
| 0| 1| 1| 0| 0.42|
| 1| 0| 0| 1| 1.66|
| 0| 0| 1| 0| 0.11|
+---+---+------+---+-----+
i.e., with r being a row of the first dataframe:
score = r(0)*CE_sp + r(1)*CE_sp2 + r(2)*CE_colour + r(3)*CE_sp3
There can be any number of columns, and the order of the columns can differ.
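To make the arithmetic concrete, here is a minimal plain-Python sketch (no Spark involved) of the score as a weighted sum. The names `coefs`, `rows` and `score` are only illustrative; values are matched to coefficients by column name, so column order does not matter.

```python
# Coefficients from the example above, keyed by the feature column they apply to.
coefs = {"sp": 0.94, "sp2": 0.31, "colour": 0.11, "sp3": 0.72}

# Rows of the first dataframe, as dicts so that column order is irrelevant.
rows = [
    {"sp": 0, "sp2": 1, "colour": 1, "sp3": 0},
    {"sp": 1, "sp2": 0, "colour": 0, "sp3": 1},
    {"sp": 0, "sp2": 0, "colour": 1, "sp3": 0},
]

def score(row, coefs):
    """Weighted sum of the row's values; columns are matched by name."""
    return sum(value * coefs[name] for name, value in row.items())

# Rounded to two decimals to hide floating-point noise.
scores = [round(score(r, coefs), 2) for r in rows]
print(scores)  # [0.42, 1.66, 0.11]
```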
Thanks in Advance!!!
Quick and simple:
import org.apache.spark.sql.functions.col
val df = Seq(
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
).toDF("sp","sp2", "colour", "sp3")
val coefs = Map("sp" -> 0.94, "sp2" -> 0.31, "colour" -> 0.11, "sp3" -> 0.72)
val score = df.columns.map(
c => col(c) * coefs.getOrElse(c, 0.0)).reduce(_ + _)
df.withColumn("score", score)
And the same thing in PySpark:
from pyspark.sql.functions import col
df = sc.parallelize([
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
]).toDF(["sp","sp2", "colour", "sp3"])
coefs = {"sp": 0.94, "sp2": 0.31, "colour": 0.11, "sp3": 0.72}
df.withColumn("score", sum(col(c) * coefs.get(c, 0) for c in df.columns))
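Both snippets assume the coefficients already sit in a map keyed by feature column name, whereas the question keeps them in a one-row dataframe with `CE_`-prefixed column names. A minimal sketch of that conversion in plain Python follows; `coef_row` is a hypothetical stand-in for the single coefficient row once it has been collected to the driver (in PySpark, something like `coef_df.first().asDict()`, with `coef_df` being an assumed name).

```python
# Hypothetical stand-in for the single coefficient row collected to the driver,
# e.g. what coef_df.first().asDict() would give in PySpark (assumption).
coef_row = {"CE_sp3": 0.72, "CE_colour": 0.11, "CE_sp": 0.94, "CE_sp2": 0.31}

PREFIX = "CE_"

def to_coefs(coef_row, prefix=PREFIX):
    """Strip the prefix so the keys match the feature column names."""
    return {
        name[len(prefix):]: float(value)
        for name, value in coef_row.items()
        if name.startswith(prefix)
    }

coefs = to_coefs(coef_row)
print(sorted(coefs.items()))
```

The resulting `coefs` dict can then be used as in the snippets above; since lookups are by name, the column order of either dataframe is irrelevant.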
I believe there are many ways to accomplish what you are trying to do. In any case, you don't need that second DataFrame, as I said in the comments.
Here is one way:
import org.apache.spark.ml.feature.{ElementwiseProduct, VectorAssembler}
// On Spark 2.0+, import Vectors and Vector from org.apache.spark.ml.linalg instead
import org.apache.spark.mllib.linalg.{Vectors, Vector => MLVector}
val df = Seq((0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)).toDF("sp", "sp2", "colour", "sp3")
// The coefficients form a dense vector; they must be listed in the same order as df.columns
val coeffSp = 0.94
val coeffSp2 = 0.31
val coeffColour = 0.11
val coeffSp3 = 0.72
val weightVectors = Vectors.dense(Array(coeffSp, coeffSp2, coeffColour, coeffSp3))
// You can assemble the features with VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(df.columns) // since you need to compute on all your columns
.setOutputCol("features")
// Once the features are assembled, we can perform an element-wise product with the weight vector
val output = assembler.transform(df)
val transformer = new ElementwiseProduct()
.setScalingVec(weightVectors)
.setInputCol("features")
.setOutputCol("weightedFeatures")
// Create a UDF that sums the values of the weighted vector
import org.apache.spark.sql.functions.udf
val score = udf((v: MLVector) => v.toArray.sum)
// Apply the UDF on the weightedFeatures
val scores = transformer.transform(output).withColumn("score",score('weightedFeatures))
scores.show
// +---+---+------+---+-----------------+-------------------+-----+
// | sp|sp2|colour|sp3| features| weightedFeatures|score|
// +---+---+------+---+-----------------+-------------------+-----+
// | 0| 1| 1| 0|[0.0,1.0,1.0,0.0]|[0.0,0.31,0.11,0.0]| 0.42|
// | 1| 0| 0| 1|[1.0,0.0,0.0,1.0]|[0.94,0.0,0.0,0.72]| 1.66|
// | 0| 0| 1| 0| (4,[2],[1.0])| (4,[2],[0.11])| 0.11|
// +---+---+------+---+-----------------+-------------------+-----+
I hope this helps. Don't hesitate to ask if you have more questions.
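One caveat with the vector approach is that the weight vector is positional, so it has to follow the same column order as the assembled features. The sketch below, in plain Python rather than Spark ML, mimics the three steps (assemble, element-wise product, sum) and derives the weights from a name-keyed map so that a shuffled column order still gives the right score. The helper names `weight_vector` and `elementwise_product` are illustrative and not part of any Spark API.

```python
# Coefficients keyed by feature column name (values from the question).
coefs = {"sp": 0.94, "sp2": 0.31, "colour": 0.11, "sp3": 0.72}

def weight_vector(columns, coefs):
    """Build the positional weight vector in the same order as `columns`."""
    return [coefs[c] for c in columns]

def elementwise_product(features, weights):
    """Multiply two equal-length vectors position by position."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return [f * w for f, w in zip(features, weights)]

# Deliberately shuffled column order, to show the weights follow it.
columns = ["colour", "sp3", "sp", "sp2"]
row = {"sp": 1, "sp2": 0, "colour": 0, "sp3": 1}

features = [float(row[c]) for c in columns]         # "assemble" step
weights = weight_vector(columns, coefs)             # aligned with `columns`
weighted = elementwise_product(features, weights)   # element-wise product
score = round(sum(weighted), 2)                     # final sum
print(weights, weighted, score)
```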
Here is a simple solution:
scala> df_wght.show
+-----+------+---------+------+
|ce_sp|ce_sp2|ce_colour|ce_sp3|
+-----+------+---------+------+
| 1| 2| 3| 4|
+-----+------+---------+------+
scala> df.show
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
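Then we can do a cross join against the single-row weights DataFrame and compute the weighted sum of the columns (on Spark 2.1+, df.crossJoin(df_wght) makes the cross join explicit).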
val scored = df.join(df_wght).selectExpr("(sp*ce_sp + sp2*ce_sp2 + colour*ce_colour + sp3*ce_sp3) as final_score")
The output:
scala> scored.show
+-----------+
|final_score|
+-----------+
| 5|
| 5|
| 3|
+-----------+
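The expression above hard-codes the four column names, while the question allows any number of columns. As a hedged sketch, the same SQL expression string could be generated from the list of feature columns, assuming every weight column is named `ce_<column>` as in this example. The helper `build_score_expr` is hypothetical; its result would be passed to `selectExpr`.

```python
def build_score_expr(columns, prefix="ce_", alias="final_score"):
    """Build '(c1*ce_c1 + c2*ce_c2 + ...) as alias' for the given columns."""
    if not columns:
        raise ValueError("at least one column is required")
    terms = " + ".join(f"{c}*{prefix}{c}" for c in columns)
    return f"({terms}) as {alias}"

# Feature columns of the first DataFrame (df.columns in Spark).
columns = ["sp", "sp2", "colour", "sp3"]
expr = build_score_expr(columns)
print(expr)  # (sp*ce_sp + sp2*ce_sp2 + colour*ce_colour + sp3*ce_sp3) as final_score
```

Adding `"*"` as a first argument to `selectExpr` would keep the original columns next to the score, which is closer to what the question asks for.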