
Same operation on multiple columns in Spark using Scala

I'm new to Scala and Spark and wanted to try some simple concurrent operations on a matrix.

I have an [m, 2] matrix and want to divide each element of a column by the last element of that column.

Here is an example of what I want to achieve:

 9   25        3  5
 27  10    ->  9  2
 6   15        2  3
 3   5         1  1

I can do this with a simple for-loop, but wanted to do the operation on the columns simultaneously. Is this possible in Spark, or is it better to use Scala's concurrency features?

The most important question here is: what is your data volume? Spark is designed to be used on large amounts of data, too large to be processed or even stored on one computer. If you are wondering whether to do something in Spark or in plain Scala on a single machine, then you should probably stop considering Spark (unless your data volume is going to grow in the future).
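For a matrix of the size shown in the question, plain Scala is enough. A minimal sketch of the column-wise division (not part of the original answers; it assumes the matrix is kept as a Seq of rows):

// The matrix from the question, stored row by row.
val matrix: Seq[Seq[Int]] = Seq(
  Seq(9, 25),
  Seq(27, 10),
  Seq(6, 15),
  Seq(3, 5)
)

// Work column by column: divide every element by the column's last element.
val columns = matrix.transpose
val divided = columns.map(column => column.map(_ / column.last))

// Transpose back to rows: (3, 5), (9, 2), (2, 3), (1, 1)
println(divided.transpose)

If the columns really should be processed concurrently, columns.par.map(...) would parallelize the per-column work (directly in Scala 2.12, or via the scala-parallel-collections module in 2.13), although for data this small the overhead outweighs any benefit.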

Anyway, assuming for the moment that you are going to have large amounts of data, you could do:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(
  (1, 9, 25),
  (2, 27, 10),
  (3, 6, 15),
  (4, 3, 5)
).toDF("id", "n1", "n2")

// Fetch the last row (highest id); this launches a Spark job over the whole DataFrame.
val lastRow = df.orderBy(col("id").desc).first()

// Divide each value column by the corresponding value from the last row.
val result = df.withColumn("n1", col("n1") / lastRow.getInt(1))
  .withColumn("n2", col("n2") / lastRow.getInt(2))

result.show()

Result:

+---+---+---+
| id| n1| n2|
+---+---+---+
|  1|3.0|5.0|
|  2|9.0|2.0|
|  3|2.0|3.0|
|  4|1.0|1.0|
+---+---+---+

Please note that this is quite inefficient - even taking the last element is very costly here (not to mention the overhead of launching a Spark job). Doing something like this in Spark might be a good idea only when the data volume is large and you are forced to use cluster computing.

Here you go:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val df = Seq((9, 25), (27, 10), (6, 15), (3, 5)).toDF

// For each column, divide by that column's last collected value and cast back to Int.
val df_final = df.columns.foldLeft(df) { (tempDF, colName) =>
  tempDF.withColumn(colName,
    (col(colName) / lit(df.select(colName).collect.last.getInt(0))).cast("Int"))
}
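Since no names were passed to toDF, the columns default to _1 and _2; showing the result should reproduce the target matrix from the question (this output is inferred from the code above, not part of the original answer):

// Rows (3, 5), (9, 2), (2, 3), (1, 1) in columns _1 and _2.
df_final.show()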
