
How to access elements column-wise in Spark dataframes

I have a text file which contains the following data:
3 5
10 20 30 40 50
0 0 0 2 5
5 10 10 10 10
Question:
The first line of the file gives the number of rows and the number of columns of the data.
For each column, print the sum of the column if no element of the column is prime; otherwise print zero.
Output:
0
30
40
0
0
Explanation: 0 because the column 10 0 5 contains the prime number 5; 30 because the column 20 0 10 contains no prime number, so we print 20+0+10=30. The same logic applies to all the columns.
Please suggest a method to access the dataframe in a column-wise manner.

General idea: just zip every value with its column index to create a pair RDD, then apply a reduceByKey (the key here is the column index), checking at each step whether the number is a prime.

// The three data rows from the question, hard-coded for the example
val rows = spark.sparkContext.parallelize(
  Seq(
    Array(10, 20, 30, 40, 50),
    Array(0, 0, 0, 2, 5),
    Array(5, 10, 10, 10, 10)
  )
)
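
In the question the rows actually come from a text file whose first line holds the dimensions. A minimal sketch of building the same rows RDD from such a file; the path input.txt and the header-skipping filter are assumptions of mine, not part of the original answer:

// Hypothetical path; the first line ("3 5") only carries the dimensions,
// so it is dropped and every remaining line is parsed into an Array[Int].
val lines = spark.sparkContext.textFile("input.txt")
val header = lines.first()
val rows = lines
  .filter(_ != header)
  .map(_.trim.split("\\s+").map(_.toInt))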

// Trial-division primality test: i is prime if i >= 2 and no integer in 2 until i-1 divides it
def isPrime(i: Int): Boolean = i >= 2 && !((2 until i - 1) exists (i % _ == 0))

// Pair every value with its column index (the index becomes the key) and wrap
// the value in an Option, so a column can be marked None once a prime is seen.
val result = rows.flatMap{ arr => arr.map(Option(_)).zipWithIndex.map(_.swap) }
  .reduceByKey{
    case (None, _) | (_, None) => None                           // a prime was already found in this column
    case (Some(a), Some(b)) if isPrime(a) || isPrime(b) => None  // a prime shows up: invalidate the column
    case (Some(a), Some(b)) => Some(a + b)                       // no prime so far: keep summing
  }
  .map{ case (k, v) => k -> v.getOrElse(0) }                     // None (prime found) becomes 0

result.foreach(println)

Output (you'll have to collect the data in order to sort by column index):

(3,0)
(0,0)
(4,0)
(2,40)
(1,30)
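
To get the sums back in column order, as the note above suggests, one way is to bring the pairs to the driver and sort them by the index key; a small sketch of that collect/sort step (my addition to the answer above):

// Collect the (columnIndex, sum) pairs, order them by column index,
// and print only the sums; this reproduces the expected 0 30 40 0 0.
result.collect()
  .sortBy(_._1)
  .foreach { case (_, sum) => println(sum) }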
