
How to merge arrays within an array in Scala Spark

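I want to create a list of dates and topics. The topics are in a specific column, but with some extra information attached, so I need to strip that extra information (the numbers in the data are not counts but positions in the text).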

My idea is to produce a list of the form (Date, [(Topic, count), (Topic, count)...])

I built the following map/reduce steps, using a simplified data set:

Because this is a school assignment, I cannot use Spark SQL or DataFrames/Datasets.

val rdd1 = sc.textFile("./somedata.csv")
// LOADS DATA FROM CSV FILE

val rdd2 = rdd1.map(l => l.split("\t")).filter(x => x.length > 3)
// COLUMNS ARE SPLIT BY TABS; KEEP ONLY ROWS THAT HAVE AT LEAST 4 COLUMNS
//Array[Array[String]] = Array(
    //Array(2015, jaap, arie, piet boosboom,10;arie koekwaus,20;moet dat,9;), 
    //Array(2015, sjaak, trekhaak, pieter jaap,20;krijg nou wat,90;), 
    //Array(2016, "", huh, ja ja,10;nee nee,5;ja ja,4;nee nee,3;), 
    //Array(2018, "", wat, huh huh,69;nou moe,70;ja ja, 12;))



val rdd3 = rdd2.map(x => (x(0)+"\t"+x(3))).map(l => l.split("\t"))
// KEEP ONLY THE DATE (x(0)) AND THE TOPICS COLUMN (x(3))
//Array[Array[String]] = Array(
    //Array(2015, piet boosboom,10;arie koekwaus,20;moet dat,9;), 
    //Array(2015, pieter jaap,20;krijg nou wat,90;), 
    //Array(2016, ja ja,10;nee nee,5;ja ja,4;nee nee,3;), 
    //Array(2018, huh huh,69;nou moe,70;ja ja, 12;))


val rdd4 = rdd3.map(x => (x(0), x(1).split(";")))
// SPLIT THE TOPICS COLUMN ON ';' INTO ONE ENTRY PER TOPIC
// Array[(String, Array[String])] = Array(
//     (2015,Array(piet boosboom,10, arie koekwaus,20, moet dat,9)), 
//     (2015,Array(pieter jaap,20, krijg nou wat,90)), 
//     (2016,Array(ja ja,10, nee nee,5, ja ja,4, nee nee,3)), 
//     (2018,Array(huh huh,69, nou moe,70, ja ja, 12)))

val rdd5 = rdd4.map(x => (x._1, x._2.map(l => l.substring(0,l.indexOf(",")))))
// REMOVE JUNK AFTER EACH TOPIC 
//Array[(String, Array[String])] = Array(
    //(2015,Array(piet boosboom, arie koekwaus, moet dat)), 
    //(2015,Array(pieter jaap, krijg nou wat)), 
    //(2016,Array(ja ja, nee nee, ja ja, nee nee)), 
    //(2018,Array(huh huh, nou moe, ja ja)))


val rdd6 = rdd5.map(x => (x._1, x._2.map(l => (l, 1))))
// MAKE KEY VALUE PAIR OF TOPICS
// Array[(String, Array[(String, Int)])] = Array(
    //(2015,Array((piet boosboom,1), (arie koekwaus,1), (moet dat,1))), 
    //(2015,Array((pieter jaap,1), (krijg nou wat,1))), 
    //(2016,Array((ja ja,1), (nee nee,1), (ja ja,1), (nee nee,1))), 
    //(2018,Array((huh huh,1), (nou moe,1), (ja ja,1))))



val rdd7 = rdd6.map(x => (x._1, List(x._2))).reduceByKey(_:::_) 
//https://stackoverflow.com/questions/32248395/what-is-the-use-of-triple-colons-in-scala?lq=1
// CREATE LIST OF ARRAY OF TOPICS AND CONCATENATE THEM BASED ON DATE
// Array[(String, List[Array[(String, Int)]])] = Array(
    //(2015,List(Array((piet boosboom,1), (arie koekwaus,1), (moet dat,1)), Array((pieter jaap,1), (krijg nou wat,1)))), 
    //(2016,List(Array((ja ja,1), (nee nee,1), (ja ja,1), (nee nee,1)))), 
    //(2018,List(Array((huh huh,1), (nou moe,1), (ja ja,1)))))

The problem: in rdd7 (the last command) I get a list containing two arrays, and for some reason I cannot come up with a map that concatenates the two arrays in that list:

In rdd7:

(2015,List(Array((piet boosboom,1), (arie koekwaus,1), (moet dat,1)), Array((pieter jaap,1), (krijg nou wat,1)))), 

What I want in rdd8:

(2015,List(Array((piet boosboom,1), (arie koekwaus,1), (moet dat,1), (pieter jaap,1), (krijg nou wat,1))))

I tried something like this:

val rdd7 = rdd6.map(x => (x._1, List(x._2))).reduceByKey(_:::_).group(_++_)

But somehow it does not work.
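
A likely cause: as far as I know, an RDD has no `group` method (there is `groupBy`, and `groupByKey` on pair RDDs), so the chained call would not compile. As a rough sketch of one alternative, assuming the goal is a single array of pairs per date, the arrays can be concatenated directly inside `reduceByKey`, which avoids the nested list altogether. The sample data mirrors the rdd6 output shown above; the app name and the local master setting are assumptions for running outside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses an existing context (e.g. in spark-shell); otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("merge-arrays-sketch").setMaster("local[*]"))

// Sample data shaped like rdd6 in the question.
val rdd6 = sc.parallelize(Seq(
  ("2015", Array(("piet boosboom", 1), ("arie koekwaus", 1), ("moet dat", 1))),
  ("2015", Array(("pieter jaap", 1), ("krijg nou wat", 1))),
  ("2016", Array(("ja ja", 1), ("nee nee", 1), ("ja ja", 1), ("nee nee", 1))),
  ("2018", Array(("huh huh", 1), ("nou moe", 1), ("ja ja", 1)))
))

// Array ++ Array yields one flat Array, so no List wrapper or flatten is needed.
val merged = rdd6.reduceByKey(_ ++ _)

merged.collect.foreach { case (date, topics) =>
  println(s"$date -> ${topics.mkString(", ")}")
}
```

The order of the concatenated pairs within one date depends on how the data is partitioned, so it should not be relied on.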

Fortunately, I found the answer myself!

For anyone who runs into the same problem, this worked for me:

val rdd8 = rdd7.map(x => (x._1, x._2.flatten)).collect
// Array[(String, List[(String, Int)])] = Array(
//     (2015,List((piet boosboom,1), (arie koekwaus,1), (moet dat,1), (pieter jaap,1), (krijg nou wat,1))), 
//     (2016,List((ja ja,1), (nee nee,1), (ja ja,1), (nee nee,1))), 
//     (2018,List((huh huh,1), (nou moe,1), (ja ja,1))))
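
The result above still holds one `(topic, 1)` pair per occurrence, so a topic that appears twice under a date (for example `ja ja` in 2016) is listed twice. As a minimal sketch of the remaining step toward the `(Date, [(Topic, count), ...])` shape from the question, the pairs can be summed per topic with plain Scala collections, since `rdd8` is already a local `Array` after `.collect`. The sample value and the helper name `countTopics` are assumptions for illustration only.

```scala
// Sample value shaped like the collected rdd8 shown above.
val rdd8: Array[(String, List[(String, Int)])] = Array(
  ("2015", List(("piet boosboom", 1), ("arie koekwaus", 1), ("moet dat", 1),
                ("pieter jaap", 1), ("krijg nou wat", 1))),
  ("2016", List(("ja ja", 1), ("nee nee", 1), ("ja ja", 1), ("nee nee", 1))),
  ("2018", List(("huh huh", 1), ("nou moe", 1), ("ja ja", 1)))
)

// Hypothetical helper: sum the 1s per topic, then sort by topic for stable output.
def countTopics(pairs: List[(String, Int)]): List[(String, Int)] =
  pairs
    .groupBy(_._1)                                        // topic -> all its pairs
    .map { case (topic, ps) => (topic, ps.map(_._2).sum) } // topic -> total count
    .toList
    .sortBy(_._1)

val counted: Array[(String, List[(String, Int)])] =
  rdd8.map { case (date, pairs) => (date, countTopics(pairs)) }

counted.foreach(println)
// the 2016 entry prints as: (2016,List((ja ja,2), (nee nee,2)))
```

For data too large to collect, the same counting could presumably be done on the RDD side before collecting, for instance by keying on `(date, topic)` and using `reduceByKey(_ + _)`.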
