
Scala, RDD Array[string] concatenation

Is there any way to concatenate three RDD Array[string] values? I am new to Scala and still learning new techniques.

I have three RDD Array[string] values that look like this:

    RDD1 = ['string1', 'string2', 'string3']
    RDD2 = ['stringa', 'stringb', 'stringc']
    RDD3 = ['stringA', 'stringB', 'stringC']

But the trick is that I need the result ordered row by row: the first element of each RDD, then the second element of each, and so on. So, after concatenation, it should look like this:

    RDD = ['string1', 'stringa', 'stringA',
           'string2', 'stringb', 'stringB',
           'string3', 'stringc', 'stringC']

If I use .union, that just gives me this:

    ['string1', 'string2', 'string3',
     'stringa', 'stringb', 'stringc',
     'stringA', 'stringB', 'stringC']

Is there any way to accomplish this?

In regular Scala, you could do it with transpose, like:

Array(r1, r2, r3).transpose.flatten
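As a worked example, here is a minimal sketch of that one-liner applied to the arrays from the question. This is plain Scala on local arrays, not Spark; the names r1, r2, r3 follow the snippet above, and the wrapping object TransposeExample is only there for illustration.

```scala
object TransposeExample {
  // Local arrays holding the values from the question (not RDDs).
  val r1 = Array("string1", "string2", "string3")
  val r2 = Array("stringa", "stringb", "stringc")
  val r3 = Array("stringA", "stringB", "stringC")

  // transpose turns the three rows into three columns;
  // flatten then lays the columns out one after another.
  val interleaved: Array[String] = Array(r1, r2, r3).transpose.flatten

  def main(args: Array[String]): Unit =
    // prints: string1, stringa, stringA, string2, stringb, stringB, string3, stringc, stringC
    println(interleaved.mkString(", "))
}
```

Note that transpose expects all inner arrays to have the same length.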

I'm not very familiar with Spark, but I don't believe transpose is available there. If you know you only need a 3x3, you can get the same result with:

r1 zip r2 zip r3 flatMap {case ((a, b), c) => Array(a,b,c)}

If you need to generalize this to any n×n, that is going to require a recursive algorithm.
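One possible way to express that recursion without writing it by hand is a foldLeft over the rows. The sketch below works on local sequences rather than RDDs; interleave and InterleaveN are hypothetical names introduced here, not part of the original answer.

```scala
object InterleaveN {
  // Hypothetical helper: interleaves any number of sequences by taking the
  // i-th element of each in turn. zip truncates to the shorter side, so
  // rows are assumed to have equal length.
  def interleave[A](rows: Seq[Seq[A]]): Seq[A] = rows match {
    case Seq() => Seq.empty
    case first +: rest =>
      // Start with one single-element group per item of the first row,
      // then append the matching item of every later row to its group.
      val groups = rest.foldLeft(first.map(Seq(_))) { (acc, row) =>
        acc.zip(row).map { case (group, item) => group :+ item }
      }
      groups.flatten
  }

  val result: Seq[String] = interleave(Seq(
    Seq("string1", "string2", "string3"),
    Seq("stringa", "stringb", "stringc"),
    Seq("stringA", "stringB", "stringC")))

  def main(args: Array[String]): Unit =
    println(result.mkString(", "))
}
```

The fold plays the role of the chained zip calls: each step zips the accumulated groups with one more row, so the same code covers three rows or thirty.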

So you want the first elements of the three RDDs to be together. You can do that easily by first calling zipWithIndex on each RDD and then joining the three on that index. I am assuming you want them in the same record, because RDDs have no inherent sense of ordering.
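The zipWithIndex-and-join approach could be sketched roughly as follows. This assumes spark-core is on the classpath and runs in local mode; the object name ZipWithIndexJoin and the helpers byIndex and interleave are hypothetical names chosen for illustration, and the final flatMap is only needed if you want a flat RDD of strings rather than one record per row.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipWithIndexJoin {
  // Hypothetical helper: key each element by its position in the RDD.
  def byIndex(rdd: RDD[String]): RDD[(Long, String)] =
    rdd.zipWithIndex().map { case (value, idx) => (idx, value) }

  // Hypothetical helper: join the three RDDs on the index, restore the
  // order with sortByKey, then flatten each (a, b, c) row.
  // An inner join drops indices missing from any input, so the RDDs are
  // assumed to have the same number of elements.
  def interleave(a: RDD[String], b: RDD[String], c: RDD[String]): RDD[String] =
    byIndex(a).join(byIndex(b)).join(byIndex(c))
      .sortByKey()
      .flatMap { case (_, ((x, y), z)) => Seq(x, y, z) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-interleave")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq("string1", "string2", "string3"))
      val rdd2 = sc.parallelize(Seq("stringa", "stringb", "stringc"))
      val rdd3 = sc.parallelize(Seq("stringA", "stringB", "stringC"))
      interleave(rdd1, rdd2, rdd3).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If one record per row is what you need, drop the flatMap and keep the joined tuples, for example by mapping each `((x, y), z)` to `Array(x, y, z)`.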

