Formatting the join RDD - Apache Spark

I have two key-value pair RDDs. I join the two RDDs and call saveAsTextFile. Here is the code:

val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))

val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))

val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)

val output = final_res.saveAsTextFile("C:/out")

My output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))

How can I get rid of all the parentheses? I want my output as follows:

534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None

When outputting to a text file, Spark will just use the toString representation of each element in the RDD. If you want control over the format, you can do one last transform of the data to a String before the call to saveAsTextFile.
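
For illustration, the parentheses you are seeing are just the tuples' default toString (a minimal sketch with made-up values):

(534309, ((17999, 5161), None)).toString
// "(534309,((17999,5161),None))"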

Luckily, the tuples that arise from using the Spark API can be pulled apart using destructuring. In your example I'd do:

val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
  // shape is (key, ((left-side fields), Option[right])), as produced by leftOuterJoin
  val (f1, ((f2, f3, f4, f5, f6, f7, f8, f9), f10)) = tuple
  Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")

The first val line takes the tuple that is passed into the map function and assigns its components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields with commas.
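
Note that f10 is the Option produced by leftOuterJoin, so matched rows will render as Some((a,b)) rather than as bare values. If you want those flattened too, one possible tweak (a sketch, assuming the right side is a pair as in the question):

val rightPart = f10.map { case (a, b) => s"$a,$b" }.getOrElse("None")
Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9).mkString(",") + "," + rightPart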

In cases with fewer fields, or when you're just hacking away at a problem in the REPL, a slight alternative to the above is to use pattern matching in the partial function passed to map:

simpleJoinedRdd.map { case (key, (left, right)) => s"$key,$left,$right" }

While that does let you make it a single-line expression, it can throw an Exception if the data in the RDD don't match the pattern provided, whereas in the earlier example the compiler will complain if the tuple parameter cannot be destructured into the expected form.
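
If you want the one-line form without that risk, one option (a sketch, reusing the same placeholder simpleJoinedRdd) is to add a catch-all case so non-matching elements fall through instead of throwing:

simpleJoinedRdd.map {
  case (key, (left, right)) => s"$key,$left,$right"
  case other                => other.toString // fallback instead of a MatchError
}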

You can do something like this:

import scala.collection.JavaConversions._ // implicitly converts the Scala iterator for Guava's Joiner

val output = sc.parallelize(List((534309, ((17999, 5161, 45005, 1, "XYZ", "", 29.95, 0.00), None))))
val result = output
  .map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2) // key + tuple fields + Option, flattened into one Buffer
  .map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))

I used Guava to format the string, but there is probably a more idiomatic Scala way of doing this.
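
For reference, one such plain-Scala version of the same flattening (no Guava; a sketch assuming the same data shape as above) could be:

val scalaResult = output.map { case (key, (fields, opt)) =>
  (key +: fields.productIterator.toSeq :+ opt).mkString(", ")
}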

Do a flatMap before saving. Or, you can write a simple format function and use it in map. Adding a bit of code, just to show how it can be done; the function formatOnDemand can be anything:

def formatOnDemand(t):
    # flatten (key, ((fields...), option)) into a single flat list
    out = []
    out.append(t[0])
    for tok in t[1][0]:
        out.append(tok)
    out.append(t[1][1])
    return out

test = sc.parallelize([(534309, ((17999, 5161, 45005, 0, "XYZ", "", 29.95, 0.00), None))])
print(test.collect())
print(test.map(formatOnDemand).collect())
# to write CSV lines instead of Python lists, join each list first, e.g.
# test.map(formatOnDemand).map(lambda r: ",".join(map(str, r)))

Output:
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]
