
Creating RDDs and outputting to text files with Scala and Spark

I apologise for what will probably be a simple question, but I'm struggling to get to grips with parsing RDDs in Scala/Spark. I have an RDD created from a CSV file, read in with:

    val partitions: RDD[(String, String, String, String, String)] =
      withoutHeader.mapPartitions(lines => {
        val parser = new CSVParser(',')
        lines.map(line => {
          val columns = parser.parseLine(line)
          (columns(0), columns(1), columns(2), columns(3), columns(4))
        })
      })

When I output this to a file with

partitions.saveAsTextFile(file)

I get output with parentheses on each line, and I don't want those parentheses. I'm struggling in general to understand what is happening here. My background is in low-level languages, and I'm finding it hard to see through the abstractions to what the code is actually doing. I understand the mappings, but it's the output that is escaping me. Can someone either explain what is going on in the line (columns(0), columns(1), columns(2), columns(3), columns(4)), or point me to a guide that simply explains what is happening?

My ultimate goal is to be able to manipulate files on HDFS in Spark to put them into formats suitable for MLlib. I'm unimpressed with the Spark and Scala guides, as they look like they were produced from poorly annotated javadocs and don't really explain anything.

Thanks in advance.

Dean

The parentheses come from saveAsTextFile calling toString on each element: your RDD contains Tuple5 values, and a tuple's toString includes the parentheses and commas. I would just convert your tuple to the string format you want. For example, to create |-delimited output:

partitions.map{ tup => s"${tup._1}|${tup._2}|${tup._3}|${tup._4}|${tup._5}" }

or, using pattern matching (which incurs a little more runtime overhead):

partitions.map{ case (a,b,c,d,e) => s"$a|$b|$c|$d|$e" }

I'm using Scala's string interpolation feature here (note the s"..." prefix).
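
Either way, the result is an RDD[String], and saveAsTextFile will write those strings verbatim, one per line, with no parentheses. For completeness, here is a minimal end-to-end sketch under the same assumptions as the question (opencsv's CSVParser on the classpath; the input and output paths are hypothetical):

    import au.com.bytecode.opencsv.CSVParser
    import org.apache.spark.{SparkConf, SparkContext}

    object CsvToPipeDelimited {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CsvToPipeDelimited"))

        // Hypothetical input path; drop the header line, as in the question.
        val lines = sc.textFile("hdfs:///data/input.csv")
        val header = lines.first()
        val withoutHeader = lines.filter(_ != header)

        val formatted = withoutHeader.mapPartitions { iter =>
          val parser = new CSVParser(',')   // one parser per partition, so nothing needs serializing
          iter.map { line =>
            val columns = parser.parseLine(line)
            s"${columns(0)}|${columns(1)}|${columns(2)}|${columns(3)}|${columns(4)}"
          }
        }

        formatted.saveAsTextFile("hdfs:///data/output")   // hypothetical output path
        sc.stop()
      }
    }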

Side note: you can simplify your example by mapping over the RDD as a whole, rather than over the individual partitions:

    val parser = new CSVParser(',')   // created once on the driver; the closure below captures it, so it must be serializable
    val partitions: RDD[(String, String, String, String, String)] =
      withoutHeader.map { line =>
        val columns = parser.parseLine(line)
        (columns(0), columns(1), columns(2), columns(3), columns(4))
      }
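
One more note on the formatting step: since tuples are Products, you can also build the delimited string with productIterator and mkString, which works for any tuple arity (a swapped-in alternative, not what the answer above uses):

    partitions.map(_.productIterator.mkString("|"))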
