Split RDD[String] to RDD[tuples]

I'm a beginner with Scala and RDDs. I'm using Scala on Spark 2.4. I have an RDD[String] with lines like this:

(a, b, c, d, ...)

I would like to split this String at each comma to get an RDD[(String, String, String, ...)].

Solutions like the following are obviously impractical given the number of elements:

rdd.map(x => (x.split(",")(0), x.split(",")(1), x.split(",")(2)))

Is there a way to automate this? Any approach that works would be fine.

Despite my efforts, I have not found a solution so far.

Thanks a lot!

If the number of elements is fixed, you can do something like:

val tuples =
  rdd
    .map(line => line.replaceAll("[\\(\\)]", "").split(","))
    .collect {
      case Array(col1, col2, ..., coln) => (col1, col2, ..., coln)
    }
// tuples: RDD[(String, String, ..., String)]
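
As a concrete illustration of the pattern above, here is a minimal sketch for exactly three columns. It runs on a plain Scala Seq so it can be tried without a cluster; Seq.collect accepts the same kind of partial function as RDD.collect(f), so the two lambdas should carry over to an RDD unchanged. The sample lines, the three-column arity, and the extra trim are assumptions, not part of the question.

```scala
object FixedArityExample {
  // Hypothetical sample input in the "(a, b, c)" shape from the question.
  val lines: Seq[String] = Seq("(a, b, c)", "(d, e, f)", "(only, two)")

  // Strip the parentheses, split on commas, trim the padding, and keep
  // only the rows that have exactly three fields.
  val tuples: Seq[(String, String, String)] =
    lines
      .map(line => line.replaceAll("[()]", "").split(",").map(_.trim))
      .collect { case Array(c1, c2, c3) => (c1, c2, c3) }
}
```

Rows with a different number of fields, such as "(only, two)" here, are silently dropped by collect rather than causing a failure.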

One solution is to just write the mapping function:

def parse(s: String) = s.split(",") match {
    case Array(a,b,c) => (a,b,c)
}

parse("x,x,x") // (x,x,x)
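
Note that parse as written throws a scala.MatchError for any line that does not split into exactly three fields. Below is a minimal sketch of a total variant, assuming you would rather skip malformed lines than fail the job; the names SafeParse and parseOpt are hypothetical.

```scala
object SafeParse {
  // Returns Some(tuple) when the line has exactly three fields and
  // None otherwise, instead of throwing scala.MatchError.
  def parseOpt(s: String): Option[(String, String, String)] =
    s.split(",") match {
      case Array(a, b, c) => Some((a, b, c))
      case _              => None
    }
}
```

On an RDD this could be used as rdd.flatMap(line => SafeParse.parseOpt(line)), which keeps only the lines that parsed.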

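You could write a more generic solution using shapeless (with shapeless._, shapeless.ops.hlist.Tupler, shapeless.ops.traversable.FromTraversable and shapeless.syntax.std.traversable._ imported):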

def toTuple[H <: HList](s: String)(implicit ft: FromTraversable[H], t: Tupler[H]) = s.split(",").toHList[H].get.tupled

Then you can use it directly:

toTuple[String :: String :: String :: HNil]("x,x,x") // (x,x,x)
toTuple[String :: String :: HNil]("x,x") // (x,x)

or fix the type and then use it:

def parse3(s: String) = toTuple[String :: String :: String :: HNil](s)

parse3("x,x,x") // (x,x,x)

Note that the maximum tuple size is limited to 22, so it won't take long to list them all...
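
The limit also points to an alternative when the column count is large or only known at run time: a tuple's arity has to be fixed at compile time, so one option is to keep each row as a collection instead. A minimal sketch, assuming positional access to the fields is acceptable; RowAsVector and splitLine are hypothetical names.

```scala
object RowAsVector {
  // Strips the surrounding parentheses and returns every field,
  // however many there are, as an immutable Vector.
  def splitLine(line: String): Vector[String] =
    line.replaceAll("[()]", "").split(",").map(_.trim).toVector
}
```

Mapping an RDD[String] with rdd.map(line => RowAsVector.splitLine(line)) would then give an RDD[Vector[String]], with fields read by index, for example row(0).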

By the way, the book Spark in Action says on page 110:

There's no elegant way to convert an array to a tuple, so you have to resort to this ugly expression:

scala> val itPostsRDD = itPostsSplit.map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12)))
itPostsRDD: org.apache.spark.rdd.RDD[(String, String, ...
