
Spark how to transform RDD[Seq[(String, String)]] to RDD[(String, String)]

I have a Spark RDD[Seq[(String, String)]] which contains several groups of word pairs. Now I have to save them to a file in HDFS like this (no matter which Seq they are in):

dog cat
cat mouse
mouse milk

Could someone help me with this? Thanks a lot <3

EDIT: Thanks for your help. Here is the solution:

Code

val seqTermTermRDD: RDD[Seq[(String, String)]] = ...
val termTermRDD: RDD[(String, String)] = seqTermTermRDD.flatMap(identity)
val combinedTermsRDD: RDD[String] = termTermRDD.map{ case(term1, term2) => term1 + " " + term2 }
combinedTermsRDD.saveAsTextFile(outputFile)

RDDs have a neat function called flatMap that will do exactly what you want. Think of it as a map followed by a flatten (except implemented a little more intelligently): if the function produces multiple entities, each is added to the result separately. (You can also use flatMap on many other collection types in Scala.)

val seqRDD = sc.parallelize(Seq(Seq(("dog","cat"),("cat","mouse"),("mouse","milk"))),1)
val tupleRDD = seqRDD.flatMap(identity)
tupleRDD.collect  //Array((dog,cat), (cat,mouse), (mouse,milk))

Note that I also use the Scala function identity, because flatMap expects a function that turns an element of the RDD's type into a TraversableOnce, which a Seq counts as.
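As a side note, on plain Scala collections flatMap(identity) is equivalent to flatten, so you can try the same flattening step in the REPL without Spark at all. A minimal sketch:

```scala
// Plain Scala collections, no Spark: flatMap(identity) flattens one level of nesting.
val nested: Seq[Seq[(String, String)]] =
  Seq(Seq(("dog", "cat"), ("cat", "mouse")), Seq(("mouse", "milk")))

val flat: Seq[(String, String)] = nested.flatMap(identity)
// Same result as nested.flatten
println(flat)  // List((dog,cat), (cat,mouse), (mouse,milk))
```

Spark's RDD.flatMap works the same way, just distributed across partitions.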

You can also use the mkString(sep) function (where sep is the separator) on Scala collections. Here are some examples. (Note that in your code you would replace the final .collect().mkString("\n") with saveAsTextFile(filepath) to save to Hadoop.)

scala> val rdd = sc.parallelize(Seq(  Seq(("a", "b"), ("c", "d")),  Seq( ("1", "2"), ("3", "4") )      ))
rdd: org.apache.spark.rdd.RDD[Seq[(String, String)]] = ParallelCollectionRDD[6102] at parallelize at <console>:71

scala> rdd.map( _.mkString("\n")) .collect().mkString("\n")
res307: String = 
(a,b)
(c,d)
(1,2)
(3,4)

scala> rdd.map( _.mkString("|")) .collect().mkString("\n")
res308: String = 
(a,b)|(c,d)
(1,2)|(3,4)

scala> rdd.map( _.mkString("\n")).map(_.replace("(", "").replace(")", "").replace(",", " ")) .collect().mkString("\n")
res309: String = 
a b
c d
1 2
3 4
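If the goal is exactly the space-separated output from the question, pattern matching on each tuple is sturdier than the string-replace chain above, which would misbehave if a term ever contained parentheses or commas. A sketch combining it with the earlier flatMap(identity), using the same rdd as in the REPL session:

```scala
// Flatten the nested Seqs, then format each pair as "term1 term2".
val lines = rdd.flatMap(identity)
  .map { case (term1, term2) => term1 + " " + term2 }

// In the REPL, to inspect:  lines.collect().mkString("\n")
// To write to HDFS instead: lines.saveAsTextFile(outputFile)
```

This is the same shape as the accepted solution at the top, just shown against the two-Seq example data.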
