How to convert RDD[String] to RDD[(String, String)]?

Question

I have an RDD[String] read from a file:

val file = sc.textFile("/path/to/myData.txt")

The format of myData.txt:

>str1_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
...
FJDLALLLGL //the last line of str1
>str2_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
...
FJDLALLLGL //the last line of str2
>str3_name
...

How can I transform the data from the file into an RDD[(String, String)]? For instance:

trancRDD(
(str1_name, ATCGGKFKKVKKFKRLFFVLFLRLFDJKALGFJVKRIKFKVKFGKLRL), 
(str2_name, ATCGGKFKKVKKFKRLFFVLFLRLFDJKALGFJVKRIKFKVKFGKLRL),
...
)

Answer 1

If there is a well-defined record separator, such as the ">" shown above, this can be done with a custom Hadoop configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", ">")
// genome.txt contains the records provided in the question without the "..."
val dataset = sc.newAPIHadoopFile("./data/genome.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val data = dataset.map(x=>x._2.toString)

Let's have a look at the data:

data.collect
res11: Array[String] = 
Array("", "str1_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
FJDLALLLGL 
", "str2_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
FJDLALLLGL
")

We can easily make records out of each of these Strings:

val records =  data.map{ multiLine => val lines = multiLine.split("\n"); (lines.head, lines.tail)}
records.collect
res14: Array[(String, Array[String])] = Array(("",Array()),
       (str1_name,Array(ATCGGKFKKVKKFKRLFFVLFLRL, FDJKALGFJVKRIKFKVKFGKLRL, FJDLALLLGL)),
       (str2_name,Array(ATCGGKFKKVKKFKRLFFVLFLRL, FDJKALGFJVKRIKFKVKFGKLRL, FJDLALLLGL)))

(Use filter to remove that first empty record; that is left as an exercise for the reader.)
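The records above still hold the sequence as an Array[String], while the question asks for a single String per name. A minimal sketch of that last step follows, assuming each chunk has the name on its first line and sequence lines after it. `FastaRecords` and `toRecord` are hypothetical names introduced for illustration, not part of the original answer, and the demo runs on a local Seq rather than on an RDD so that it does not need a SparkContext.

```scala
object FastaRecords {
  // Hypothetical helper (not from the answer): turn one ">"-delimited chunk
  // into (name, sequence). The first line is the name; the remaining lines
  // are trimmed and concatenated into a single String.
  // Assumes the chunk contains at least one non-blank line.
  def toRecord(chunk: String): (String, String) = {
    val lines = chunk.split("\n").map(_.trim).filter(_.nonEmpty)
    (lines.head, lines.tail.mkString)
  }

  def main(args: Array[String]): Unit = {
    // Local stand-in for the chunks produced by the ">" record delimiter,
    // including the empty chunk that precedes the first ">".
    val chunks = Seq("", "str1_name\nATCGG\nFDJKA \n", "str2_name\nATCGG\nFJDLA\n")
    // Drop blank chunks first, then parse the rest.
    val parsed = chunks.filter(_.trim.nonEmpty).map(toRecord)
    parsed.foreach(println)
  }
}
```

In the spark-shell session above, the same function could presumably be applied as `data.filter(_.trim.nonEmpty).map(chunk => FastaRecords.toRecord(chunk))`, which should yield an RDD[(String, String)]. Filtering before mapping matters here, because `lines.head` would throw on the empty first chunk.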

Answer 1: accepted, 2014-11-15 23:01:30
