
Spark: How to transform an RDD to a Seq to be used in a pipeline

I want to use the pipeline implementation in MLlib. Previously, I had an RDD and passed it to the model creation step directly, but to use a pipeline, a sequence of LabeledDocument objects has to be passed in.

My RDD is created as follows:

val data = sc.textFile("/test.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  // parts.tail is Array[String]; convert to Double before building the vector
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()

In the pipeline example in the Spark Programming Guide, the pipeline needs the following data:

// Prepare training documents, which are labeled.
val training = sparkContext.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0),
  LabeledDocument(4L, "b spark who", 1.0),
  LabeledDocument(5L, "g d a y", 0.0),
  LabeledDocument(6L, "spark fly", 1.0),
  LabeledDocument(7L, "was mapreduce", 0.0),
  LabeledDocument(8L, "e spark program", 1.0),
  LabeledDocument(9L, "a e c l", 0.0),
  LabeledDocument(10L, "spark compile", 1.0),
  LabeledDocument(11L, "hadoop software", 0.0)))
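
For context, LabeledDocument in that example is a plain case class; a minimal sketch of its shape, assuming the definition used in the guide's example code:

case class LabeledDocument(id: Long, text: String, label: Double)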

I need a way to change my RDD (parsedData) into a sequence of LabeledDocuments, like training in the example above.

I appreciate your help.

I found an answer to this question.

I can transform my RDD (parsedData) into a SchemaRDD, which is a sequence of LabeledDocuments, with the following code:

val rddSchema = parsedData.toSchemaRDD
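
For that call to compile, the implicit conversion from SQLContext has to be in scope; a minimal sketch of the surrounding setup, assuming Spark 1.x (the versions where SchemaRDD exists):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// brings the implicit RDD-to-SchemaRDD conversion into scope
import sqlContext.createSchemaRDD

val rddSchema = parsedData.toSchemaRDD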

Now the problem has changed! I want to split the new rddSchema into training (80%) and test (20%) sets. If I use randomSplit, it returns an Array[RDD[Row]] instead of a SchemaRDD.

New problem: How do I transform an Array[RDD[Row]] into a SchemaRDD -- OR -- how do I split a SchemaRDD so that the results are SchemaRDDs?
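
One way to get SchemaRDDs back after the split, sketched under the assumption of Spark 1.x, where SQLContext.applySchema can re-attach a schema to an RDD[Row]:

// randomSplit on a SchemaRDD yields plain RDD[Row] pieces;
// applySchema restores the original schema on each piece
val Array(trainRows, testRows) = rddSchema.randomSplit(Array(0.8, 0.2), seed = 11L)
val trainingSchemaRDD = sqlContext.applySchema(trainRows, rddSchema.schema)
val testSchemaRDD     = sqlContext.applySchema(testRows, rddSchema.schema)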

I tried the following in PySpark:

import re

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

htf = HashingTF(50000)  # feature dimension is an assumption; adjust to your data


def myFunc(s):
    # strip quotes, then split the CSV line into fields (all strings)
    s = re.sub("\"", "", s)
    words = s.split(",")
    val = words[0]
    # map the sentiment field to a binary label: "4" -> 0.0, "0" -> 1.0
    lbl = 0.0
    if val == "4":
        lbl = 0.0
    elif val == "0":
        lbl = 1.0

    cleanlbl = cleanLine(words[5], True, val)  # cleanLine: the author's text-cleaning helper
    return LabeledPoint(lbl, htf.transform(cleanlbl.split(" ")))


sparseList = sc.textFile("hdfs:///stats/training.1600000.processed.noemoticon.csv").map(myFunc)

sparseList.cache()  # Cache data since Logistic Regression is an iterative algorithm.


trainfeats, testfeats = sparseList.randomSplit([0.8, 0.2], 10)

You can also split while parsing the data; you can hack into the parsing step and change it as per your needs, as in the sketch below.
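
A minimal sketch of that idea in Scala, tagging each record with a random draw at parse time (assuming the data and LabeledPoint imports from the question; note that randomSplit is usually preferable, since a bare Random is not stable if the RDD is recomputed):

import scala.util.Random

val tagged = data.map { line =>
  val parts = line.split(',')
  val point = LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
  // tag roughly 80% of records as training at parse time
  (Random.nextDouble() < 0.8, point)
}
val trainfeats = tagged.filter(_._1).map(_._2)
val testfeats  = tagged.filter(!_._1).map(_._2)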
