Spark: How to transform an RDD to a Seq to be used in a pipeline
I want to use the pipeline implementation in MLlib. Previously, I had an RDD and passed it to the model creation directly, but to use a pipeline there has to be a sequence of LabeledDocument instances passed to it.

My RDD is created as follows:
val data = sc.textFile("/test.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  // parts.tail is an Array[String]; convert each field to Double
  // before building the dense vector
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
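The per-line parsing above can be illustrated without a Spark cluster. This is a minimal Python sketch of the same split-and-convert logic; the function name `parse_line` is made up here for illustration:

```python
def parse_line(line):
    """Split a CSV line into (label, features): the first field is the
    label, the remaining fields are the feature values."""
    parts = line.split(",")
    label = float(parts[0])
    features = [float(x) for x in parts[1:]]
    return label, features

# Example: "1.0,0.5,2.5" -> (1.0, [0.5, 2.5])
print(parse_line("1.0,0.5,2.5"))
```

Each parsed pair corresponds to one `LabeledPoint(label, features)` in the Scala code.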
In the pipeline example from the Spark Programming Guide, the pipeline needs the following data:
// Prepare training documents, which are labeled.
val training = sparkContext.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0),
  LabeledDocument(4L, "b spark who", 1.0),
  LabeledDocument(5L, "g d a y", 0.0),
  LabeledDocument(6L, "spark fly", 1.0),
  LabeledDocument(7L, "was mapreduce", 0.0),
  LabeledDocument(8L, "e spark program", 1.0),
  LabeledDocument(9L, "a e c l", 0.0),
  LabeledDocument(10L, "spark compile", 1.0),
  LabeledDocument(11L, "hadoop software", 0.0)))
I need a way to change my RDD (parsedData) into a sequence of LabeledDocuments (like training in the example).

I appreciate your help.
I found an answer to this question. I can transform my RDD (parsedData) to a SchemaRDD, which is a sequence of LabeledDocuments, with the following code:
// requires the implicit conversion from the SQLContext:
// import sqlContext.createSchemaRDD
val rddSchema = parsedData.toSchemaRDD
Now the problem has changed! I want to split the new rddSchema into training (80%) and test (20%) sets. If I use randomSplit, it returns an Array[RDD[Row]] instead of a SchemaRDD.

New problem: how to transform Array[RDD[Row]] to SchemaRDD -- OR -- how to split a SchemaRDD so that the results are SchemaRDDs?
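For intuition, randomSplit's behavior can be mimicked on a plain Python list: each element is independently assigned to one of the parts with probability proportional to the weights, so the result is a random partition rather than an exact 80/20 cut. This is only a sketch of the semantics, not Spark's implementation:

```python
import random

def random_split(items, weights, seed):
    """Assign each item independently to one of len(weights) parts,
    with probability proportional to the weights (randomSplit-like)."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative probability bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for item in items:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                parts[i].append(item)
                break
    return parts

train, test = random_split(list(range(100)), [0.8, 0.2], seed=10)
# Every item lands in exactly one part
assert len(train) + len(test) == 100
```

Because membership is decided per element, the actual sizes only approximate the 0.8/0.2 weights; the seed makes the split reproducible.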
I tried the following in pyspark:
import re
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

htf = HashingTF()

def myFunc(s):
    s = re.sub("\"", "", s)  # strip quote characters
    words = s.split(",")
    val = words[0]
    lbl = 0.0
    if val == 4 or val == "4":
        lbl = 0.0
    elif val == 0 or val == "0":
        lbl = 1.0
    cleanlbl = cleanLine(words[5], True, val)  # cleanLine is a user-defined helper
    return LabeledPoint(lbl, htf.transform(cleanlbl.split(" ")))

sparseList = sc.textFile("hdfs:///stats/training.1600000.processed.noemoticon.csv").map(myFunc)
sparseList.cache()  # Cache data since Logistic Regression is an iterative algorithm.

trainfeats, testfeats = sparseList.randomSplit([0.8, 0.2], 10)
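The label mapping inside myFunc (the raw label 4 becomes class 0.0, raw label 0 becomes class 1.0) can be isolated and checked on its own without Spark; `to_label` is a hypothetical helper extracted here for illustration:

```python
def to_label(raw):
    """Map the raw CSV label to a binary class: 4 -> 0.0, 0 -> 1.0."""
    if raw == 4 or raw == "4":
        return 0.0
    elif raw == 0 or raw == "0":
        return 1.0
    return 0.0  # default, matching the original code's initial lbl value

print(to_label("4"), to_label(0))
```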
You can split while parsing the data; feel free to dig into this and adapt it to your needs.