
Split RDD with Apache Spark Scala

I have the following RDD:

x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

I just want to get the first part of every element of this RDD, as you can see in the next example:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

This is what I am trying:

// Here I get the RDD:
var x = spark.sparkContext.parallelize(List(value)).collect()
// Try to split it
x = x.map(g => g.split(",")(0))

This is what I am trying, but it is not working for me because I just get the first item:

Et: NT=grouptoClassify

I am new to Apache Spark with Scala. If you need anything more, just tell me. Thanks in advance!
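
For reference, a minimal sketch of why only the first item comes back, assuming value is a single multi-line string (as the Array output above suggests) and spark is the usual SparkSession from spark-shell:

val value = "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready\n" +
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready"

// parallelize(List(value)) builds an RDD with ONE element (the whole string),
// so split(",")(0) runs once and returns only the text before the first comma:
val collected = spark.sparkContext.parallelize(List(value)).collect()
collected.map(g => g.split(",")(0))
// Array(Et: NT=grouptoClassify)

// Splitting the string into lines first yields one element per group line:
value.split("\n").map(_.split(",")(0))
// Array(Et: NT=grouptoClassify, group: NT=app_1)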

Avoid var with Spark RDDs and DataFrames. Use immutable val objects:

  val rdd = 
    sparkSession
      .sparkContext
      .parallelize(List(value))
      .map(g => g.split(",").head)
  
  val strings = rdd.collect()
  • collect is called at the end. This retrieves the distributed data onto the driver, i.e. those strings (see the fuller sketch below).
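
For completeness, here is a self-contained sketch of the same approach. The local-mode session, the example input list, and the assumption that each group line is a separate element (rather than one multi-line string) are mine, not from the question:

  import org.apache.spark.sql.SparkSession

  // Local-mode session, just for this sketch
  val sparkSession = SparkSession.builder()
    .appName("split-rdd-sketch")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical input: one list element per group line
  val value = List(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR"
  )

  val rdd =
    sparkSession
      .sparkContext
      .parallelize(value)          // distribute the lines
      .map(g => g.split(",").head) // keep only the part before the first comma

  val strings = rdd.collect()      // bring the results back to the driver
  strings.foreach(println)
  // Et: NT=grouptoClassify
  // group: NT=app_1
  // group: NT=app_exmpl_2
  // group: NT=apprexec
  // group: NT=nt_prblm_app

  sparkSession.stop()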

Here is the RDD programming guide.
