
Split RDD with Apache Spark Scala

I have the following RDD:

x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR

I just want to get the first part of every element of this RDD, as you can see in the next example:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

This is what I am trying:

// Here I get the RDD:
var x = spark.sparkContext.parallelize(List(value)).collect()
// Try to split it
x = x.map(g => g.split(",")(0))

This is not working for me because I just get the first item:

Et: NT=grouptoClassify

I am new to Apache Spark with Scala. If you need anything more, just tell me. Thanks in advance!

Avoid var with Spark RDDs and DataFrames; use immutable val objects:

  val rdd = 
    sparkSession
      .sparkContext
      .parallelize(List(value))
      .map(g => g.split(",").head)
  
  val strings = rdd.collect()
  • collect is called at the end. This retrieves the distributed data to the driver, i.e. those strings.
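For a large RDD, collect can pull more data to the driver than fits in memory. A minimal sketch, reusing the rdd value from above, of inspecting just a few elements with take instead:

  // Retrieve only the first three elements to the driver
  val sample: Array[String] = rdd.take(3)
  sample.foreach(println)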

Here is the RDD programming guide.
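For completeness, a self-contained sketch using the sample data from the question; the local master setting, application name, and object name are assumptions made so the example runs standalone:

  import org.apache.spark.sql.SparkSession

  object SplitRddExample {
    def main(args: Array[String]): Unit = {
      // Local SparkSession; master and app name are placeholders for this sketch
      val sparkSession = SparkSession.builder()
        .master("local[*]")
        .appName("split-rdd-example")
        .getOrCreate()

      val lines = List(
        "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
        "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
        "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
        "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
        "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR"
      )

      // Keep only the first comma-separated field of each element
      val firstFields = sparkSession.sparkContext
        .parallelize(lines)
        .map(_.split(",").head)

      // Prints "Et: NT=grouptoClassify", "group: NT=app_1", ...
      firstFields.collect().foreach(println)

      sparkSession.stop()
    }
  }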
