
Split RDD with Apache Spark Scala

I have the following RDD:

x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

I just want to get the first part of every element of this RDD, as you can see in the next example:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

This is what I am trying:

// Here I get the RDD:
var x = spark.sparkContext.parallelize(List(value)).collect()
// Try to split it
x = x.map(g => g.split(",")(0))

This is what I am trying, but it is not working for me because I just get the first item:

Et: NT=grouptoClassify

I am new to Apache Spark with Scala. If you need anything more, just tell me. Thanks in advance!
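
For reference, a minimal sketch of why only the first item comes back, assuming value is a single multi-line string (as the Array output above suggests) and spark is the usual SparkSession from spark-shell:

val value = "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready\n" +
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready"

// parallelize(List(value)) builds an RDD with ONE element (the whole string),
// so split(",")(0) runs once and returns only the text before the first comma:
val collected = spark.sparkContext.parallelize(List(value)).collect()
collected.map(g => g.split(",")(0))
// Array(Et: NT=grouptoClassify)

// Splitting the string into lines first yields one element per group line:
value.split("\n").map(_.split(",")(0))
// Array(Et: NT=grouptoClassify, group: NT=app_1)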

Avoid var with Spark RDDs and DataFrames. Use immutable val objects:

  val rdd = 
    sparkSession
      .sparkContext
      .parallelize(List(value))
      .map(g => g.split(",").head)
  
  val strings = rdd.collect()
  • collect is called at the end. This retrieves the distributed data onto the driver, i.e. those strings (see the fuller sketch below).
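
For completeness, here is a self-contained sketch of the same approach. The local-mode session, the example input list, and the assumption that each group line is a separate element (rather than one multi-line string) are mine, not from the question:

  import org.apache.spark.sql.SparkSession

  // Local-mode session, just for this sketch
  val sparkSession = SparkSession.builder()
    .appName("split-rdd-sketch")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical input: one list element per group line
  val value = List(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR"
  )

  val rdd =
    sparkSession
      .sparkContext
      .parallelize(value)          // distribute the lines
      .map(g => g.split(",").head) // keep only the part before the first comma

  val strings = rdd.collect()      // bring the results back to the driver
  strings.foreach(println)
  // Et: NT=grouptoClassify
  // group: NT=app_1
  // group: NT=app_exmpl_2
  // group: NT=apprexec
  // group: NT=nt_prblm_app

  sparkSession.stop()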

Here is the RDD programming guide.
