
Regex RDD using Apache Spark Scala

I have the following RDD:

x: Array[String] =
  Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

I just want to get the first part of every element of this RDD, as you can see in the next example:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

To do it, I am trying this way:

// Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
// Try to use a regex on it; this regex should capture everything up to the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))

This is what I am trying, but it is not working for me because I just get an Array of Boolean. What am I doing wrong?
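(For context: String.matches tests whether the whole string conforms to the pattern and returns a Boolean, which is why the map above yields Array[Boolean]. A minimal sketch of extracting the matched text instead, reusing x and the same pattern with scala.util.matching.Regex:)

import scala.util.matching.Regex

// Compile the pattern; group 2 is the text before the first comma
val regex1: Regex = """(^(.+?),)""".r

// findFirstMatchIn returns Option[Match]; fall back to the full line if there is no comma
val firstParts = x.map(g => regex1.findFirstMatchIn(g).map(_.group(2)).getOrElse(g))
firstParts.foreach(println)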

I am new to Apache Spark with Scala. If you need anything more, just tell me. Thanks in advance!

Try with this regex:

^\s*([^,]+)(_\w+)?

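Applied in Scala, the capture can be pulled out with findFirstIn (a minimal sketch; the optional (_\w+)? group is dropped here because plain extraction does not need it):

val pattern = """^\s*([^,]+)""".r

val sample = "group: NT=app_1,hadoop-exec,sparkConnection,Ready"
// findFirstIn returns the matched prefix as an Option[String]
println(pattern.findFirstIn(sample).getOrElse(sample))  // group: NT=app_1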

Try this:

val x: Array[String] =
  Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

// Parallelize the sample lines into an RDD
val rdd = sc.parallelize(x)

// Keep only the text before the first comma of each line
val result = rdd.map(line => line.split(",")(0))

result.collect().foreach(println)

Output:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
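As an optional refinement (not part of the original answer), String.split accepts a limit argument, so each line only has to be split once:

// split(",", 2) splits at most once; index 0 is everything before the first comma
val result = rdd.map(_.split(",", 2)(0))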

With RDD:

import org.apache.spark.sql.Row

val el = Seq(
  "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
  "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
  "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

val rdd = spark.sparkContext.parallelize(el.map(Row(_)))

// Group 1 captures everything before the first comma
val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r

rdd.map {
  // bind the row's single field as a String so the regex extractor can match it
  case Row(str: String) => str match {
    case pattern(gr1, _) => gr1
  }
}.foreach(println(_))

It gives:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
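Wrapping each line in a Row is not actually required for a plain RDD[String]; a simplified variant of the same approach (not the answerer's original code) applies the regex directly:

// Same pattern, applied straight to an RDD[String]
val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r

spark.sparkContext.parallelize(el).map {
  case pattern(gr1, _) => gr1    // text before the first comma
  case other           => other  // a line with no comma passes through unchanged
}.foreach(println(_))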
