Pattern matching - Spark Scala RDD

Question

I am new to Spark and Scala, coming from an R background. After a few transformations of an RDD, I get an RDD of type:

Description: RDD[(String, Int)]

Now I want to apply a regular expression to the String part of the RDD, extract substrings from it, and add just those substrings as new columns.

Input data:

BMW 1er Model,278
MINI Cooper Model,248

Output I am looking for:

Input                 | Brand | Series
BMW 1er Model,278     | BMW   | 1er
MINI Cooper Model,248 | MINI  | Cooper

where Brand and Series are newly computed substrings of the String in each row.

What I have done so far:

I could achieve this for a single String using a regular expression, but I cannot apply it to all lines.

 val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI

Then I can use

brandRegEx.findFirstIn("hello this mini is bmW testing")

But how can I use it on all the lines of the RDD, and apply different regular expressions, to get the output above?

I read about this code snippet, but I am not sure how to put it all together.

val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r

def getBrand(Col4: String) : String = Col4 match {
    case brandRegEx(str)  =>  
    case _ => ""
    return 'substring
}

Any help would be appreciated!

Thanks

Answer 1

To apply your regex to each item in the RDD, use the RDD map function, which transforms each row in the RDD using some function (in this case, a partial function, so that the two parts of the tuple that makes up each row can be extracted):

import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model",278),
    ("MINI Cooper Model",248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map{
    case (inString, inInt) =>
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
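      // findFirstIn returns an Option[String]; fall back to a marker when nothing matches
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")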
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}

Note that I think there are some problems in your regular expression, and you also need a regular expression for finding the series. This code outputs:

(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)

But once you correct your regexes for your needs, this is how you can apply them to each row.
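
As a rough illustration of what a corrected pattern might look like, the sketch below uses one regex with two capture groups and Scala's regex extractor in a pattern match. The pattern, the object name BrandSeriesSketch, and the function name extract are assumptions made for this example, not part of the original answer; it also assumes the brand is always the first word and the series the second. It runs on a plain Seq so that it needs no Spark dependency.

```scala
object BrandSeriesSketch {
  // Assumed pattern: brand (BMW or MINI, case-insensitive) followed by the series word.
  val brandSeries = """(?i)^\s*(bmw|mini)\s+(\S+).*$""".r

  // Returns (input, number, brand, series); brand and series are "" when the
  // pattern does not match the whole input string.
  def extract(row: (String, Int)): (String, Int, String, String) = row match {
    case (s @ brandSeries(brand, series), n) => (s, n, brand, series)
    case (s, n)                              => (s, n, "", "")
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("BMW 1er Model", 278), ("MINI Cooper Model", 248))
    data.map(extract).foreach(println)
  }
}
```

With an RDD[(String, Int)] such as dataRDD above, the same function could be passed to map, for example dataRDD.map(BrandSeriesSketch.extract).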

Answer 2

Hi, I was looking for another question and came across this one. The above problem can be solved using ordinary transformations.

// `collection` is assumed to be the Seq[(String, Int)] of input rows
val a = sc.parallelize(collection)
a.map { case (x, y) => x.split(" ")(0) + " " + x.split(" ")(1) }.collect
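
The snippet above joins brand and series into one string, and indexing the split result fails for a string with fewer than two words. A possible variant, sketched here with the hypothetical names SplitSketch and brandAndSeries, returns the two words as separate values and falls back to empty strings. It assumes words are separated by whitespace.

```scala
object SplitSketch {
  // Splits on runs of whitespace and returns (brand, series),
  // using "" for any word that is missing.
  def brandAndSeries(s: String): (String, String) =
    s.trim.split("\\s+") match {
      case Array(brand, series, _*) => (brand, series)
      case Array(brand)             => (brand, "")
      case _                        => ("", "")
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(("BMW 1er Model", 278), ("MINI Cooper Model", 248))
    // Keep the original row and append the two new fields.
    data.map { case (x, y) =>
      val (brand, series) = brandAndSeries(x)
      (x, y, brand, series)
    }.foreach(println)
  }
}
```

The same map body could be applied to the RDD a in place of the Seq.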

2 solutions

Solution 1
5 Accepted 2015-12-02 09:49:59

Solution 2
0 2016-10-29 04:58:11
