
Pattern matching - Spark Scala RDD

I am new to Spark and Scala, coming from an R background. After a few transformations, I end up with an RDD of type

Description: RDD[(String, Int)]

Now I want to apply a regular expression to the String in each element of the RDD, extract substrings from it, and add just those substrings as new columns.

Input data:

BMW 1er Model,278
MINI Cooper Model,248

Output I am looking for:

Input                 | Brand | Series
BMW 1er Model,278     | BMW   | 1er
MINI Cooper Model,248 | MINI  | Cooper

where Brand and Series are new substrings computed from the String part of each RDD element.

What I have done so far:

I could achieve this for a single String using a regular expression, but I cannot apply it to all lines.

 val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI

Then I can use

brandRegEx.findFirstIn("hello this mini is bmW testing")

But how can I use it for all the lines of the RDD, and apply different regular expressions, to achieve the output above?

I read about this code snippet, but I am not sure how to put it all together.

val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r

def getBrand(Col4: String) : String = Col4 match {
    case brandRegEx(str)  =>  
    case _ => ""
    return 'substring
}

Any help would be appreciated!

Thanks

To apply your regex to each item in the RDD, use the RDD map function, which transforms each row in the RDD using some function (in this case, a partial function, in order to extract the two parts of the tuple that makes up each row):

import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model",278),
    ("MINI Cooper Model",248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map{
    case (inString, inInt) =>
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
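      // findFirstIn returns an Option[String]; fall back to a marker when nothing matches
      val brand = brandRegEx.findFirstIn(inString).getOrElse("NOT FOUND")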
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}

Note that I think your regular expression has some problems (for example, the MINI alternative requires a character before "MINI", so it does not match a string that starts with it), and you also need a regular expression for finding the series. This code outputs:

(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)

But once you correct your regexes for your needs, this is how you can apply them to each row.
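As a rough illustration of what a corrected pattern could look like, here is a minimal sketch in plain Scala (no Spark needed to try it). The pattern, the object name BrandSeries and the helper extract are assumptions made for this example, not part of the answer above; it assumes the brand is either BMW or MINI at the start of the string and that the series is the next whitespace-separated token.

```scala
// Sketch only: BrandSeries, brandSeries and extract are hypothetical names for this example.
object BrandSeries {
  // (?i) makes the match case-insensitive.
  // Group 1 captures the brand, group 2 captures the token that follows it.
  val brandSeries = """(?i)^\s*(bmw|mini)\s+(\S+).*$""".r

  // A Regex used as an extractor must match the whole string;
  // its capture groups are bound to the pattern variables.
  def extract(s: String): (String, String) = s match {
    case brandSeries(brand, series) => (brand, series)
    case _                          => ("NOT FOUND", "NOT FOUND")
  }
}
```

Inside the map from the answer, this could be used as `val (brand, series) = BrandSeries.extract(inString)` followed by `(inString, inInt, brand, series)`.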

Hi, I was just looking for another question and came across this one. The above problem can be solved using ordinary transformations.

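// `collection` is the Seq of (String, Int) pairs, e.g. the `data` from the answer above
val a = sc.parallelize(collection)
// Split each string once and keep the first two words, e.g. "BMW 1er"
a.map { case (x, _) => x.split(" ").take(2).mkString(" ") }.collect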
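To get the brand and series as separate columns with this split-based approach, something like the following sketch might work. It runs on a plain Seq so it can be tried without a cluster; the same function body could be passed to map on an RDD. The names SplitExample and brandAndSeries are hypothetical, and the sketch assumes the brand and the series are always the first two whitespace-separated words.

```scala
// Sketch only: SplitExample and brandAndSeries are hypothetical names for this example.
object SplitExample {
  // Splits on whitespace and returns the first two tokens as (brand, series).
  // Missing tokens are returned as empty strings instead of throwing.
  def brandAndSeries(s: String): (String, String) =
    s.trim.split("\\s+") match {
      case Array(brand, series, _*)       => (brand, series)
      case Array(brand) if brand.nonEmpty => (brand, "")
      case _                              => ("", "")
    }

  // Same sample rows as in the question.
  val collection = Seq(("BMW 1er Model", 278), ("MINI Cooper Model", 248))

  // With Spark this would be sc.parallelize(collection).map { ... }.
  val result = collection.map { case (x, y) =>
    val (brand, series) = brandAndSeries(x)
    (x, y, brand, series)
  }
}
```

Unlike the regex version, this does not check that the first word is actually a known brand, so it only fits data that always follows the "brand series ..." layout.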
