
Scala Spark count regex matches in a file

I am learning Spark and Scala and I am stuck on this problem. I have one file that contains many sentences and another file with a large number of regular expressions. Both files have one element per line.

What I want is to count how many times each regex matches across the whole sentences file. For example, if the sentences file (after becoming an array or list) were ["hello world and hello life", "hello im fine", "what is your name"] and the regex file were ["hello \\w+", "what \\w+ your", ...], then I would like the output to be something like [("hello \\w+", 3), ("what \\w+ your", 1), ...].

My code is like this:

object PatternCount_v2 {
def main(args: Array[String]) {
    // The text where we will find the patterns
    val inputFile = args(0);
    // The list of patterns 
    val inputPatterns = args(1)
    val outputPath = args(2);

    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Load the text file
    val textFile = sc.textFile(inputFile).cache()
    // Load the patterns
    val patterns = Source.fromFile(inputPatterns).getLines.map(line => line.r).toList

    val patternCounts = textFile.flatMap(line => {
        println(line)
        patterns.foreach(
            pattern => {
                println(pattern)
                (pattern,pattern.findAllIn(line).length )

            }
        )
    }

    )
    patternCounts.saveAsTextFile(outputPath)


}}

But the compiler complains (the error message was posted as a screenshot, which is not reproduced here).

If I change flatMap to map, the code runs but returns a bunch of empty tuples: () () () ()

Please help! This is driving me crazy. Thanks,

As far as I can see, there are two issues here:

  1. You should use map instead of foreach: foreach returns Unit. It performs an action, potentially with side effects, on each element of a collection, and it doesn't return a new collection. map, on the other hand, transforms a collection into a new one by applying the supplied function to each element.

  2. You're missing the part where you aggregate the results of flatMap to get the actual count per "key" (pattern). This can be done easily with reduceByKey.
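The difference described in the first point can be seen on plain Scala collections, without Spark. This is only a minimal sketch: the object name ForeachVsMap and the sample line are made up for illustration, and the pattern's source string (Regex.regex) is used as the first tuple element so the result is easy to compare.

```scala
import scala.util.matching.Regex

// Hypothetical names, for illustration only.
object ForeachVsMap {
  val patterns: List[Regex] = List("hello \\w+".r, "what \\w+ your".r)
  val line = "hello world and hello life"

  // foreach runs the function for its side effects and discards each tuple,
  // so the whole expression has type Unit.
  val viaForeach: Unit = patterns.foreach(p => (p.regex, p.findAllIn(line).length))

  // map keeps each tuple and collects them into a new list.
  val viaMap: List[(String, Int)] = patterns.map(p => (p.regex, p.findAllIn(line).length))
}
```

This is also why swapping flatMap for map in the question produced only () values: the function passed to it ended in a foreach call, so it returned Unit for every line.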

Altogether, this does what you need:

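val patternCounts = textFile
  // Key by the pattern's source string: Regex has no value-based equals/hashCode,
  // so Regex objects used as keys may not merge reliably in reduceByKey.
  .flatMap(line => patterns.map(pattern => (pattern.regex, pattern.findAllIn(line).length)))
  .reduceByKey(_ + _)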
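To check the logic against the example from the question without starting Spark, the same two steps can be tried on ordinary Scala lists. This is a rough sketch under a few assumptions: the object name LocalPatternCount is hypothetical, groupBy followed by a sum stands in for Spark's reduceByKey, and the pairs are keyed by the pattern's source string (Regex.regex) rather than by the Regex object.

```scala
// Hypothetical local stand-in for the Spark job, using the question's sample data.
object LocalPatternCount {
  val sentences = List("hello world and hello life", "hello im fine", "what is your name")
  val patterns = List("hello \\w+", "what \\w+ your").map(_.r)

  // Step 1: emit one (pattern string, match count) pair per line and pattern.
  val pairs: List[(String, Int)] =
    sentences.flatMap(line => patterns.map(p => (p.regex, p.findAllIn(line).length)))

  // Step 2: sum the counts per key; on a plain collection, groupBy plus sum
  // plays the role that reduceByKey(_ + _) plays on an RDD.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).sum }

  def main(args: Array[String]): Unit = counts.foreach(println)
}
```

With the sample sentences, counts holds 3 for "hello \\w+" and 1 for "what \\w+ your", which matches the output the question asks for.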
