
Scala Spark count regex matches in a file

I am learning Spark+Scala and I am stuck with this problem. I have one file that contains many sentences, and another file with a large number of regular expressions. Both files have one element per line.

What I want is to count how many times each regex matches across the whole sentences file. For example, if the sentences file (once loaded into an array or list) were ["hello world and hello life", "hello im fine", "what is your name"] and the regex file were ["hello \\w+", "what \\w+ your", ...], then I would like the output to be something like [("hello \\w+", 3), ("what \\w+ your", 1), ...].
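For reference, the counting itself can be reproduced on the example above with plain Scala regexes (no Spark involved); `findAllIn` returns an iterator over the non-overlapping matches in a string:

```scala
// Counting regex matches over the example sentences with plain Scala (no Spark).
val sentences = List("hello world and hello life", "hello im fine", "what is your name")
val patterns  = List("hello \\w+".r, "what \\w+ your".r)

// For each pattern, sum the number of non-overlapping matches per sentence.
val counts = patterns.map(p => (p.toString, sentences.map(s => p.findAllIn(s).length).sum))
// counts: List((hello \w+,3), (what \w+ your,1))
```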

My code is like this:

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object PatternCount_v2 {
  def main(args: Array[String]) {
    // The text where we will find the patterns
    val inputFile = args(0)
    // The list of patterns
    val inputPatterns = args(1)
    val outputPath = args(2)

    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Load the text file
    val textFile = sc.textFile(inputFile).cache()
    // Load the patterns
    val patterns = Source.fromFile(inputPatterns).getLines.map(line => line.r).toList

    val patternCounts = textFile.flatMap(line => {
      println(line)
      patterns.foreach(pattern => {
        println(pattern)
        (pattern, pattern.findAllIn(line).length)
      })
    })

    patternCounts.saveAsTextFile(outputPath)
  }
}

But the compiler complains:

[screenshot of the compiler error]

If I change the flatMap to just map, the code runs but the output is a bunch of empty tuples: () () () ()

Please help! This is driving me crazy. Thanks.

As far as I can see, there are two issues here:

  1. You should use map instead of foreach: foreach returns Unit; it performs an action with a potential side effect on each element of a collection but does not return a new collection. map, on the other hand, transforms a collection into a new one by applying the supplied function to each element.

  2. You're missing the step where you aggregate the results of flatMap to get the actual count per "key" (pattern). This can be done easily with reduceByKey.
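The map-vs-foreach difference is easy to see on a plain list:

```scala
val xs = List(1, 2, 3)

// map builds and returns a new collection from the results.
val mapped = xs.map(x => x * 2)        // List(2, 4, 6)

// foreach runs the function only for its side effects; every result is
// discarded and the whole expression evaluates to Unit, printed as ().
val foreached = xs.foreach(x => x * 2) // (): Unit
```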

Altogether, this does what you need:

val patternCounts = textFile
  .flatMap(line => patterns.map(pattern => (pattern, pattern.findAllIn(line).length)))
  .reduceByKey(_ + _)
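As a sanity check, the same pipeline shape can be run locally on plain Scala collections, with reduceByKey(_ + _) modeled by groupBy plus a sum. One caveat worth noting: this sketch keys by pattern.toString rather than by the Regex object itself, since scala.util.matching.Regex does not override equals, and relying on object identity as a key across a Spark shuffle can fail to merge counts for the same pattern.

```scala
// Local sanity check of the flatMap + aggregate shape (no Spark needed).
val sentences = List("hello world and hello life", "hello im fine", "what is your name")
val patterns  = List("hello \\w+".r, "what \\w+ your".r)

// One (patternString, countInLine) pair per pattern per line, as in the flatMap.
val pairs = sentences.flatMap(line =>
  patterns.map(p => (p.toString, p.findAllIn(line).length)))

// Aggregate per key, like reduceByKey(_ + _).
val totals = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// totals: Map(hello \w+ -> 3, what \w+ your -> 1)
```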
