How to create bigrams with frequency counts from a text file in Spark / Scala?

Question

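I want to take a text file and create bigrams of all words that are not separated by ".", removing any special characters. I am trying to do this with Spark and Scala.

This text:

Hello my Friend. How are
you today? bye my friend.

Should produce the following:

hello my, 1
my friend, 2
how are, 1
you today, 1
today bye, 1
bye my, 1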

Answer 1

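For each line in the RDD, first split on '.'. Then tokenize each resulting substring by splitting on ' '. Once tokenized, remove special characters with replaceAll and convert to lowercase. Each of these sublists can then be turned by sliding into an iterator of string arrays, each array holding one bigram.

Then, after flattening and converting the bigram arrays to strings with mkString as requested, get a count for each bigram with groupBy and mapValues.

Finally, flatten, reduce, and collect the (bigram, count) tuples from the RDD. Because each line is processed on its own, a bigram that spans a line break (such as "are you" in the sample text) is not produced, which matches the expected output in the question.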

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map{ 

    // Split each line into substrings by periods
    _.split('.').map{ substrings =>

        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').

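        // Remove non-alphanumeric characters, using Shyamendra's
        // clean replacement technique, and convert to lowercase
        map{_.replaceAll("""\W""", "").toLowerCase()}.

        // Drop tokens left empty by the cleanup or by repeated spaces
        filter{_.nonEmpty}.

        // Find bigrams, skipping the short window that sliding emits
        // for a sentence with fewer than two words
        sliding(2).filter{_.length == 2}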
    }.

    // Flatten, and map the bigrams to concatenated strings
    flatMap{identity}.map{_.mkString(" ")}.

    // Group the bigrams and count their frequency
    groupBy{identity}.mapValues{_.size}

}.

// Reduce to get a global count, then collect
flatMap{identity}.reduceByKey(_+_).collect.

// Format and print
foreach{x=> println(x._1 + ", " + x._2)}

you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1
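
The per-line part of this pipeline does not depend on Spark, so it can be tried on ordinary Scala collections before running it on a cluster. The following is a minimal sketch under that assumption, not the answer's exact code: the names LocalBigrams, bigramsInLine and bigramCounts are hypothetical, and the sketch drops empty tokens and windows shorter than two words.

```scala
object LocalBigrams {

  // Bigrams of one line; a bigram never crosses a '.' boundary.
  def bigramsInLine(line: String): Seq[String] =
    line.split('.').toSeq.flatMap { sentence =>
      val tokens = sentence.trim.split(' ')
        .map(_.replaceAll("""\W""", "").toLowerCase) // strip special characters
        .filter(_.nonEmpty)                          // drop empty tokens
      tokens.sliding(2)
        .filter(_.length == 2)                       // skip one-word sentences
        .map(_.mkString(" "))
        .toSeq
    }

  // Frequency of every bigram over all lines (plain collections, no Spark).
  def bigramCounts(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(bigramsInLine)
      .groupBy(identity)
      .map { case (bigram, occurrences) => (bigram, occurrences.size) }
}
```

On an RDD of lines, the same function could presumably be reused as rdd.flatMap(LocalBigrams.bigramsInLine).map((_, 1)).reduceByKey(_ + _), which replaces the local groupBy with a distributed reduce.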

Answer 2

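To treat whole words separately from any punctuation, consider for instance (with text holding the whole input as one String)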

val words = text.split("\\W+")

which in this case delivers

Array[String] = Array(Hello, my, Friend, How, are, you, today, bye, my, friend)

Pairing the words into tuples fits the concept of a bigram more closely, so consider

for( Array(a,b,_*) <- words.sliding(2).toArray ) 
yield (a.toLowerCase(), b.toLowerCase())

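which yields

Array((hello,my), (my,friend), (friend,how), (how,are), 
      (are,you), (you,today), (today,bye), (bye,my), (my,friend))

Note that splitting on \W+ discards the periods, so pairs such as (friend,how) cross a sentence boundary, and that this snippet does not count frequencies. ohruunuruus's answer conveys a concise approach.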
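
To get from the pairs above to frequency counts, the tuples can be grouped and counted. This is a minimal sketch on plain Scala collections, assuming the whole input is available as one String; the names PairCounts and pairCounts are hypothetical and not part of the answer. Like the snippet above, it ignores sentence boundaries.

```scala
object PairCounts {

  // Counts each (first, second) word pair in the text.
  // Assumption: text holds the entire input as a single String.
  def pairCounts(text: String): Map[(String, String), Int] = {
    val words = text.split("\\W+")
      .filter(_.nonEmpty)       // split can yield a leading empty string
      .map(_.toLowerCase)

    words.sliding(2)
      .collect { case Array(a, b) => (a, b) } // ignores a lone single word
      .toSeq
      .groupBy(identity)
      .map { case (pair, occurrences) => (pair, occurrences.size) }
  }
}
```

Using collect with a two-element pattern, rather than a for comprehension over Array(a,b,_*), keeps an input with only one word from producing a pair at all.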

Answer 3

This should work in Spark:

def bigramsInString(s: String): Array[((String, String), Int)] = { 

    s.split("""\.""")                        // split on .
     .map(_.split(" ")                       // split on space
           .filter(_.nonEmpty)               // remove empty string
           .map(_.replaceAll("""\W""", "")   // remove special chars
                 .toLowerCase)
           .filter(_.nonEmpty)                
           .sliding(2)                       // take continuous pairs
           .filter(_.size == 2)              // sliding can return partial
           .map{ case Array(a, b) => ((a, b), 1) })
     .flatMap(x => x)                         
}

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))

rdd.map(bigramsInString)
   .flatMap(x => x)             
   .countByKey                   // get result in driver memory as Map
   .foreach{ case ((x, y), z) => println(s"${x} ${y}, ${z}") }

// my friend, 2
// how are, 1
// today bye, 1
// bye my, 1
// you today, 1
// hello my, 1
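
The filter on window size in bigramsInString is there because sliding(2) over a single token still yields one window, of length one, which the case Array(a, b) pattern would not match. The sketch below only illustrates that behaviour on plain arrays; SlidingDemo and windows are hypothetical names.

```scala
object SlidingDemo {

  // Windows of size 2 over the tokens, converted to lists so that
  // they can be compared with ==.
  def windows(tokens: Array[String]): List[List[String]] =
    tokens.sliding(2).map(_.toList).toList
}

// SlidingDemo.windows(Array("bye", "my", "friend"))
//   gives List(List("bye", "my"), List("my", "friend"))
// SlidingDemo.windows(Array("bye"))
//   gives List(List("bye")), a partial window that must be filtered out
```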


3 answers

Answer 1
8, accepted, 2015-04-18 04:00:27

Answer 2
1, 2015-04-18 06:26:24

Answer 3
1, 2015-04-18 06:29:40
