
Spark - Word count test

I just want to count the words in Spark (PySpark), but all I manage to map is either the individual letters or the whole string.

I tried this (whole string):

v1='Hi hi hi bye bye bye word count' 
v1_temp=sc.parallelize([v1]) 
v1_map = v1_temp.flatMap(lambda x: x.split('\t'))
v1_counts = v1_map.map(lambda x: (x, 1))
v1_counts.collect()  

or this (letters only):

v1='Hi hi hi bye bye bye word count'
v1_temp=sc.parallelize(v1)
v1_map = v1_temp.flatMap(lambda x: x.split('\t'))
v1_counts = v1_map.map(lambda x: (x, 1))
v1_counts.collect()

When you do sc.parallelize(sequence) you are creating an RDD that will be operated on in parallel. In the first case, your sequence is a list with a single element (the whole sentence). In the second case, your sequence is a string, which in Python behaves like a list of characters.
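The distinction can be seen in plain Python, without Spark at all: parallelize() distributes the elements of whatever sequence it is given, so what matters is how that sequence iterates.

```python
v1 = 'Hi hi hi bye bye bye word count'

# A one-element list iterates as one item: the whole sentence.
print(list([v1]))      # ['Hi hi hi bye bye bye word count']

# A bare string iterates character by character.
print(list(v1)[:5])    # ['H', 'i', ' ', 'h', 'i']
```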

If you want to count the words in parallel, you can do the following:

from operator import add

s = 'Hi hi hi bye bye bye word count' 
seq = s.split()   # ['Hi', 'hi', 'hi', 'bye', 'bye', 'bye', 'word', 'count']
sc.parallelize(seq)\
  .map(lambda word: (word, 1))\
  .reduceByKey(add)\
  .collect()

This will give you:

[('count', 1), ('word', 1), ('bye', 3), ('hi', 2), ('Hi', 1)]
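Note that 'Hi' and 'hi' are counted as different words. If you want case-insensitive counts, lowercase each word before pairing it with 1 (in Spark that would be .map(lambda word: (word.lower(), 1)) before the reduceByKey). The effect can be checked locally with collections.Counter, which applies the same grouping logic:

```python
from collections import Counter

s = 'Hi hi hi bye bye bye word count'

# Case-sensitive, like the Spark result above: 'Hi' and 'hi' stay separate.
print(Counter(s.split()))

# Lowercase first, so 'Hi' and 'hi' merge into a single key.
print(Counter(w.lower() for w in s.split()))
```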

If you want to count only alphanumeric words, this might be one solution:

import time, re
from pyspark import SparkContext, SparkConf

def linesToWordsFunc(line):
    wordsList = line.split()
    wordsList = [re.sub(r'\W+', '', word) for word in wordsList]
    filtered = filter(lambda word: re.match(r'\w+', word), wordsList)
    return filtered

def wordsToPairsFunc(word):
    return (word, 1)

def reduceToCount(a, b):
    return (a + b)

def main():
    conf = SparkConf().setAppName("Words count").setMaster("local")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile("your_file.txt")

    words = rdd.flatMap(linesToWordsFunc)
    pairs = words.map(wordsToPairsFunc)
    counts = pairs.reduceByKey(reduceToCount)

    # Get the first top 100 words
    output = counts.takeOrdered(100, key=lambda kv: -kv[1])

    for (word, count) in output:
        print(word + ': ' + str(count))

    sc.stop()

if __name__ == "__main__":
    main()
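The word-cleaning step in linesToWordsFunc can be checked on its own, without a SparkContext. A minimal standalone version of the same logic (the function name and the sample line here are just for illustration):

```python
import re

def lines_to_words(line):
    # Split on whitespace, strip non-word characters from each token,
    # then keep only tokens that still contain a word character.
    words = [re.sub(r'\W+', '', w) for w in line.split()]
    return [w for w in words if re.match(r'\w+', w)]

print(lines_to_words("Hello, world!! -- 3rd time's the charm..."))
# ['Hello', 'world', '3rd', 'times', 'the', 'charm']
```

The empty string produced by stripping "--" is filtered out, and punctuation inside a token ("time's") is simply removed rather than splitting the token.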

There are many versions of word count online; below is just one of them:

# to count the words in a file on hdfs://, file://, or a local file like "./samplefile.txt"
rdd=sc.textFile(filename)

#or you can initialize with your list
v1='Hi hi hi bye bye bye word count' 
rdd=sc.parallelize([v1])


wordcounts=rdd.flatMap(lambda l: l.split(' ')) \
        .map(lambda w:(w,1)) \
        .reduceByKey(lambda a,b:a+b) \
        .map(lambda wc: (wc[1], wc[0])) \
        .sortByKey(ascending=False)

output = wordcounts.collect()

for (count,word) in output:
    print("%s: %i" % (word,count))
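The swap-and-sort trick above (turning (word, count) into (count, word) so that sortByKey orders by count) can be sketched in plain Python. This is only a local illustration of the same transformation; note that sorted() on tuples also tie-breaks on the word, which Spark's sortByKey does not guarantee:

```python
from collections import defaultdict

v1 = 'Hi hi hi bye bye bye word count'

# flatMap + map + reduceByKey, done locally
counts = defaultdict(int)
for w in v1.split(' '):
    counts[w] += 1

# map(lambda wc: (wc[1], wc[0])) + sortByKey(ascending=False)
swapped = sorted(((c, w) for w, c in counts.items()), reverse=True)
for count, word in swapped:
    print("%s: %i" % (word, count))
```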
