简体   繁体   中英

Counting all possible word pairs using pyspark

I have a text document. I need to find the possible counts of repeating word pairs in the overall document. For example, I have the below word document. The document has two lines, each line separated by ';'. Document:

My name is Sam My name is Sam My name is Sam;
My name is Sam;

I am working on pairwords count.The expected out is:

[(('my', 'my'), 3), (('name', 'is'), 7), (('is', 'name'), 3), (('sam', 'sam'), 3), (('my', 'name'), 7), (('name', 'sam'), 7), (('is', 'my'), 3), (('sam', 'is'), 3), (('my', 'sam'), 7), (('name', 'name'), 3), (('is', 'is'), 3), (('sam', 'my'), 3), (('my', 'is'), 7), (('name', 'my'), 3), (('is', 'sam'), 7), (('sam', 'name'), 3)]

If I use:

wordPairCount = rddData.map(lambda line: line.split()).flatMap(lambda x: [((x[i], x[i + 1]), 1) for i in range(0, len(x) - 1)]).reduceByKey(lambda a,b:a + b)

I get pair-words of consecutive words and their count of re-occurences.

How can I pair each word with every other word in the line and then search for the same pair in all lines?

Can someone please have a look? Thanks

Your input string:

# spark is SparkSession object
s1 = 'The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle; The Adventure of the Blue Carbuncle;'

# Split the string on ; and I parallelize it to make an rdd
rddData = spark.sparkContext.parallelize(rdd_Data.split(";"))

rddData.collect()
# ['The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle', ' The Adventure of the Blue Carbuncle', '']

import itertools

final = (
    rddData.filter(lambda x: x != "")
        .map(lambda x: x.split(" "))
        .flatMap(lambda x: itertools.combinations(x, 2))
        .filter(lambda x: x[0] != "")
        .map(lambda x: (x, 1))
        .reduceByKey(lambda x, y: x + y).collect()
)
# [(('The', 'of'), 7), (('The', 'Blue'), 7), (('The', 'Carbuncle'), 7), (('Adventure', 'the'), 7), (('Adventure', 'Adventure'), 3), (('of', 'The'), 3), (('the', 'Adventure'), 3), (('the', 'the'), 3), (('Blue', 'The'), 3), (('Carbuncle', 'The'), 3), (('Adventure', 'The'), 3), (('of', 'the'), 7), (('of', 'Adventure'), 3), (('the', 'The'), 3), (('Blue', 'Adventure'), 3), (('Blue', 'the'), 3), (('Carbuncle', 'Adventure'), 3), (('Carbuncle', 'the'), 3), (('The', 'The'), 3), (('of', 'Blue'), 7), (('of', 'Carbuncle'), 7), (('of', 'of'), 3), (('Blue', 'Carbuncle'), 7), (('Blue', 'of'), 3), (('Blue', 'Blue'), 3), (('Carbuncle', 'of'), 3), (('Carbuncle', 'Blue'), 3), (('Carbuncle', 'Carbuncle'), 3), (('The', 'Adventure'), 7), (('The', 'the'), 7), (('Adventure', 'of'), 7), (('Adventure', 'Blue'), 7), (('Adventure', 'Carbuncle'), 7), (('the', 'Blue'), 7), (('the', 'Carbuncle'), 7), (('the', 'of'), 3)]
  1. Remove any blank spaces from the first split
  2. Split x which is a space divided string, by space
  3. Create combinations of 2 elements each using itertools.combinations ( flatMap for pair each word with every other word in the line)
  4. Map and reduce like you would do for a word-count

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM