I have an external text file with the following content:
The goal is to use a Spark RDD to get the output for the top 2 most frequent words per year:
I have been able to get RDD in the following form:
[('2003', ['the', 'old', 'men', 'didnt', 'go', 'school', 'I', 'like', 'the', 'way', 'old', 'school', 'and', 'teachers', 'work', 'another', 'title', 'goes', 'here', 'for', 'any', 'reason', 'work']), ('2004', ['text', 'and', 'strings', 'are', 'similar', 'to', 'horses', 'cowbows', 'love', 'to', 'ride', 'horses', 'and', 'write', 'text'])]
But I am not quite sure how to approach it after this.
Any help is appreciated.
You can do it in this way:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window

# create the session that the `spark` variable below refers to
spark = SparkSession.builder.getOrCreate()

data = [
    ('2003', ['the', 'old', 'men', 'didnt', 'go', 'school', 'I', 'like', 'the', 'way', 'old', 'school', 'and', 'teachers', 'work', 'another', 'title', 'goes', 'here', 'for', 'any', 'reason', 'work']),
    ('2004', ['text', 'and', 'strings', 'are', 'similar', 'to', 'horses', 'cowbows', 'love', 'to', 'ride', 'horses', 'and', 'write', 'text'])
]
df = spark.createDataFrame(data, ['year', 'word'])
df.printSchema()
df.show(3, False)
root
|-- year: string (nullable = true)
|-- word: array (nullable = true)
| |-- element: string (containsNull = true)
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
|year|word |
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
|2003|[the, old, men, didnt, go, school, I, like, the, way, old, school, and, teachers, work, another, title, goes, here, for, any, reason, work]|
|2004|[text, and, strings, are, similar, to, horses, cowbows, love, to, ride, horses, and, write, text] |
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
You can use explode, a window function and groupby to achieve your goal:
df.withColumn('word', func.explode('word'))\
.groupby('year', 'word')\
.count()\
.withColumn('rank', func.rank().over(Window.partitionBy('year').orderBy(func.desc('count'))))\
.filter(func.col('rank')<=2)\
.groupby('year')\
.agg(func.collect_list('word').alias('word'))\
.orderBy('year').show(10, False)
+----+------------------------+
|year|word |
+----+------------------------+
|2003|[old, the, school, work]|
|2004|[text, and, horses, to] |
+----+------------------------+
As only the top 2 most frequent words are needed, you can filter with .filter(func.col('rank')<=2). Note that rank() keeps ties, which is why each year above returns four words: all four have the same count.
If you need to do it with an RDD, you can build your own function and apply it with map. For example:
rdd = spark.sparkContext.parallelize(data)
rdd.collect()
def find_top_2_frequent_word(record):
    year, lst = record[0], record[1]
    r_lst = []
    word_dict = dict()
    for word in lst:
        word_dict[word] = word_dict.get(word, 0) + 1
    # collect the 1st most frequent word
    word_1 = max(word_dict, key=word_dict.get)
    word_dict.pop(word_1)
    r_lst.append(word_1)
    # collect the 2nd most frequent word
    word_2 = max(word_dict, key=word_dict.get)
    word_dict.pop(word_2)
    r_lst.append(word_2)
    return (year, r_lst)

new_rdd = rdd.map(find_top_2_frequent_word)
new_rdd.collect()
[('2003', ['the', 'old']), ('2004', ['text', 'and'])]
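The same top-2 selection can also be written with collections.Counter from the standard library; most_common(2) replaces the hand-rolled pop-the-max loop (a sketch on a toy record I made up for illustration, with ties broken by first-seen order):

```python
from collections import Counter

def top_2_words(record):
    # record is a (year, list_of_words) pair, as in the RDD above;
    # most_common(2) returns the 2 highest-count (word, count) pairs
    year, words = record
    return (year, [w for w, _ in Counter(words).most_common(2)])

# toy record for illustration: 'a' appears 3 times, 'b' twice, 'c' once
print(top_2_words(('2003', ['a', 'b', 'a', 'c', 'b', 'a'])))
# -> ('2003', ['a', 'b'])
```

This function can be passed straight to rdd.map in place of the hand-rolled loop.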
You can redesign the function if you think it is not efficient enough.