I have an external text file with the following content:
The goal is to use a Spark RDD to get the output for the top 2 most frequent words per year:
I have been able to get RDD in the following form:
[('2003', ['the', 'old', 'men', 'didnt', 'go', 'school', 'I', 'like', 'the', 'way', 'old', 'school', 'and', 'teachers', 'work', 'another', 'title', 'goes', 'here', 'for', 'any', 'reason', 'work']), ('2004', ['text', 'and', 'strings', 'are', 'similar', 'to', 'horses', 'cowbows', 'love', 'to', 'ride', 'horses', 'and', 'write', 'text'])]
But I am not quite sure how to approach it after this.
Any help is appreciated.
You can do it in this way:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window

# create the session that the `spark` variable below refers to
spark = SparkSession.builder.getOrCreate()

data = [
    ('2003', ['the', 'old', 'men', 'didnt', 'go', 'school', 'I', 'like', 'the', 'way', 'old', 'school', 'and', 'teachers', 'work', 'another', 'title', 'goes', 'here', 'for', 'any', 'reason', 'work']),
    ('2004', ['text', 'and', 'strings', 'are', 'similar', 'to', 'horses', 'cowbows', 'love', 'to', 'ride', 'horses', 'and', 'write', 'text'])
]
df = spark.createDataFrame(data, ['year', 'word'])
df.printSchema()
df.show(3, False)
root
|-- year: string (nullable = true)
|-- word: array (nullable = true)
| |-- element: string (containsNull = true)
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
|year|word |
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
|2003|[the, old, men, didnt, go, school, I, like, the, way, old, school, and, teachers, work, another, title, goes, here, for, any, reason, work]|
|2004|[text, and, strings, are, similar, to, horses, cowbows, love, to, ride, horses, and, write, text] |
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
You can use explode, a window function and groupby to achieve your goal:
df.withColumn('word', func.explode('word'))\
.groupby('year', 'word')\
.count()\
.withColumn('rank', func.rank().over(Window.partitionBy('year').orderBy(func.desc('count'))))\
.filter(func.col('rank')<=2)\
.groupby('year')\
.agg(func.collect_list('word').alias('word'))\
.orderBy('year').show(10, False)
+----+------------------------+
|year|word |
+----+------------------------+
|2003|[old, the, school, work]|
|2004|[text, and, horses, to] |
+----+------------------------+
As only the top 2 most frequent words are needed, you can filter with .filter(func.col('rank')<=2). Note that rank() keeps ties, which is why each year above returns four words: all four have the same count.
If you need to do it with an RDD, you can build your own function and apply it with map. For example:
rdd = spark.sparkContext.parallelize(data)
rdd.collect()
def find_top_2_frequent_word(record):
    year, lst = record[0], record[1]
    r_lst = []
    word_dict = dict()
    for word in lst:
        word_dict[word] = word_dict.get(word, 0) + 1
    # collect the 1st most frequent word
    word_1 = max(word_dict, key=word_dict.get)
    word_dict.pop(word_1)
    r_lst.append(word_1)
    # collect the 2nd most frequent word
    word_2 = max(word_dict, key=word_dict.get)
    word_dict.pop(word_2)
    r_lst.append(word_2)
    return (year, r_lst)

new_rdd = rdd.map(find_top_2_frequent_word)
new_rdd.collect()
[('2003', ['the', 'old']), ('2004', ['text', 'and'])]
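The same top-2 selection can also be written with collections.Counter from the standard library; most_common(2) replaces the hand-rolled pop-the-max loop (a sketch on a toy record I made up for illustration, with ties broken by first-seen order):

```python
from collections import Counter

def top_2_words(record):
    # record is a (year, list_of_words) pair, as in the RDD above;
    # most_common(2) returns the 2 highest-count (word, count) pairs
    year, words = record
    return (year, [w for w, _ in Counter(words).most_common(2)])

# toy record for illustration: 'a' appears 3 times, 'b' twice, 'c' once
print(top_2_words(('2003', ['a', 'b', 'a', 'c', 'b', 'a'])))
# -> ('2003', ['a', 'b'])
```

This function can be passed straight to rdd.map in place of the hand-rolled loop.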
You can redesign the function if you think it is not efficient enough.