
PySpark Frequent Terms

I have an external text file with the following content:

  • 20030249, old men didnt go school
  • 20030229, I like the way old school and teachers
  • 20030249, another title goes here for any reason work
  • 20040269, text and strings are similar to horses
  • 20040551, cowbows love to ride horses and write text

The goal is to use a Spark RDD to get the output of the top 2 most frequent words per year:

  • 2003 old school
  • 2004 text horses

I have been able to get an RDD in the following form:

[('2003', ['the', 'old', 'men', 'didnt', 'go', 'school', 'I', 'like', 'the', 'way', 'old', 'school', 'and', 'teachers', 'work', 'another', 'title', 'goes', 'here', 'for', 'any', 'reason', 'work']), ('2004', ['text', 'and', 'strings', 'are', 'similar', 'to', 'horses', 'cowbows', 'love', 'to', 'ride', 'horses', 'and', 'write', 'text'])]
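
For reference, an RDD in that shape can be built roughly along these lines (a sketch; 'titles.txt' is a placeholder for the actual file name):

lines = spark.sparkContext.textFile('titles.txt')
rdd = (lines
    .map(lambda line: line.split(', ', 1))           # split the id from the title text
    .map(lambda kv: (kv[0][:4], kv[1].split(' ')))   # (year prefix, word list)
    .reduceByKey(lambda a, b: a + b))                # merge the word lists per year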

But I am not quite sure how to approach it after this.

Any help is appreciated.

You can do it this way:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window

# create the session first if one does not already exist (e.g. outside the pyspark shell)
spark = SparkSession.builder.getOrCreate()

data = [
    ('2003', ['the', 'old', 'men', 'didnt', 'go', 'school', 'I', 'like', 'the', 'way', 'old', 'school', 'and', 'teachers', 'work', 'another', 'title', 'goes', 'here', 'for', 'any', 'reason', 'work']), 
    ('2004', ['text', 'and', 'strings', 'are', 'similar', 'to', 'horses', 'cowbows', 'love', 'to', 'ride', 'horses', 'and', 'write', 'text' ])
]

df = spark.createDataFrame(data, ['year','word'])
df.printSchema()
df.show(3, False)
root
 |-- year: string (nullable = true)
 |-- word: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+-------------------------------------------------------------------------------------------------------------------------------------------+
|year|word                                                                                                                                       |
+----+-------------------------------------------------------------------------------------------------------------------------------------------+
|2003|[the, old, men, didnt, go, school, I, like, the, way, old, school, and, teachers, work, another, title, goes, here, for, any, reason, work]|
|2004|[text, and, strings, are, similar, to, horses, cowbows, love, to, ride, horses, and, write, text]                                          |
+----+-------------------------------------------------------------------------------------------------------------------------------------------+

You can use explode, a window function, and groupby to achieve your goal:

(df.withColumn('word', func.explode('word'))    # one row per (year, word) pair
    .groupby('year', 'word')
    .count()                                    # word frequency within each year
    .withColumn('rank', func.rank().over(Window.partitionBy('year').orderBy(func.desc('count'))))
    .filter(func.col('rank') <= 2)              # keep only the top 2 ranks per year
    .groupby('year')
    .agg(func.collect_list('word').alias('word'))   # gather the winners back into a list
    .orderBy('year')
    .show(10, False))
+----+------------------------+
|year|word                    |
+----+------------------------+
|2003|[old, the, school, work]|
|2004|[text, and, horses, to] |
+----+------------------------+

Since you only need the top 2 most frequent words, the .filter(func.col('rank') <= 2) keeps the highest-ranked words per year. Note that rank() assigns the same rank to tied counts, which is why four words appear for each year above (in 2003, for example, the, old, school, and work each occur twice).
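
If you want exactly two words per year regardless of ties, one option is to swap rank() for row_number() (a sketch; the secondary sort on the word itself is an arbitrary tie-breaker added so the result is deterministic):

(df.withColumn('word', func.explode('word'))
    .groupby('year', 'word')
    .count()
    .withColumn('rn', func.row_number().over(
        Window.partitionBy('year').orderBy(func.desc('count'), 'word')))  # break ties alphabetically
    .filter(func.col('rn') <= 2)
    .groupby('year')
    .agg(func.collect_list('word').alias('word'))
    .orderBy('year')
    .show(10, False))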


If you need to do it with an RDD, you can build your own function and apply it with map. For example:

rdd = spark.sparkContext.parallelize(data)
rdd.collect()

def find_top_2_frequent_word(record):
    year, words = record[0], record[1]
    top_words = []
    word_dict = dict()
    # count each word's occurrences within the year
    for word in words:
        word_dict[word] = word_dict.get(word, 0) + 1
    # collect the 1st most frequent word
    word_1 = max(word_dict, key=word_dict.get)
    word_dict.pop(word_1)
    top_words.append(word_1)
    # collect the 2nd most frequent word
    word_2 = max(word_dict, key=word_dict.get)
    word_dict.pop(word_2)
    top_words.append(word_2)

    return (year, top_words)

new_rdd = rdd.map(find_top_2_frequent_word)
new_rdd.collect()
[('2003', ['the', 'old']), ('2004', ['text', 'and'])]
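
The same per-record logic can be written more compactly with collections.Counter (a sketch; most_common(2) performs the two max-extractions in one call):

from collections import Counter

# keep the two highest-count words of each (year, words) record
rdd.mapValues(lambda words: [w for w, _ in Counter(words).most_common(2)]).collect()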

You can redesign the function if you find it is not efficient enough.
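
For example, if the per-year word lists grow large, one possible redesign lets Spark do the counting instead of building a Python dict per record (a sketch; the name top2 is mine):

from operator import add

top2 = (rdd
    .flatMap(lambda kv: (((kv[0], w), 1) for w in kv[1]))  # ((year, word), 1)
    .reduceByKey(add)                                      # count each (year, word)
    .map(lambda kv: (kv[0][0], (kv[1], kv[0][1])))         # year -> (count, word)
    .groupByKey()
    .mapValues(lambda cw: [w for _, w in sorted(cw, reverse=True)[:2]]))  # top 2 by count
top2.collect()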
