
Spark Streaming job performance improvement

There is a Spark Streaming job that runs continuously, counting words in the stream; only words from a given vocabulary should be counted and returned.

However, this vocabulary is not fixed; instead it is stored in redis and can change over time. Here is a naive implementation of this job:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import redis

sc = SparkContext(appName="WordCount")
ssc = StreamingContext(sc, 10)  # batch interval is 10s

redis_client = redis.StrictRedis(host=redis_host, port=6379, decode_responses=True)

def check_if_in_vocab(word):
    vocab = redis_client.smembers("vocab")  # fetch the whole vocabulary from redis ("vocab" is a placeholder key)
    return word in vocab

lines = ssc.socketTextStream(host_ip, port)  # read the data stream from the socket
words = (lines.flatMap(lambda line: line.split(" "))
              .filter(check_if_in_vocab)  # ANY BETTER SOLUTION HERE???
              .map(lambda word: (word, 1)))  # create (word, count) pairs
counts = words.reduceByKey(lambda x, y: x + y)
counts.pprint()

ssc.start()
ssc.awaitTermination()

I think my implementation performs poorly, because the filter(check_if_in_vocab) transformation pulls the vocabulary from redis for every element in the stream, which is far too time-consuming.

Any better solution?

FOLLOWUP

OK, in the above problem the vocabulary might change at any time, so I need to check redis very often. Now suppose the vocabulary only changes every 60 seconds or every hour: would it be simpler to improve the above code?
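
If the vocabulary really does change that rarely, one option (not from the original post) is to cache it in the worker process and re-fetch it only after a chosen interval. A minimal sketch, assuming a 60-second refresh interval and the same placeholder redis key and connection settings as above:

import time
import redis

REFRESH_SECS = 60  # assumed vocabulary change interval
_cache = {"vocab": None, "fetched_at": 0.0}

def check_if_in_vocab(word):
    # Re-fetch the vocabulary from redis only when the cached copy is stale
    now = time.time()
    if _cache["vocab"] is None or now - _cache["fetched_at"] > REFRESH_SECS:
        client = redis.StrictRedis(host=redis_host, port=6379, decode_responses=True)
        _cache["vocab"] = client.smembers("vocab")  # placeholder key
        _cache["fetched_at"] = now
    return word in _cache["vocab"]

Each executor process keeps its own copy, so redis is hit at most once per interval per worker instead of once per word.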

I have never worked with Spark in Python, but if this were a Scala implementation, I would look for improvements by making my Redis call inside a mapPartitions call. Inside mapPartitions, I would establish a connection with the client, fetch the vocab, use that in-memory vocab to filter the iterable, and then close the connection.

Perhaps you can do something analogous with the Python API.
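
For illustration, a rough PySpark sketch of that approach (untested; the redis key "vocab", redis_host, and port are placeholders, as in the question):

import redis

def filter_partition(words):
    # One connection and one vocabulary fetch per partition, not per element
    client = redis.StrictRedis(host=redis_host, port=6379, decode_responses=True)
    vocab = client.smembers("vocab")  # placeholder key
    filtered = [word for word in words if word in vocab]
    client.close()
    return filtered

words = (lines.flatMap(lambda line: line.split(" "))
              .mapPartitions(filter_partition)
              .map(lambda word: (word, 1)))

This amortizes the redis round trip over all the words in a partition instead of paying it once per word.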
