How can I make Apache Spark workers import libraries before running map
I want to build a per-country traffic report from an nginx access.log file. This is my code snippet using Apache Spark on Python:
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")

    def get_country_from_line(line):
        try:
            from geoip import geolite2
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            if match is not None:
                return match.country
            else:
                return "Unknown"
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()
    print ips
    sc.stop()
On a 6 GB log file it took an hour to complete the task (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line, it has to import geolite2, which presumably has to load some database. Is there any way for me to import geolite2 once on each worker instead of on each line? Would that boost performance? Any suggestions to improve the code?
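One pattern that addresses exactly this (a sketch, not tested against the author's Spark setup): use `mapPartitions` instead of `map`, so the import and any database load run once per partition rather than once per line. The stand-in `lookup` dict below is hypothetical, used only so the sketch runs without Spark or the `geoip` package; in the real job that line would be `from geoip import geolite2` followed by `geolite2.lookup(ip)` inside the loop.

```python
# Sketch of the mapPartitions pattern: the function receives an iterator
# over all lines in one partition, so per-partition setup happens once.

def countries_of_partition(lines):
    # One-time, per-partition setup. In the real job this is where
    # `from geoip import geolite2` would go, paying the database-load
    # cost once per partition instead of once per line.
    lookup = {"1.2.3.4": "US", "5.6.7.8": "VN"}  # hypothetical stand-in data

    for line in lines:
        try:
            ip = line.split(' ')[0]
            yield lookup.get(ip, "Unknown")  # real job: geolite2.lookup(ip)
        except IndexError:
            yield "Error"

# With Spark, this would replace the per-line map, e.g.:
#   rdd = sc.textFile("/Users/victor/access.log") \
#           .mapPartitions(countries_of_partition)
#   ips = rdd.countByValue()

if __name__ == "__main__":
    sample = ['1.2.3.4 - - "GET / HTTP/1.1"', '9.9.9.9 - -']
    print(list(countries_of_partition(sample)))  # ['US', 'Unknown']
```

The function must return an iterator (here a generator), which is the contract `mapPartitions` expects; it also keeps memory use flat because lines are processed lazily rather than materialized per partition.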