
How can I make Apache Spark workers import libraries before running map

I wanted to make a traffic report by country from an nginx access.log file. This is my code snippet using Apache Spark on Python:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")

    def get_country_from_line(line):
        try:
            # imported inside the map function so it is available on the
            # workers, not just on the driver
            from geoip import geolite2
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            if match is not None:
                return match.country
            else:
                return "Unknown"
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()

    print(ips)
    sc.stop()

On a 6GB log file, it took an hour to complete the task (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line, it has to import geolite2, which I think has to load some database. Is there any way for me to import geolite2 once per worker instead of once per line? Would it boost the performance? Any suggestions to improve that code?
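
A common pattern for doing the import once per partition rather than once per line is mapPartitions. The following is only a minimal sketch of that idea, reusing the same geolite2 lookup as the code above; it is not a confirmed fix for the slowdown:

from pyspark import SparkContext

def countries_for_partition(lines):
    # the import runs once per partition, not once per line
    from geoip import geolite2
    for line in lines:
        try:
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            yield match.country if match is not None else "Unknown"
        except IndexError:
            yield "Error"

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")
    rdd = sc.textFile("/Users/victor/access.log").mapPartitions(countries_for_partition)
    print(rdd.countByValue())
    sc.stop()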

What about using broadcast variables? Here is the doc which explains how they work. They are simply read-only variables that are shipped to each worker node once and can then be accessed whenever necessary.
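
A minimal sketch of how a broadcast variable is created on the driver and read on the workers in PySpark. The ip_to_country lookup table below is a hypothetical, picklable value used purely for illustration; it is not the geolite2 module from the question:

from pyspark import SparkContext

sc = SparkContext(appName="BroadcastSketch")

# hypothetical pre-resolved lookup table; any picklable, read-only value works
ip_to_country = {"8.8.8.8": "US", "1.1.1.1": "AU"}
ip_bc = sc.broadcast(ip_to_country)

def country_for_line(line):
    ip = line.split(' ')[0]
    # .value is read on the worker; the dict is shipped to each worker once,
    # not once per task or per line
    return ip_bc.value.get(ip, "Unknown")

counts = sc.textFile("/Users/victor/access.log").map(country_for_line).countByValue()
print(counts)
sc.stop()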
