
How can I make Apache Spark workers import libraries before running map

I wanted to make a traffic report by country from an nginx access.log file. This is my code snippet using Apache Spark on Python:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")

    def get_country_from_line(line):
        try:
            # imported inside the map function so it is available on the
            # workers, not just on the driver
            from geoip import geolite2
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            if match is not None:
                return match.country
            else:
                return "Unknown"
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()

    print(ips)
    sc.stop()

On a 6GB log file, it took an hour to complete the task (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line, it has to import geolite2, which I think has to load some database. Is there any way for me to import geolite2 once per worker instead of once per line? Would it boost the performance? Any suggestions to improve that code?
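
A common pattern for doing the import once per partition rather than once per line is mapPartitions. The following is only a minimal sketch of that idea, reusing the same geolite2 lookup as the code above; it is not a confirmed fix for the slowdown:

from pyspark import SparkContext

def countries_for_partition(lines):
    # the import runs once per partition, not once per line
    from geoip import geolite2
    for line in lines:
        try:
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            yield match.country if match is not None else "Unknown"
        except IndexError:
            yield "Error"

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")
    rdd = sc.textFile("/Users/victor/access.log").mapPartitions(countries_for_partition)
    print(rdd.countByValue())
    sc.stop()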

What about using broadcast variables? Here is the doc which explains how they work. They are simply read-only variables that are shipped to each worker node once and can then be accessed whenever necessary.
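
A minimal sketch of how a broadcast variable is created on the driver and read on the workers in PySpark. The ip_to_country lookup table below is a hypothetical, picklable value used purely for illustration; it is not the geolite2 module from the question:

from pyspark import SparkContext

sc = SparkContext(appName="BroadcastSketch")

# hypothetical pre-resolved lookup table; any picklable, read-only value works
ip_to_country = {"8.8.8.8": "US", "1.1.1.1": "AU"}
ip_bc = sc.broadcast(ip_to_country)

def country_for_line(line):
    ip = line.split(' ')[0]
    # .value is read on the worker; the dict is shipped to each worker once,
    # not once per task or per line
    return ip_bc.value.get(ip, "Unknown")

counts = sc.textFile("/Users/victor/access.log").map(country_for_line).countByValue()
print(counts)
sc.stop()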
