Spark can't find Python module

I'm trying to run the following Python script locally with the spark-submit command:

import sys
sys.path.insert(0, '.')
from pyspark import SparkContext, SparkConf
from commons.Utils import Utils

def splitComma(line):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf = conf)

    airports = sc.textFile("in/airports.text")
    airportsInUSA = airports\
    .filter(lambda line : Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airportsNameAndCityNames = airportsInUSA.map(splitComma)
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")

The command used (while inside the project directory):

spark-submit rdd/AirportsInUsaSolution.py

I keep getting this error:

Traceback (most recent call last):
  File "/home/gustavo/Documentos/TCC/python_spark_yt/python-spark-tutorial/rdd/AirportsInUsaSolution.py", line 4, in <module>
    from commons.Utils import Utils
ImportError: No module named commons.Utils

Even though the commons.Utils module exists and contains a Utils class.

It seems that the only imports it accepts are the ones from Spark, because this error persists when I try to import any other class or file from my project.

from pyspark import SparkContext, SparkConf

def splitComma(line):
    splits = Utils.COMMA_DELIMITER.split(line)
    return "{}, {}".format(splits[1], splits[2])

if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[2]")
    sc = SparkContext(conf = conf)

    # Ship the zipped commons package to the executors before importing from it
    sc.addPyFile('.../path/to/commons.zip')
    from commons.Utils import Utils

    airports = sc.textFile("in/airports.text")
    airportsInUSA = airports\
    .filter(lambda line : Utils.COMMA_DELIMITER.split(line)[3] == "\"United States\"")

    airportsNameAndCityNames = airportsInUSA.map(splitComma)
    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")

Yes, it only accepts the ones from Spark. You can zip the required files (Utils, numpy, etc.) and pass the archive with the --py-files parameter of spark-submit.

spark-submit  --py-files rdd/file.zip rdd/AirportsInUsaSolution.py 
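
If you prefer to build commons.zip from Python rather than with a zip utility, a minimal sketch (assuming the commons package sits in the current project directory) is:

import shutil

# Create commons.zip in the current directory, containing the commons/ package,
# so it can be passed to --py-files or to sc.addPyFile.
shutil.make_archive("commons", "zip", ".", "commons")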

For Python to treat a directory as a package, you need to create an __init__.py file in that directory. The __init__.py file doesn't need to contain anything.

In this case, once you create __init__.py in the commons directory, you will be able to import that package.
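
Assuming the layout implied by the question, the project would then look like this:

commons/
    __init__.py
    Utils.py
rdd/
    AirportsInUsaSolution.py
in/
    airports.text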

Create a Python script named Utils.py containing:

import re

class Utils():

    # Matches commas that are outside double-quoted fields,
    # so quoted values containing commas are not split apart
    COMMA_DELIMITER = re.compile(''',(?=(?:[^"]*"[^"]*")*[^"]*$)''')

Put this Utils.py script in a commons folder, and put that folder in your working directory (type pwd to find it). You can then import the Utils class:

from commons.Utils import Utils
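
As a quick sanity check (the sample line below is made up for illustration), the regex splits only on commas that are outside quoted fields:

from commons.Utils import Utils

line = '"Goroka Airport","Goroka, Papua New Guinea"'
# The comma inside the quoted second field is not treated as a delimiter
print(Utils.COMMA_DELIMITER.split(line))
# ['"Goroka Airport"', '"Goroka, Papua New Guinea"']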

Hope this helps.
