A Python package with a static file dependency fails to read the static file when used within PySpark

I am trying to resolve an issue with a Python package used in PySpark. I developed a Python package with the following structure.

sample_package/
  |-config/
       |-sample.ini
  |-main.py
  |-__init__.py

Inside my main.py, I have a code snippet that reads the config file from the config/ directory as follows:

import ConfigParser, os
def sample_func():
    config = ConfigParser.ConfigParser()
    configfile = os.path.join(os.path.dirname(__file__), 'config', 'sample.ini')
    config.read(configfile)
    return config.sections()

I created a zip file of the above package, sample_package.zip, and included the zip as a PySpark dependency:

# sc is assumed to be the job's SparkContext
sc.addPyFile("path/to/zip/file")

In my PySpark job, importing sample_package works fine and I am able to call sample_func inside main, but the package is unable to read the sample.ini file. When executed inside a plain Python program it works fine, but not inside a PySpark job. Is there any path manipulation being done in a PySpark environment when accessing static files? How can I get my Python package to properly read the config file?

I figured out the answer myself. It is more of a Python packaging issue than a PySpark environment issue. Because the package is imported from a zip archive, the path built from __file__ points inside the archive and is not a real file on disk. It looks like I had to use the pkgutil module to reference my static resources, which changes my function as below:

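# Python 2 version
import ConfigParser, pkgutil, StringIO
def sample_func():
    config = ConfigParser.ConfigParser()
    # Ask the package's loader for the file contents instead of building a path
    configfile = pkgutil.get_data('sample_package', 'config/sample.ini')
    # Wrap the contents in a file-like object so ConfigParser can parse them
    cf_buf = StringIO.StringIO(configfile)
    config.readfp(cf_buf)
    return config.sections()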

A simpler version (Python 3):

from configparser import ConfigParser
import pkgutil

def sample_func():
    config = ConfigParser()
    # os.path.join is not needed.
    config_data = pkgutil.get_data(__name__, 'config/sample.ini').decode()
    config.read_string(config_data)
    return config.sections()
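
The difference between the two approaches can be reproduced without Spark, since a zip passed to addPyFile is imported through Python's zip import machinery. The following is only a sketch under stated assumptions: demo_package, its section_a section, and the helper names build_zip and run_demo are hypothetical and not part of the original package. It builds a small zipped package, puts the zip on sys.path, and calls both variants.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Source of a hypothetical package module with one function per approach.
MAIN_SOURCE = '''
import configparser, os, pkgutil

def read_with_path():
    # Filesystem approach: the joined path points inside the zip archive.
    config = configparser.ConfigParser()
    path = os.path.join(os.path.dirname(__file__), "config", "sample.ini")
    config.read(path)  # read() silently skips files it cannot open
    return config.sections()

def read_with_pkgutil():
    # Loader approach: the zip importer returns the file contents as bytes.
    config = configparser.ConfigParser()
    data = pkgutil.get_data("demo_package", "config/sample.ini")
    config.read_string(data.decode())
    return config.sections()
'''

def build_zip(zip_path):
    # Write the package, its module, and its config file into one archive.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("demo_package/__init__.py", "")
        zf.writestr("demo_package/main.py", MAIN_SOURCE)
        zf.writestr("demo_package/config/sample.ini",
                    "[section_a]\nkey = value\n")

def run_demo():
    # Import the package straight from the zip and call both variants.
    zip_path = os.path.join(tempfile.mkdtemp(), "demo_package.zip")
    build_zip(zip_path)
    sys.path.insert(0, zip_path)
    try:
        main = importlib.import_module("demo_package.main")
        return main.read_with_path(), main.read_with_pkgutil()
    finally:
        sys.path.remove(zip_path)

if __name__ == "__main__":
    print(run_demo())
```

Here the path-based variant returns an empty list because ConfigParser.read() ignores files it cannot open, while the pkgutil variant returns the section defined in the archived sample.ini. That silent skip is also why the original function returned nothing rather than raising an error.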
