
PySpark: How do I install a Linux command-line tool on workers?

I am trying to use the Linux command-line tool Poppler to extract information from PDF files. I want to do this for a large number of PDFs across several Spark workers. I need to use Poppler, not PyPDF or anything similar.

Does anybody know how to install Poppler on the workers? I know that I can make command-line calls from within Python and fetch the output (or fetch the file generated by Poppler), but how do I install it on each worker? I'm using Spark 1.3.1 (Databricks).

Thank you!

The proper way is to install it on all your workers when you initially set them up, just as you would install any other Linux application. As you already pointed out, you can then shell out from within Python.
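For illustration, shelling out from Python could look roughly like the sketch below. It assumes Poppler's `pdftotext` utility is already installed and on the `PATH` of every worker; the helper names `run_tool` and `pdf_to_text` are made up for this example. `subprocess.check_output` is used because it is available in both Python 2.7 and Python 3.

```python
import subprocess


def run_tool(args):
    """Run a command-line tool and return its standard output as text.

    Raises subprocess.CalledProcessError if the tool exits with a
    non-zero status.
    """
    output = subprocess.check_output(args)
    return output.decode("utf-8", "replace")


def pdf_to_text(pdf_path):
    """Extract the text of one PDF with Poppler's pdftotext.

    Assumption: pdftotext is installed on the machine running this code.
    Passing "-" as the output file makes pdftotext write to stdout.
    """
    return run_tool(["pdftotext", pdf_path, "-"])
```

On the cluster, a function like `pdf_to_text` could then be mapped over an RDD of PDF paths, provided those paths are readable from every worker.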

If that is not an option for whatever reason, you can ship files to all workers using the addFile method: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.addFile

Note that the latter approach does not take care of dependencies (libraries etc.).
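A rough sketch of the addFile route, under several assumptions: the shipped file is a single executable that runs on the workers without extra shared libraries (exactly the limitation noted above), `sc` is an existing SparkContext, and the helper name `run_shipped_tool`, its `locate` parameter, and the paths in the comments are hypothetical. `SparkFiles.get` is the PySpark call that returns the local path of a file distributed with `addFile`.

```python
import os
import stat
import subprocess


def run_shipped_tool(tool_name, args, locate=None):
    """Run an executable that was distributed with SparkContext.addFile.

    `locate` maps a file name to its local path on the worker. It defaults
    to pyspark's SparkFiles.get and is a parameter only so the helper can
    be tried out without a cluster.
    """
    if locate is None:
        # Imported lazily: only needed when running inside a Spark task.
        from pyspark import SparkFiles
        locate = SparkFiles.get
    path = locate(tool_name)
    # Make sure the owner's executable bit is set on the local copy.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return subprocess.check_output([path] + list(args))


# Driver-side usage (sketch only; paths are placeholders):
#   sc.addFile("/path/to/pdftotext")
#   texts = sc.parallelize(pdf_paths).map(
#       lambda p: run_shipped_tool("pdftotext", [p, "-"]))
```

If the binary depends on shared libraries that are missing on the workers, the call fails at run time, so installing the package on each worker remains the more robust option.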

