PySpark: How do I install a linux command-line tool on workers?
I am trying to use the Linux command-line tool 'Poppler' to extract information from PDF files. I want to do this for a huge number of PDFs across several Spark workers. I need to use Poppler, not PyPDF or anything similar.
Does anybody know how to install Poppler on the workers? I know that I can make command-line calls from within Python and fetch the output (or fetch the file generated by the Poppler tool), but how do I install it on each worker? I'm using Spark 1.3.1 (Databricks).
Thank you!
The proper way is to install it on all your workers when you initially set them up, just as you would install any other Linux application. As you already pointed out, you can then shell out from within Python.
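A minimal sketch of shelling out to Poppler's `pdftotext` from within a task (assumes `poppler-utils` is already installed on every worker; the function names and the `sc` variable are illustrative, not part of the question):

```python
import subprocess

def pdftotext_cmd(pdf_path):
    # Build the pdftotext invocation; "-" writes the extracted text to stdout.
    return ["pdftotext", pdf_path, "-"]

def extract_text(pdf_path):
    # Shell out to Poppler's pdftotext (must be installed on the worker)
    # and capture the extracted text as a string.
    raw = subprocess.check_output(pdftotext_cmd(pdf_path))
    return raw.decode("utf-8", errors="replace")

# On the driver, map the extraction over an RDD of PDF paths so it runs
# on the workers, e.g.:
# texts = sc.parallelize(pdf_paths).map(extract_text).collect()
```

Because the call happens inside `map`, each worker runs its own local copy of `pdftotext`, which is why the tool must be installed (or shipped) on every node.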
If that is not an option for whatever reason, then you can ship files to all workers using the addFile method: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.addFile
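A hedged sketch of the `addFile` approach (assumes you have a statically linked `pdftotext` binary to ship, and that `sc` is an existing SparkContext; the helper function is illustrative):

```python
def poppler_cmd(binary_path, pdf_path):
    # Build the command using the shipped binary's worker-local path;
    # "-" sends the extracted text to stdout.
    return [binary_path, pdf_path, "-"]

# On the driver, ship the binary to every worker:
# sc.addFile("/path/to/pdftotext")
#
# Inside a task, resolve the worker-local copy and call it:
# import subprocess
# from pyspark import SparkFiles
# binary = SparkFiles.get("pdftotext")
# text = subprocess.check_output(poppler_cmd(binary, pdf_path))
```

Note that the shipped binary may need its execute bit restored on the worker, and this only works if the binary has no unmet shared-library dependencies there.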
Note that the latter approach does not take care of dependencies (libraries etc.).