Geos, Shapely and Hadoop Streaming
I'm trying to run a Hadoop streaming job to process geospatial data. To that end, I'm using Shapely functions which require libgeos.
However, the job fails because libgeos is not installed on the cluster.
Is there a way to ship libgeos to the cluster and have Shapely read the .so files from that directory (maybe with -archives or -files)?
Example of the command I run:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.partitioner.options=-k1,1 -archives hdfs://namenode:port/user/anaconda.zip#anaconda -files /some/other/stuff -input /path/to/input -output /user/geo_stuff -file /home/mr_files/mapper.py -mapper "mapper.py"
Where mapper.py starts off like...
#!./anaconda/anaconda/bin/python
import shapely
from cartopy.io import shapereader
from shapely.geometry import Point
...more stuff
And this generates the following error:
from shapely.geos import lgeos
File "./anaconda/anaconda/lib/python2.7/site-packages/shapely/geos.py", line 58, in <module>
_lgeos = load_dll('geos_c', fallbacks=['libgeos_c.so.1', 'libgeos_c.so'])
File "./anaconda/anaconda/lib/python2.7/site-packages/shapely/geos.py", line 54, in load_dll
libname, fallbacks or []))
OSError: Could not find library geos_c or load any of its variants ['libgeos_c.so.1', 'libgeos_c.so']
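To check a node without going through Shapely, the lookup in the traceback can be approximated with plain ctypes. This is only a rough sketch of that kind of search, not Shapely's actual code; the helper name first_loadable is made up for illustration.

```python
import ctypes
from ctypes.util import find_library


def first_loadable(libname, fallbacks=()):
    """Return the first library name that ctypes manages to load, or None."""
    candidates = []
    found = find_library(libname)  # asks the system's library lookup tools
    if found is not None:
        candidates.append(found)
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            ctypes.CDLL(name)
            return name
        except OSError:
            continue  # not loadable from the default search path, try the next
    return None


if __name__ == "__main__":
    # Prints None on a node where the GEOS C library cannot be loaded.
    print(first_loadable("geos_c", ["libgeos_c.so.1", "libgeos_c.so"]))
```

Running this as a trivial streaming mapper, or over ssh on one worker, should show whether the library is visible on that machine.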
If you want to copy your files from your master node to all the core nodes on a Hadoop cluster, you can do it by running this on your master node (Key.pem is the secret key you used to ssh into your master node; you'll have to copy it onto your master node before you run this):
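#!/bin/bash
# Collect the hostname of every datanode reported by HDFS.
nodes=($(hadoop dfsadmin -report | grep Hostname | sed 's/Hostname: //'))
# Loop over the array elements; a bare `nodes` would loop once over the literal word.
for workerip in "${nodes[@]}"
do
    scp -i Key.pem -o UserKnownHostsFile=/dev/null \
        -o StrictHostKeyChecking=no \
        /usr/local/lib/libgeos_c* "$workerip":/usr/local/lib/
done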
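The same loop can be sketched in Python, which makes the hostname parsing easy to test on its own. This assumes the report contains lines of the form "Hostname: <name>", as the grep above implies; the function names parse_hostnames and build_scp_command are illustrative only.

```python
import glob
import re
import subprocess


def parse_hostnames(report_text):
    """Extract the values of 'Hostname: ...' lines from a dfsadmin report."""
    return re.findall(r"^Hostname:\s*(\S+)", report_text, flags=re.MULTILINE)


def build_scp_command(host, files, key_file="Key.pem",
                      dest_dir="/usr/local/lib/"):
    """Build the scp argument list that copies `files` to one worker."""
    return (["scp", "-i", key_file,
             "-o", "UserKnownHostsFile=/dev/null",
             "-o", "StrictHostKeyChecking=no"]
            + list(files)
            + ["{}:{}".format(host, dest_dir)])


if __name__ == "__main__":
    report = subprocess.check_output(["hadoop", "dfsadmin", "-report"])
    # No shell is involved here, so the wildcard is expanded with glob.
    libs = glob.glob("/usr/local/lib/libgeos_c*")
    for host in parse_hostnames(report.decode("utf-8")):
        subprocess.check_call(build_scp_command(host, libs))
```

As with the shell version, this has to run on the master node with Key.pem in the working directory.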
If you have a libgeos_c.so shared library for the GEOS C API in a non-standard location, you can set an environment variable to use that file:
export GEOS_LIBRARY_PATH=/path/to/libgeos_c.so.1
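In a streaming job there may be no convenient place to run export, so one option is to set the variable from inside mapper.py before Shapely is imported. The sketch below assumes the library was shipped inside the unpacked archive; the directory ./anaconda/anaconda/lib and the helper names are guesses for illustration, and whether the installed Shapely version actually consults GEOS_LIBRARY_PATH should be verified.

```python
import glob
import os


def find_geos_library(search_dir):
    """Return the absolute path of a libgeos_c shared library, or None."""
    matches = sorted(glob.glob(os.path.join(search_dir, "libgeos_c.so*")))
    return os.path.abspath(matches[0]) if matches else None


def configure_geos(search_dir, environ=os.environ):
    """Point GEOS_LIBRARY_PATH at a library found under search_dir."""
    path = find_geos_library(search_dir)
    if path is not None:
        environ["GEOS_LIBRARY_PATH"] = path
    return path


# In mapper.py this would have to run before `import shapely`, for example:
# configure_geos("./anaconda/anaconda/lib")  # hypothetical archive layout
```

If configure_geos returns None, nothing matching libgeos_c.so* was found in that directory and the environment is left untouched.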
However, you may need to ensure that its dependencies are met. E.g. see:
ldd /path/to/libgeos_c.so.1
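When checking many nodes, the ldd output can be scanned for unresolved dependencies automatically. This is a minimal sketch that relies on the usual "name => not found" lines printed by ldd; missing_dependencies is an illustrative name, and the path is the same placeholder as above.

```python
import subprocess


def missing_dependencies(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


if __name__ == "__main__":
    out = subprocess.check_output(["ldd", "/path/to/libgeos_c.so.1"])
    for name in missing_dependencies(out.decode("utf-8")):
        print("missing: " + name)
```

Any name it prints is a library that would also have to be shipped or installed alongside libgeos_c.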
See the source for libgeos.py to see how environment variables are used to find the GEOS C API shared libraries.