
Geos, Shapely and Hadoop Streaming

I'm trying to run a Hadoop streaming job to process geospatial data. To that end, I'm using Shapely functions, which require libgeos.

However, the job fails because libgeos is not installed on the cluster.

Is there a way to ship libgeos to the cluster and have Shapely read the .so files from that directory (maybe with -archives or -files)?

Example of commands run

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -archives hdfs://namenode:port/user/anaconda.zip#anaconda \
    -files /some/other/stuff \
    -input /path/to/input \
    -output /user/geo_stuff \
    -file /home/mr_files/mapper.py \
    -mapper "mapper.py"

Where mapper.py starts off like...

import shapely
from cartopy.io import shapereader
from shapely.geometry import Point
...more stuff

And this generates the following error

from shapely.geos import lgeos
File "./anaconda/anaconda/lib/python2.7/site-packages/shapely/geos.py", line 58, in <module>
_lgeos = load_dll('geos_c', fallbacks=['libgeos_c.so.1', 'libgeos_c.so'])

File "./anaconda/anaconda/lib/python2.7/site-packages/shapely/geos.py", line 54, in load_dll
libname, fallbacks or []))

OSError: Could not find library geos_c or load any of its variants ['libgeos_c.so.1', 'libgeos_c.so']

If you want to copy files from the master node to all the core nodes of a Hadoop cluster, you can run the following on the master node. Key.pem is the private key you use to ssh into the master node; you'll have to copy it onto the master node before you run this:

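# Collect the worker hostnames from the HDFS report.
nodes=(`hadoop dfsadmin -report | grep Hostname | sed 's/Hostname: //'`)
# Copy the GEOS C API library to the same location on every worker.
for workerip in "${nodes[@]}"; do
    scp -i Key.pem -o UserKnownHostsFile=/dev/null \
        -o StrictHostKeyChecking=no \
        /usr/local/lib/libgeos_c* "$workerip":/usr/local/lib/
done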
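
For anyone who would rather drive the copy from Python, the hostname extraction and the scp invocation in the loop above can be sketched as follows. This is only a sketch: the function names parse_hostnames and build_scp_command are made up for illustration, and it assumes the report contains lines of the form "Hostname: <name>", which is what the grep/sed pipeline above relies on.

```python
import re


def parse_hostnames(report_text):
    """Return the value of every 'Hostname: ...' line in a dfsadmin report."""
    hosts = []
    for line in report_text.splitlines():
        match = re.match(r"\s*Hostname:\s*(\S+)", line)
        if match:
            hosts.append(match.group(1))
    return hosts


def build_scp_command(host, sources, key="Key.pem", dest="/usr/local/lib/"):
    """Build the scp argument list for one worker (hypothetical helper).

    `sources` is a list of local files, e.g. the result of
    glob.glob("/usr/local/lib/libgeos_c*"), because no shell is
    involved to expand the wildcard.
    """
    return (
        ["scp", "-i", key,
         "-o", "UserKnownHostsFile=/dev/null",
         "-o", "StrictHostKeyChecking=no"]
        + list(sources)
        + ["{}:{}".format(host, dest)]
    )
```

Each command list can then be passed to subprocess.check_call, one worker at a time.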

If you have a libgeos_c.so shared library for the C API for GEOS in a non-standard location, you can set an environment variable to use that file:

export GEOS_LIBRARY_PATH=/path/to/libgeos_c.so.1

However, you may need to ensure that its dependencies are met, e.g. with:

ldd /path/to/libgeos_c.so.1
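
The same check can be scripted, which is handy if it has to run on every worker. The sketch below looks for the "=> not found" marker that ldd prints for unresolved libraries; the function names are illustrative only. Note that libgeos_c normally depends on the C++ libgeos library, so copying only libgeos_c* may leave that dependency unresolved.

```python
import subprocess


def missing_dependencies(ldd_output):
    """Return the names of libraries that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is everything before the '=>' arrow.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_library(path):
    """Run ldd on `path` and return its unresolved dependencies."""
    output = subprocess.check_output(["ldd", path])
    return missing_dependencies(output.decode("utf-8", "replace"))
```

An empty list from check_library("/path/to/libgeos_c.so.1") would suggest that the dynamic loader can resolve everything the library needs on that machine.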

See the source for libgeos.py to see how environment variables are used to find the GEOS C API shared libraries.
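
In a streaming job the variable has to be visible inside each task, so one option is to set it at the top of mapper.py before Shapely is imported. The following is a rough sketch under assumptions: find_geos_library and export_geos_path are hypothetical helpers, the library is assumed to have been shipped inside the unpacked archive directory, and whether GEOS_LIBRARY_PATH is honoured depends on the installed library version, so check its source first.

```python
import fnmatch
import os


def find_geos_library(root):
    """Walk `root` and return the first file named libgeos_c.so*, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if fnmatch.fnmatch(name, "libgeos_c.so*"):
                return os.path.join(dirpath, name)
    return None


def export_geos_path(root, environ=os.environ):
    """Point GEOS_LIBRARY_PATH at the library found under `root`, if any.

    Returns the path that was found, or None when nothing matched
    (in which case the environment is left untouched).
    """
    path = find_geos_library(root)
    if path is not None:
        environ["GEOS_LIBRARY_PATH"] = os.path.abspath(path)
    return path
```

With the command from the question, the archive is unpacked under ./anaconda in the task's working directory, so the mapper could call export_geos_path("./anaconda") before its first Shapely import. Hadoop streaming also has a -cmdenv NAME=VALUE option for passing environment variables to the mapper, which may be simpler when the path is known in advance.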
