
Geos, Shapely and Hadoop Streaming

I'm trying to run a Hadoop streaming job to process geospatial data. To that end, I'm using Shapely functions, which require libgeos.

However, the job fails because libgeos is not installed on the cluster.

Is there a way to ship libgeos to the cluster and have Shapely read the .so files from that directory (maybe with -archives or -files)?

Example of commands run

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D stream.num.map.output.key.fields=2 -D mapred.text.key.partitioner.options=-k1,1 -archives hdfs://namenode:port/user/anaconda.zip#anaconda -files /some/other/stuff -input /path/to/input -output /user/geo_stuff -file /home/mr_files/mapper.py -mapper "mapper.py"

Where mapper.py starts off like...

#!./anaconda/anaconda/bin/python
import shapely
from cartopy.io import shapereader
from shapely.geometry import Point
...more stuff

And this generates the following error

from shapely.geos import lgeos
File "./anaconda/anaconda/lib/python2.7/site-packages/shapely/geos.py", line 58, in <module>
_lgeos = load_dll('geos_c', fallbacks=['libgeos_c.so.1', 'libgeos_c.so'])

File "./anaconda/anaconda/lib/python2.7/site-packages/shapely/geos.py", line 54, in load_dll
libname, fallbacks or []))

OSError: Could not find library geos_c or load any of its variants ['libgeos_c.so.1', 'libgeos_c.so']

If you want to copy your files from your master node to all the core nodes on a Hadoop cluster, you can do it by running this on your master node (Key.pem is the private key you used to ssh into your master node; you'll have to copy it onto your master node before you run this):

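#!/bin/bash
# Collect the worker hostnames from the HDFS admin report.
nodes=($(hadoop dfsadmin -report | grep Hostname | sed 's/Hostname: //'))
# Copy the GEOS C API libraries to the same location on every worker.
for workerip in "${nodes[@]}"
do
    scp -i Key.pem -o UserKnownHostsFile=/dev/null \
        -o StrictHostKeyChecking=no \
        /usr/local/lib/libgeos_c* "$workerip":/usr/local/lib/
done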

If you have a libgeos_c.so shared library for the C API for GEOS in a non-standard location, you can set an environment variable to use that file:

export GEOS_LIBRARY_PATH=/path/to/libgeos_c.so.1
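In a streaming job there is no shell to run the export in, so the variable would have to be set inside the mapper itself, before Shapely is imported. Here is a minimal sketch, assuming the library files were shipped into the task's working directory (for instance with -files or -archives) and that the installed Shapely version actually reads GEOS_LIBRARY_PATH as described above. The helper name point_shapely_at_geos and the directory name geos_libs are made up for illustration:

```python
import glob
import os


def point_shapely_at_geos(lib_dir, environ=os.environ):
    """Set GEOS_LIBRARY_PATH to a libgeos_c shared object found in lib_dir.

    Returns the chosen absolute path, or None if nothing matched.
    (Hypothetical helper; not part of Shapely or Hadoop.)
    """
    # Sort so the choice is deterministic when several versions are present.
    candidates = sorted(glob.glob(os.path.join(lib_dir, "libgeos_c.so*")))
    if not candidates:
        return None
    environ["GEOS_LIBRARY_PATH"] = os.path.abspath(candidates[0])
    return environ["GEOS_LIBRARY_PATH"]


# In mapper.py this would run before "import shapely", e.g.:
#     point_shapely_at_geos("./geos_libs")   # "geos_libs" is an assumed name
#     import shapely
```

If the function returns None, the files did not arrive where the mapper expected them, which is worth checking before suspecting Shapely.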

However, you may need to ensure that the library's own dependencies are met. For example, check:

ldd /path/to/libgeos_c.so.1
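The same check can be made from Python on a worker node, which is handy when you can only run code through the job. This is a rough sketch using the standard ctypes module; the function name can_load is made up for illustration. Loading fails with OSError if the file is missing or if one of its dependencies (typically the C++ libgeos library that libgeos_c is linked against) cannot be resolved:

```python
import ctypes


def can_load(path):
    """Return (True, None) if the shared library loads, else (False, reason)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # The message usually names the file or dependency that was not found.
        return False, str(exc)
    return True, None


# Example: ok, reason = can_load("/path/to/libgeos_c.so.1")
```

Writing the reason to stderr from the mapper makes it show up in the task logs.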

See the source for libgeos.py to see how environment variables are used to find the GEOS C API shared libraries.
