In the spark-shell, I successfully used the elasticsearch-hadoop connector (specifically the one developed for Spark: elasticsearch-spark-20_2.11-5.1.2.jar). Invoking the shell:
] $SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
In the scala shell:
scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype", query="myquery")
It works perfectly. I want to do the same with PySpark. I tried:
] $SPARK_HOME/bin/pyspark --master local[2] --driver-class-path=/home/pat/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
but in the Python shell, the esRDD method is not available:
>>> sc.esRDD
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'esRDD'
The JAR was loaded correctly, because this call works:
>>> conf = {"es.resource" : "myindex/mytype", "es.nodes" : "localhost"}
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat","org.apache.hadoop.io.NullWritable","org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
Does anyone know how to use esRDD() in PySpark?
esRDD doesn't actually exist in PySpark; it is only available in the Scala API. There you need the following imports:
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
Now you can read data:
val rdd = sc.esRDD("index_name/doc_type")
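In PySpark itself, the closest equivalent is the newAPIHadoopRDD route already shown in the question. A minimal sketch, assuming the index, host, and query string below are placeholders; the query is passed through the connector's es.query setting (URI syntax or query-DSL JSON):

conf = {
    "es.resource": "myindex/mytype",    # index/type to read, as in the question
    "es.nodes": "localhost",            # Elasticsearch host
    "es.query": "?q=myfield:myvalue"    # placeholder query string
}
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf)
# EsInputFormat yields (document id, field map) pairs; the
# LinkedMapWritable values arrive in Python as plain dicts.
print(rdd.first())

This gives you query-filtered reads in PySpark, just without the sc.esRDD(...) sugar of the Scala API.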