
How to read data from Elasticsearch to Spark?

I'm trying to read data from Elasticsearch into Apache Spark using Python.

Below is the code, copied from the official documentation.

$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar

conf = {"es.resource": "index/type"}
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf,
)
rdd.first()

The above reads data from the corresponding index, but it reads the whole index.

Can you tell me how to use a query to limit the read scope?

Also, I did not find much documentation on this. For example, the conf dict seems to control the read scope, but the Elasticsearch docs only say that it is a Hadoop configuration and nothing more. I looked through the Hadoop configuration and did not find any keys and values related to ES. Do you know of any better articles about this?

You can add an es.query setting to your configuration dict, like this:

conf["es.query"] = "?q=me*"

The es.query setting is documented in more detail in the elasticsearch-hadoop configuration reference.
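
For reference, here is a fuller end-to-end sketch. The es.query setting accepts either the URI-style syntax ("?q=...") shown above or a full Elasticsearch query DSL body passed as a JSON string. The host, port, and index name below are placeholders, not values from your setup:

from pyspark import SparkContext

sc = SparkContext(appName="es-query-example")

# es.query can also be a full query DSL body as a JSON string.
es_conf = {
    "es.nodes": "localhost",      # Elasticsearch host (placeholder)
    "es.port": "9200",            # Elasticsearch HTTP port (placeholder)
    "es.resource": "index/type",  # index/type to read from
    "es.query": '{"query": {"match": {"title": "spark"}}}',
}

rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)

# Each element is a (document id, document fields) pair;
# only documents matching the query are returned.
print(rdd.first())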
