
How to read data from Elasticsearch to Spark?

I'm trying to read data from Elasticsearch into Apache Spark using Python.

Below is the code, copied from the official documentation.

$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar

conf = {"es.resource": "index/type"}
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf,
)
rdd.first()

The above reads data from the corresponding index, but it reads the whole index.

Can you tell me how to use a query to limit the read scope?

Also, I did not find much documentation on this. For example, the conf dict seems to control the read scope, but the Elasticsearch docs only say that it is a Hadoop configuration and nothing more. I looked through the Hadoop configuration and did not find any keys and values related to ES. Do you know of any better articles about this?

You can add an es.query setting to your configuration dict, like this:

conf["es.query"] = "?q=me*"

The es.query setting is documented in more detail in the elasticsearch-hadoop configuration reference.
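
For reference, here is a fuller end-to-end sketch. The es.query setting accepts either the URI-style syntax ("?q=...") shown above or a full Elasticsearch query DSL body passed as a JSON string. The host, port, and index name below are placeholders, not values from your setup:

from pyspark import SparkContext

sc = SparkContext(appName="es-query-example")

# es.query can also be a full query DSL body as a JSON string.
es_conf = {
    "es.nodes": "localhost",      # Elasticsearch host (placeholder)
    "es.port": "9200",            # Elasticsearch HTTP port (placeholder)
    "es.resource": "index/type",  # index/type to read from
    "es.query": '{"query": {"match": {"title": "spark"}}}',
}

rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)

# Each element is a (document id, document fields) pair;
# only documents matching the query are returned.
print(rdd.first())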
