
How to read data from Elasticsearch to Spark?

I'm trying to read data from Elasticsearch into Apache Spark using Python.

Below is the code copied from the official documentation.

$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar

conf = {"es.resource": "index/type"}
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf
)
rdd.first()

The code above can read data from the corresponding index, but it reads the whole index.

Can you tell me how to use a query to limit the scope of the read?

Also, I did not find much documentation on this. For example, it seems the conf dict controls the read scope, but the Elasticsearch documentation just says it is a Hadoop configuration and nothing more. I went through the Hadoop configuration and did not find the corresponding keys and values for ES. Do you know of any better articles about this?

You can add an es.query setting to your configuration like this:

conf["es.query"] = "?q=me*"
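
es.query also accepts a full Elasticsearch query DSL body as a JSON string, which is useful once the filter grows beyond a simple URI query. A minimal sketch, reusing the index/type placeholder from the question; the query_string query here is just an illustrative stand-in for ?q=me*:

conf = {
    "es.resource": "index/type",
    # query DSL form roughly equivalent to the URI query ?q=me*
    "es.query": '{"query": {"query_string": {"query": "me*"}}}'
}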

The elasticsearch-hadoop configuration documentation describes this setting in more detail.
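
Putting it together, a complete sketch of a scoped read from PySpark; es.nodes and es.port are assumptions for a local Elasticsearch instance, and index/type is the placeholder from the question:

$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar

conf = {
    "es.nodes": "localhost",      # assumption: Elasticsearch on this host
    "es.port": "9200",            # assumption: default HTTP port
    "es.resource": "index/type",  # placeholder index/type from the question
    "es.query": "?q=me*"          # restricts the read to matching documents
}
rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=conf
)
rdd.first()  # first (key, document) pair from the restricted result set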
