Reading ES from Spark with the elasticsearch-spark connector: all fields are returned

Question

I've done some experiments in the spark-shell with the elasticsearch-spark connector. Invoking Spark:

$SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar

In the Scala shell:

scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery")

It works well: the result contains the right records, as specified in myquery. The only problem is that I get all the fields, even if I specify a subset of them in the query. Example:

myquery = """{"query":..., "fields":["a","b"], "size":10}"""

returns all the fields, not only a and b (BTW, I noticed that the size parameter is not taken into account either: the result contains more than 10 records). It may be important to add that the fields are nested: a and b are actually doc.a and doc.b.

Is it a bug in the connector, or do I have the wrong syntax?

Answer 1

The Spark Elasticsearch connector uses fields itself, so you cannot apply a projection through the query.

If you want fine-grained control over the mapping, you should use a DataFrame instead, which is basically an RDD plus a schema.

Predicate pushdown should also be enabled, so that Spark SQL is translated (pushed down) into Elasticsearch Query DSL.

Now a semi-full example:

val myQuery = """{"query":...}""" // query body elided, as in the question
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("query", myQuery)
  .option("pushdown", "true")
  .load("myindex/mytype")
  .limit(10)        // instead of size
  .select("a", "b") // instead of fields

Answer 2

How about calling:

scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery", Map[String, String] ("es.read.field.include"->"a,b"))
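Since the question mentions that the fields are nested (doc.a and doc.b), here is a minimal sketch of building that settings map from a list of field names. The object name EsFieldFilter and its include method are made up for illustration and are not part of the connector. Whether es.read.field.include accepts dotted paths such as doc.a is an assumption I have not verified against this connector version.

```scala
// Hypothetical helper (not part of elasticsearch-spark): builds the settings
// map that is passed as the third argument of esRDD.
object EsFieldFilter {
  // "es.read.field.include" is the setting used in the call above;
  // its value is a comma-separated list of field names.
  def include(fields: String*): Map[String, String] =
    Map("es.read.field.include" -> fields.mkString(","))
}

// Assumed usage in spark-shell, after `import org.elasticsearch.spark._`
// and with myquery defined as in the question:
//   val es_rdd = sc.esRDD("myindex/mytype", myquery,
//                         EsFieldFilter.include("doc.a", "doc.b"))
```

Keeping the field list in one place makes it easy to try both a,b and doc.a,doc.b and compare what comes back.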

Answer 3

Do you want to restrict the fields returned from the Elasticsearch _search HTTP API? (I guess to improve download speed.)

First of all, use an HTTP proxy to see what the elasticsearch-hadoop plugin is doing (I use Apache Zeppelin with Charles proxy on macOS). This will help you understand how pushdown works.

There are several ways to achieve this:

1. DataFrame and pushdown

You specify the fields, and the plugin "forwards" them to ES (here via the _source parameter):

POST ../events/_search?search_type=scan&scroll=5m&size=50&_source=client&preference=_shards%3A3%3B_local

(-) Does not fully work for nested fields.

(+) Simple, straightforward, easy to read

2. RDD & query fields

With JavaEsSpark.esRDD, you can specify fields inside the JSON query, like you did. This only works with RDDs (with a DataFrame, the fields are not sent).

(-) No DataFrame -> not the Spark way

(+) More flexible, more control
