I am using the elasticsearch-hadoop/spark library to create Spark RDDs from ElasticSearch queries.
The esRDD method returns the raw document (_source, in ElasticSearch terms) and the document's id (_id in ES), but I also need additional information about the returned documents, such as the ElasticSearch index and type each document comes from (this information is always available from the ES REST API).
How can I get the index and type information of documents in the RDD returned by the esRDD method?
EDIT
I am querying multiple indices, i.e. my call to esRDD looks like this:
sparkContext.esRDD("index*/entities", query)
and the actual indices are "index1", "index2", etc. So I want to know which specific index each of the entities in the resulting RDD came from.
In case anyone stumbles upon this in the future:
The solution was to set the es.read.metadata setting to true (see here). This adds a _metadata field to each document in the esRDD, which contains info such as the document's index, type, id, version, etc.
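For reference, here is a minimal sketch of how this might look in Scala. It assumes a local Spark master and an ElasticSearch node at localhost:9200. The app name, the placeholder query, and the object name are hypothetical. I am also assuming the metadata keys are named _index and _type and that the _metadata value comes back as a Scala map, so check this against what your version actually returns.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

object ReadIndexMetadata {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-read-metadata")     // hypothetical app name
      .setMaster("local[*]")              // assumption: local run
      .set("es.nodes", "localhost:9200")  // assumption: local ES node
      .set("es.read.metadata", "true")    // the setting from the answer above

    val sc = new SparkContext(conf)

    val query = "?q=*" // placeholder query, substitute your own

    // Each element is (document id, document fields as a map)
    val rdd = sc.esRDD("index*/entities", query)

    // Pull the index and type out of the _metadata field of each document.
    // Assumption: the keys are "_index" and "_type".
    val withOrigin = rdd.map { case (id, doc) =>
      val meta: scala.collection.Map[String, AnyRef] = doc.get("_metadata") match {
        case Some(m: scala.collection.Map[_, _]) =>
          m.asInstanceOf[scala.collection.Map[String, AnyRef]]
        case _ =>
          Map.empty[String, AnyRef] // metadata missing or in an unexpected shape
      }
      (id, meta.get("_index").map(_.toString), meta.get("_type").map(_.toString))
    }

    withOrigin.take(5).foreach(println)
    sc.stop()
  }
}
```

The setting could also be passed per call, e.g. sc.esRDD("index*/entities", query, Map("es.read.metadata" -> "true")), if you do not want it applied to every read in the application.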