How to iterate over hadoop MapWritable in Spark & Elasticsearch
I'm not familiar with either Spark or Scala. I've read some articles on the Internet. I can successfully fetch documents from Elasticsearch using Spark, but I'm stuck on how to pull fields out of those documents.

I've got 33,617 documents:
import ...
val conf = new JobConf()
conf.set("es.resource", "index-name/type-name")
conf.set("es.nodes", "hostname1:9200,hostname2:9200")
conf.set("es.query", "{...}")
val esRDD = sc.newAPIHadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]], classOf[Text], classOf[MapWritable])
scala> esRDD.count() // That's GOOD!
res11: Long = 33617
scala> esRDD.take(5).foreach(row => println(row._2))
{@version=1, field1=a, ...}
{@version=1, field1=a, ...}
{@version=1, field1=b, ...}
{@version=1, field1=b, ...}
{@version=1, field1=b, ...}
I don't know how to use org.apache.hadoop.io.MapWritable in Scala.
// Error!!
scala> esRDD.take(5).foreach(row => println(row._2("field1")))
error: org.apache.hadoop.io.MapWritable does not take parameters
esRDD.take(5).foreach(row => println(row._2("field1")))
// Oops. null is printed
scala> esRDD.take(5).foreach(row => println(row._2.get("field1")))
null
null
null
null
null
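The nulls above are most likely because MapWritable keys are org.apache.hadoop.io.Text objects, not plain Strings, so `get("field1")` looks up a String key that never matches any stored Text key. A minimal self-contained sketch of that mismatch, using a hypothetical `TextKey` wrapper class in place of the real Text (assumption: same Java-Map lookup semantics):

```scala
import java.util.{HashMap => JHashMap}

// Hypothetical stand-in for org.apache.hadoop.io.Text: a wrapper type
// whose equals() only matches other wrappers, never a raw String.
final case class TextKey(value: String)

// MapWritable implements java.util.Map keyed by Writable; model that here.
val doc = new JHashMap[TextKey, String]()
doc.put(TextKey("field1"), "a")

val missing = doc.get("field1")          // null: a String key never matches
val found   = doc.get(TextKey("field1")) // "a": the wrapped key matches
```

With the real Hadoop classes the analogous lookup should read `row._2.get(new Text("field1"))`, and since MapWritable implements java.util.Map you can also walk all fields via `row._2.entrySet`.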
My final goal is to aggregate by field1 and print the counts like this:
scala> esRDD.groupBy(???).mapValues(_.size)
Map(a => 2, b => 3) // How to get this output??
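The desired output can be sketched on a plain Scala collection first, using the same groupBy/mapValues pattern; the five field1 values below mirror the sample documents printed earlier:

```scala
// field1 values taken from the five sample documents shown above.
val field1Values = Seq("a", "a", "b", "b", "b")

// Group identical values together, then count each group.
val counts: Map[String, Int] =
  field1Values.groupBy(identity).mapValues(_.size).toMap
// counts == Map("a" -> 2, "b" -> 3)
```

The same pattern carries over to the RDD once the field values have been extracted from each document.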
But I couldn't figure it out.
$ bin/spark-shell --master local --jars jars/elasticsearch-spark_2.11-2.2.0.jar
scala> import org.elasticsearch.spark._
scala> val rdd: RDD[(String, Map[String, Any])] = sc.esRDD("index-name/type-name")
<console>:45: error: not found: type RDD
val rdd: RDD[(String, Map[String, Any])] = sc.esRDD("index-name/type-name")
^
scala> sc.esRDD("index-name/type-name")
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.elasticsearch.spark.rdd.EsSpark$.esRDD(EsSpark.scala:26)
at org.elasticsearch.spark.package$SparkContextFunctions.esRDD(package.scala:20)
Elasticsearch-hadoop has native support for Spark, and I would recommend using it; the API is much simpler:
import org.elasticsearch.spark._
val rdd: RDD[(String, Map[String, Any])] = sc.esRDD("index-name/type-name")
It's a simple RDD of tuples, where the key is the document ID and the Map represents your ES document.

You can map it into a different tuple like this:
val mapped = rdd.map{ case(id, doc) => (doc.get("field1").get, 1) }
I'm putting 1 since it seems you don't need the doc anywhere else. Then perform a groupByKey and a map:
mapped.groupByKey().map { case (key, values) => (key, values.size) }
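Putting the two steps together, here is a local sketch of the same pipeline on plain Scala pairs; the IDs and documents are hypothetical stand-ins for the (id, doc) tuples that esRDD would return:

```scala
// Hypothetical (docId, document) pairs standing in for sc.esRDD output.
val docs: Seq[(String, Map[String, Any])] = Seq(
  ("1", Map("field1" -> "a")), ("2", Map("field1" -> "a")),
  ("3", Map("field1" -> "b")), ("4", Map("field1" -> "b")),
  ("5", Map("field1" -> "b"))
)

// Step 1: keep only the field of interest, paired with a count of 1.
val mapped = docs.map { case (_, doc) => (doc("field1"), 1) }

// Step 2: group by key and count each group, as groupByKey + map would on the RDD.
val counts = mapped.groupBy(_._1).map { case (key, pairs) => (key, pairs.size) }
// counts == Map("a" -> 2, "b" -> 3)
```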
Also, if you're using only the Spark connector you don't need the whole es-hadoop dependency, which is rather big; you can just use elasticsearch-spark.

For more information you can check the documentation.