
How to iterate over hadoop MapWritable in Spark & Elasticsearch

I'm not familiar with either Spark or Scala. I've read some articles on the Internet. I can get documents from Elasticsearch using Spark successfully, but I'm stuck on how to pull fields out of the documents.

What I've done

I've got 33,617 documents:

import ...

val conf = new JobConf()

conf.set("es.resource", "index-name/type-name")
conf.set("es.nodes", "hostname1:9200,hostname2:9200")
conf.set("es.query", "{...}")

val esRDD = sc.newAPIHadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]], classOf[Text], classOf[MapWritable])


scala> esRDD.count() // That's GOOD!
res11: Long = 33617

scala> esRDD.take(5).foreach(row => println(row._2))
{@version=1, field1=a, ...}
{@version=1, field1=a, ...} 
{@version=1, field1=b, ...}
{@version=1, field1=b, ...}
{@version=1, field1=b, ...}

Question 1: How do I print a specific field?

I don't know how to use org.apache.hadoop.io.MapWritable in Scala.

// Error!!
scala> esRDD.take(5).foreach(row => println(row._2("field1")))
error: org.apache.hadoop.io.MapWritable does not take parameters
              esRDD.take(5).foreach(row => println(row._2("field1")))

// Oops. null is printed
scala> esRDD.take(5).foreach(row => println(row._2.get("field1")))
null
null
null
null
null
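
The nulls here are most likely because the keys inside the MapWritable produced by EsInputFormat are org.apache.hadoop.io.Text objects rather than plain Strings, so get("field1") looks up a key that does not exist. A minimal sketch of reading the field with a Text key (assuming the field really is named field1):

import org.apache.hadoop.io.Text

// Look the field up with a Text key; MapWritable.get returns the Writable value,
// which println renders via toString.
esRDD.take(5).foreach { case (_, doc) => println(doc.get(new Text("field1"))) }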

Question 2: How do I group by a field and count?

My final goal is to aggregate by field1 and print the count of each value, like this:

scala> esRDD.groupBy(???).mapValues(_.size)
Map(a => 2, b => 3) // How to get this output??

But I couldn't figure it out.
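
For reference, here is one way to get that kind of result directly from the MapWritable pairs, as a sketch (it assumes field1 sits under a Text key and its values render sensibly with toString):

import org.apache.hadoop.io.Text

// Extract field1 from every document and count how often each value occurs.
// countByValue() returns the result to the driver as a Map[String, Long].
val counts = esRDD
  .map { case (_, doc) => doc.get(new Text("field1")).toString }
  .countByValue()
// e.g. Map(a -> 2, b -> 3)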

@Mateusz's answer test

$ bin/spark-shell --master local --jars jars/elasticsearch-spark_2.11-2.2.0.jar

scala> import org.elasticsearch.spark._

scala> val rdd: RDD[(String, Map[String, Any])] = sc.esRDD("index-name/type-name")
<console>:45: error: not found: type RDD
          val rdd: RDD[(String, Map[String, Any])] = sc.esRDD("index-name/type-name")
                   ^

scala> sc.esRDD("index-name/type-name")
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
        at org.elasticsearch.spark.rdd.EsSpark$.esRDD(EsSpark.scala:26)
        at org.elasticsearch.spark.package$SparkContextFunctions.esRDD(package.scala:20)

Answer (by Mateusz)

Elasticsearch-hadoop has native support for Spark, and I would recommend using it; the API is much simpler:

import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._

val rdd: RDD[(String, Map[String, Any])] = sc.esRDD("index-name/type-name")

It's a simple rdd of tuples, where the key is the document ID and the Map represents your ES document.

You can map it into a different tuple like this:

val mapped = rdd.map{ case(id, doc) => (doc.get("field1").get, 1) }

I'm putting 1 since it seems you don't need the doc anywhere else. Then perform a groupByKey and a map:

mapped.groupByKey().map { case (key, values) => (key, values.size) }
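
If only the counts are needed, the standard Spark countByKey (or reduceByKey) can stand in for the groupByKey above and avoids materializing the grouped values; a minimal sketch:

// countByKey returns the counts to the driver as a Map[K, Long].
val counts = mapped.countByKey()             // e.g. Map(a -> 2, b -> 3)

// Or keep the result distributed by summing the 1s per key:
val countsRdd = mapped.reduceByKey(_ + _)    // RDD of (field1 value, count) pairs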

Also, if you're using only the Spark connector, you don't need the whole es-hadoop dependency, which is rather big; you can use just elasticsearch-spark.

For more information, you can check the documentation.
