
Loading a denormalized Hive table into Elasticsearch with Spark

So, I've found a lot of answers for the opposite problem, but not this one. It sounds a bit silly, because Elasticsearch works with denormalized data anyway, but this is the problem we have. We have a table that is formatted as such:

+----+--------+--------+--------+--------+---------+
| id | attr_1 | attr_2 | attr_3 | attr_4 | fst_nm  |
+----+--------+--------+--------+--------+---------+
|  1 |   2984 |   0324 |  38432 |        | john    |
|  2 |   2343 |  28347 | 238493 |  34923 | patrick |
|  3 |   3293 |   3823 |  38423 |  34823 | george  |
+----+--------+--------+--------+--------+---------+

Where attr_x represents the same exact thing; let's say they're foreign keys to another table from when this table lived in a normalized world. So, all of the attrs existed in a separate table. However, the tables were denormalized and everything was dumped into one long table. Normally that's not too big of an issue to load into Elasticsearch, but this table is massive, about 1000+ columns. We want to take these attrs and store them as an array in Elasticsearch, something like this:

_source: {
  "id": 1,
  "fst_nm": "john",
  "attrs": [
    2984,
    0324,
    38432
  ]
}

Instead of:

_source: {
  "id": 1,
  "fst_nm": "john",
  "attr_1": 2984,
  "attr_2": 0324,
  "attr_3": 38432
}

When we use a default Spark process, it just creates the second, flat Elasticsearch document shown above. A couple of thoughts I had: the first was to create a new table of the attrs by unpivoting them, then query that table by id to get the attrs, so it would look something like this:

+-----+--------+
| id  |  attr  |
+-----+--------+
|   1 |   2984 |
|   1 |   0324 |
|   1 |  38432 |
|   2 |   2343 |
| ... |    ... |
|   3 |  34823 |
+-----+--------+

We could then use Spark SQL to query this new table by id to get the attrs, but then how would we insert them into Elasticsearch as an array using Spark?
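
For reference, something like this is the unpivot-and-regroup I'm picturing in Spark. It's untested; df stands in for the wide table, and I'm just using the attr_ prefix to pick out the columns:

import org.apache.spark.sql.functions.{array, col, collect_list, explode}

// one row per (id, attr), dropping the empty attr slots
val attrCols = df.columns.filter(_.startsWith("attr")).map(col(_))
val unpivoted = df
  .select(col("id"), explode(array(attrCols: _*)).as("attr"))
  .where(col("attr").isNotNull)

// regroup into one array per id and join the other columns back on
val regrouped = unpivoted
  .groupBy("id")
  .agg(collect_list("attr").as("attrs"))
  .join(df.select("id", "fst_nm"), Seq("id"))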

Another idea I had was to create a new table in Hive and change the attrs into the Hive complex type array, but I don't know exactly how I'd do that. Also, if we use Spark to query the table in Hive and the results come back as an array, is it an easy dump into Elasticsearch?
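
Something along these lines is what I had in mind for the Hive side, but I haven't tried it; the table names are made up, spark is a SparkSession with Hive support, and the short column list stands in for our real 1000+ columns:

// create a Hive table whose attrs column is an array type
spark.sql("""
  CREATE TABLE attrs_as_array (id BIGINT, fst_nm STRING, attrs ARRAY<BIGINT>)
  STORED AS PARQUET
""")

// populate it by packing the attr_* columns into one array per row
spark.sql("""
  INSERT INTO attrs_as_array
  SELECT id, fst_nm, array(attr_1, attr_2, attr_3, attr_4)
  FROM wide_table
""")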

As for the data transformation part, you can use the array function to collect several columns into one array column, and then use .write.json("jsonfile") to write to JSON files:

import org.apache.spark.sql.functions.{array, col}

// pick out every attr_* column, then collect them into a single array column
val attrs = df.columns.filter(_.startsWith("attr")).map(col(_))

val df_array = df.withColumn("attrs", array(attrs:_*)).select("id", "fst_nm", "attrs")

df_array.toJSON.collect
//res8: Array[String] = Array({"id":1,"fst_nm":"john","attrs":[2984,324,38432,null]}, {"id":2,"fst_nm":"patrick","attrs":[2343,28347,238493,34923]})

Write to files:

df_array.write.json("/PATH/TO/jsonfile")
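
If you'd rather skip the intermediate JSON files and index df_array directly, the elasticsearch-hadoop connector (the elasticsearch-spark artifact, assuming it's on your classpath) adds a saveToEs method on DataFrames; the index name and node address below are just placeholders:

import org.elasticsearch.spark.sql._

// the attrs column is serialized as a JSON array in each document
df_array.saveToEs("my_index/doc", Map(
  "es.nodes"      -> "es-host:9200",
  "es.mapping.id" -> "id"   // use the id column as the Elasticsearch _id
))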
