
Writing JSON from HDFS to Elasticsearch using elasticsearch-hadoop MapReduce

We have some JSON data stored in HDFS, and we are trying to use the elasticsearch-hadoop MapReduce integration to ingest it into Elasticsearch.

The code we used is very simple:

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class TestOneFileJob extends Configured implements Tool {

    public static class Tokenizer extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable arg0, Text value, OutputCollector<LongWritable, Text> output,
                Reporter reporter) throws IOException {

            output.collect(arg0, value);
        }

    }

    @Override
    public int run(String[] args) throws Exception {

        JobConf job = new JobConf(getConf(), TestOneFileJob.class);

        job.setJobName("demo.mapreduce");
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(EsOutputFormat.class);
        job.setMapperClass(Tokenizer.class);
        job.setSpeculativeExecution(false);

        FileInputFormat.setInputPaths(job, new Path(args[1]));

        job.set("es.resource.write", "{index_name}/live_tweets");

        job.set("es.nodes", "els-test.css.org");

        job.set("es.input.json", "yes");
        job.setMapOutputValueClass(Text.class);

        JobClient.runJob(job);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TestOneFileJob(), args));
    }
}

This code worked fine, but we have two issues with it.

The first issue is the value of the es.resource.write property. Currently the index name is supplied by the index_name property taken from the JSON document itself.

If the JSON contains an array property such as

{
"tags" : [{"tag" : "tag1"}, {"tag" : "tag2"}]
}

How can we configure es.resource.write to take the first tag value, for example?

We tried {tags.tag} and {tags[0].tag}, but neither worked.

The other issue: how can we make the job index the JSON document under both values of the tags property?

We solved the two problems by doing the following:

1- In the run method we set the value of es.resource.write as follows:

job.set("es.resource.write", "{tag}/live_tweets");
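This multi-resource pattern works because elasticsearch-hadoop resolves a {field} reference in es.resource.write against each individual document as it is written. So a document that carries a top-level tag property, for example:

```json
{"tag": "tag1", "tags": [{"tag": "tag1"}, {"tag": "tag2"}]}
```

is routed to the index named tag1, and the next steps ensure every emitted document carries such a top-level field.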

2- In the map function we convert the JSON into an object using the Gson library:

Object currentValue = gson.fromJson(jsonString, Object.class);
  • The object here is the POJO representation of the JSON document.

3- From this object we extract the tag we want and add its value as a new top-level property to the JSON.
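The extraction step can be sketched as follows. This is a minimal, hypothetical TagExtractor helper, assuming Gson is on the job's classpath; it deserializes into Gson's generic Map/List representation rather than a dedicated POJO class:

```java
import com.google.gson.Gson;
import java.util.List;
import java.util.Map;

// Hypothetical helper sketching steps 2 and 3: parse the JSON with Gson
// and pull the first tag value out of the "tags" array.
public class TagExtractor {
    private static final Gson gson = new Gson();

    @SuppressWarnings("unchecked")
    public static String firstTag(String json) {
        // Gson deserializes untyped JSON objects into Maps and arrays into Lists.
        Map<String, Object> doc = gson.fromJson(json, Map.class);
        List<Map<String, Object>> tags = (List<Map<String, Object>>) doc.get("tags");
        return (String) tags.get(0).get("tag");
    }
}
```

Deserializing a typed POJO (a class with a List<Tag> field) would avoid the unchecked casts, at the cost of defining classes for each document shape.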

The previous steps solved the first problem. As for the second problem (storing the same JSON in multiple indexes, one per tag), we simply loop through the tags in the JSON, change the tag property we added, and pass the JSON to the collector again for each tag. Below is the code for this step.

@Override
public void map(LongWritable arg0, Text value, OutputCollector<LongWritable, Text> output,
        Reporter reporter) throws IOException {

    // getTags parses the document and returns all of its tag values
    List<String> tags = getTags(value.toString());

    for (String tag : tags) {
        // inject the current tag as the first property of the document
        String newJson = value.toString().replaceFirst("\\{", "{\"tag\":\"" + tag + "\",");
        output.collect(arg0, new Text(newJson));
    }
}
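For the sample document from the question, the replaceFirst call injects the current tag as the first property of each emitted copy. A minimal standalone check of that string manipulation:

```java
public class TagInjectDemo {
    // Inject a "tag" property as the first field of a JSON object string,
    // mirroring the replaceFirst call in the map function above.
    static String inject(String json, String tag) {
        return json.replaceFirst("\\{", "{\"tag\":\"" + tag + "\",");
    }

    public static void main(String[] args) {
        String json = "{\"tags\":[{\"tag\":\"tag1\"},{\"tag\":\"tag2\"}]}";
        System.out.println(inject(json, "tag1"));
        // prints {"tag":"tag1","tags":[{"tag":"tag1"},{"tag":"tag2"}]}
    }
}
```

One caveat: replaceFirst treats its second argument as a regex replacement string, so tag values containing $ or \ would need to be escaped with Matcher.quoteReplacement first.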
