XML processing using Apache Flink

Question

I am new to Apache Flink and to distributed processing in general. I have already gone through the Flink quick setup guide and understand the basics of MapFunctions, but I couldn't find a concrete example of XML processing. I have read about Hadoop's XmlInputFormat, but I am unable to understand how to use it.

I have a huge (100 MB) XML file in the format below:

<Class>
    <student>.....</student>
    <student>.....</student>
    .
    .
    .
    <student>.....</student>
</Class>

The Flink job would read the file from HDFS and start processing it (basically, iterate through all the student elements).

I want to know (in layman's terms) how I can process the XML and create a list of student objects.

A simple explanation in layman's terms would be much appreciated.

Answer 1

Apache Mahout's XmlInputFormat for Apache Hadoop extracts the text between two tags (in your case probably <student> and </student>). Flink provides wrappers to use Hadoop InputFormats, e.g., via the readHadoopFile() method of ExecutionEnvironment.
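
To make "extracts the text between two tags" concrete, here is a minimal plain-Java sketch of the idea. It is not Mahout's implementation and uses no Flink or Hadoop classes; the class name TagExtractor and the method extract are hypothetical names chosen for illustration, and the sketch works on an in-memory string rather than on HDFS splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating tag-delimited record extraction.
class TagExtractor {

    // Returns every substring that begins with startTag and ends with the
    // next endTag, in document order. Nested elements of the same name
    // are not handled in this sketch.
    static List<String> extract(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record, stop scanning
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Class>\n"
                + "  <student>first</student>\n"
                + "  <student>second</student>\n"
                + "</Class>";
        for (String record : extract(xml, "<student>", "</student>")) {
            System.out.println(record);
        }
    }
}
```

Each returned string is one complete student element, which a later map step can turn into a student object. The practical benefit of an input format that works this way is that records may span several lines.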

If you do not want to use the XmlInputFormat and if your XML file is nicely formatted, i.e., each student record is on a single line, you can use Flink's regular TextInputFormat, which reads the file line by line. A subsequent FlatMap function can parse all student lines and filter out all others.
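
The per-line logic such a FlatMap function would apply can be sketched in plain Java as follows. This is an illustration under assumptions, not Flink code: the question elides the contents of each student element, so the <name> child element and the Student class are hypothetical, and a java.util.List stands in for the Collector that Flink passes to flatMap.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the body of a FlatMap over text lines.
class StudentLineParser {

    // Hypothetical target type; the real fields depend on the actual XML.
    static final class Student {
        final String name;

        Student(String name) {
            this.name = name;
        }
    }

    // Emits one Student for a line holding a complete <student> element
    // and nothing for any other line (for example <Class> or </Class>).
    static void flatMap(String line, List<Student> out) throws Exception {
        String trimmed = line.trim();
        if (!trimmed.startsWith("<student>") || !trimmed.endsWith("</student>")) {
            return; // filter out non-student lines
        }
        // Parse the single-line fragment as a tiny XML document.
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(trimmed)))
                .getDocumentElement();
        // Assumption: each student has a <name> child element.
        NodeList names = root.getElementsByTagName("name");
        String name = names.getLength() > 0 ? names.item(0).getTextContent() : "";
        out.add(new Student(name));
    }

    public static void main(String[] args) throws Exception {
        String[] lines = {
            "<Class>",
            "    <student><name>Ann</name></student>",
            "    <student><name>Bob</name></student>",
            "</Class>"
        };
        List<Student> students = new ArrayList<>();
        for (String line : lines) {
            flatMap(line, students);
        }
        for (Student s : students) {
            System.out.println(s.name);
        }
    }
}
```

In an actual Flink job the same body would sit inside a FlatMapFunction and call out.collect(...) on the Collector instead of adding to a list. This only works when every student record fits on exactly one line; otherwise a tag-based input format is the safer choice.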
