
XML processing using Apache Flink

I am new to Apache Flink and to distributed processing in general. I have already gone through the Flink quick setup guide and understand the basics of MapFunctions, but I couldn't find a concrete example of XML processing. I have read about Hadoop's XmlInputFormat, but I don't understand how to use it.

I have a large (100 MB) XML file in the following format:

<Class>
    <student>.....</student>
    <student>.....</student>
    .
    .
    .
    <student>.....</student>
</Class>

The Flink job would read the file from HDFS and process it (basically iterating over all the student elements).

I want to know, in layman's terms, how I can process the XML and create a list of student objects.

A simple explanation would be much appreciated.

Apache Mahout's XmlInputFormat for Apache Hadoop extracts the text between two tags (in your case, probably <student> and </student>). Flink provides wrappers for using Hadoop InputFormats, e.g., via the readHadoopFile() method of ExecutionEnvironment.
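
To make "extracts the text between two tags" concrete, here is a minimal plain-Java sketch of the idea. It is not Mahout's implementation and does not use Flink or Hadoop; the class name TagRecordExtractor and the method extractRecords are made up for illustration. It only shows what kind of record you would get per element: one string per <student>...</student> block.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical helper, NOT Mahout's XmlInputFormat.
class TagRecordExtractor {

    /** Returns every substring that starts with startTag and ends with the next endTag. */
    static List<String> extractRecords(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: ignore it
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Class>\n"
                + "    <student><name>Ann</name></student>\n"
                + "    <student><name>Bob</name></student>\n"
                + "</Class>";
        for (String record : extractRecords(xml, "<student>", "</student>")) {
            System.out.println(record);
        }
    }
}
```

With the Hadoop input format, each such record would arrive as one element of your Flink data set, and a MapFunction could then turn it into a student object.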

If you do not want to use XmlInputFormat and your XML file is nicely formatted, i.e., each student record is on a single line, you can use Flink's regular TextInputFormat, which reads the file line by line. A subsequent FlatMap function can then parse the student lines and filter out all others.
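
The parse-and-filter step can be sketched in plain Java as follows. This is only a sketch under assumptions: the Student class is hypothetical (the question does not show what a <student> element contains, so it keeps just the raw inner text), and a java.util.function.Consumer stands in for Flink's Collector so the example runs without Flink on the classpath. In a real job the same logic would go into the flatMap(value, out) method of a FlatMapFunction<String, Student>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical record type: holds only the raw text inside <student>...</student>.
class Student {
    final String content;

    Student(String content) {
        this.content = content;
    }
}

class StudentLineParser {
    private static final String OPEN = "<student>";
    private static final String CLOSE = "</student>";

    // Mirrors the shape of a flatMap call: emits one Student for a
    // student line and nothing for any other line (e.g. <Class>, </Class>).
    static void flatMap(String line, Consumer<Student> out) {
        String t = line.trim();
        if (t.length() >= OPEN.length() + CLOSE.length()
                && t.startsWith(OPEN) && t.endsWith(CLOSE)) {
            String inner = t.substring(OPEN.length(), t.length() - CLOSE.length());
            out.accept(new Student(inner));
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "<Class>",
            "    <student>Ann</student>",
            "    <student>Bob</student>",
            "</Class>"
        };
        List<Student> students = new ArrayList<>();
        for (String line : lines) {
            flatMap(line, students::add);
        }
        for (Student s : students) {
            System.out.println(s.content);
        }
    }
}
```

Note that this only works when every student element really sits on one line; for records spanning several lines, the tag-based input format is the safer choice.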
