I am new to Apache Flink and to distributed processing in general. I have already gone through the Flink quick-setup guide and understand the basics of MapFunctions, but I couldn't find a concrete example of XML processing. I have read about Hadoop's XmlInputFormat, but I am unable to understand how to use it.
My need is: I have a huge (100 MB) XML file in the format below,
<Class>
<student>.....</student>
<student>.....</student>
.
.
.
<student>.....</student>
</Class>
The Flink processor would read the file from HDFS and start processing it (basically iterating through all the student elements).
I want to know (in layman's terms) how I can process the XML and create a list of student objects.
A simple, layman's explanation would be much appreciated.
Apache Mahout's XmlInputFormat for Apache Hadoop extracts the text between two tags (in your case probably <student> and </student>). Flink provides wrappers to use Hadoop InputFormats, e.g., via the readHadoopFile() method of ExecutionEnvironment.
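Wiring that together might look roughly like the sketch below. This is not runnable on its own: it assumes the flink-hadoop-compatibility module and Mahout's XmlInputFormat are on the classpath, the Mahout package name varies between releases, and the HDFS path is a placeholder. The `xmlinput.start`/`xmlinput.end` keys are how XmlInputFormat learns which tags delimit one record.

```java
// Sketch only: needs flink-java, flink-hadoop-compatibility and the Mahout
// XmlInputFormat jar. Package of XmlInputFormat depends on the Mahout version.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class StudentXmlJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Tell XmlInputFormat which tags delimit one record.
        Job job = Job.getInstance();
        job.getConfiguration().set("xmlinput.start", "<student>");
        job.getConfiguration().set("xmlinput.end", "</student>");

        // Each record is an (offset, fragment) pair; the Text value holds
        // everything from <student> to </student>, inclusive.
        DataSet<Tuple2<LongWritable, Text>> rawStudents = env.readHadoopFile(
                new XmlInputFormat(), LongWritable.class, Text.class,
                "hdfs:///path/to/class.xml", // placeholder path
                job);

        // A map/flatMap over rawStudents can then parse each XML fragment
        // into a Student POJO.
    }
}
```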
If you do not want to use the XmlInputFormat, and if your XML file is nicely formatted, i.e., each student record is on a single line, you can use Flink's regular TextInputFormat, which reads the file line by line. A subsequent FlatMap function can parse the student lines and filter out all others.
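The parsing step of that FlatMap can be plain JDK code. A minimal sketch, assuming each student element fits on one line and carries a hypothetical <name> child; in a real job the static method below would sit inside a FlatMapFunction<String, Student>, calling out.collect(...) only for lines that parse:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class StudentParser {

    // Parse one "<student>...</student>" line into a student name,
    // or return null so the caller can filter the line out.
    // Inside a Flink FlatMapFunction you would collect() non-null results.
    public static String parseStudentName(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("<student>") || !trimmed.endsWith("</student>")) {
            return null; // header, footer, or malformed line -> skip
        }
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(trimmed.getBytes(StandardCharsets.UTF_8)));
            // Assumes a <name> child element; adjust to your real schema.
            return doc.getElementsByTagName("name").item(0).getTextContent();
        } catch (Exception e) {
            return null; // not well-formed XML -> skip
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStudentName("<student><name>Alice</name></student>"));
        System.out.println(parseStudentName("<Class>"));
    }
}
```

Lines like `<Class>` and `</Class>` simply fail the prefix/suffix check and are dropped, which is exactly the "filter out all others" step.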