简体   繁体   中英

How to parse a XML file located in HDFS and append the child nodes

I am trying to log the counters and bad records of each MR job into an XML file which should be stored in HDFS. I created a class with static function called LogMessage(), so that all the MR jobs will call this function. Whenever each MR job calls the LogMessage() it has to append the child nodes(attritbutes in my case) in the xml file, if it already exists.

The problem here is I am unable the parse the XML file that is stored in HDFS to append the new child nodes.

I am not using XMLInputFormat Reader because, this logging doesnot need any mapreduce program.

What I have tried is

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException, ParserConfigurationException, TransformerException, SAXException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }

        DocumentBuilderFactory documentBuilderFactory =DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder =documentBuilderFactory.newDocumentBuilder();
        Document document;

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {           
            fos = fs.create(outPath);
            document = documentBuilder.newDocument();
        }
        else
        {   
            fos= fs.append(outPath);
        }

        final String root = "TransactionLog";
        final String attribute = "Attributes";
        final String elementTS ="TS"; 
        final String elementTSD ="TSD";

        Element rootElement = document.createElement(root); // <TransactionLog>
        document.appendChild(rootElement);

        Element subrootElement = document.createElement(attribute); // <Attributes>
        rootElement.appendChild(subrootElement);

        Element ts = document.createElement(elementTS);  // <TS>
        ts.appendChild(document.createTextNode(Component));
        subrootElement.appendChild(ts);

        Element tsd = document.createElement(elementTSD);  // <TSD>
        BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd=null;
                String badRcrdInputline = br.readLine();
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    tsd.appendChild(document.createTextNode(writeBadRcrd));
                    badRcrdInputline = br.readLine(); //Read the next line to avoid infinite loop

                }


        }
        catch(Exception e)
        {
        }
        finally
        {
              br.close();
        }
        subrootElement.appendChild(tsd);

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result =  new StreamResult(new StringWriter()); //Read the generated XML and write into HDFS
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
        transformer.transform(source, result);

        try
        {       
                String xmlString = result.getWriter().toString();
                fos.writeBytes(xmlString+"\n");
        }
        catch(Exception e)
        {
        }
        finally
        {
             fos.close();
        }
        return 0;
    }
}

The Output I am getting when the function is called 2 times;

 <?xml version="1.0" encoding="UTF-8" standalone="no"?> <TransactionLog> <Attributes> <TS>AccumulationRule</TS> <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD> </Attributes> </TransactionLog> <?xml version="1.0" encoding="UTF-8" standalone="no"?> <TransactionLog> <Attributes> <TS>ApplyMathRule</TS> <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD> </Attributes> </TransactionLog> 

What I need is :

 <?xml version="1.0" encoding="UTF-8" standalone="no"?> <TransactionLog> <Attributes> <TS>AccumulationRule</TS> <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD> </Attributes> <Attributes> <TS>AccumulationRule</TS> <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD> </Attributes> </TransactionLog> 

I tried documentBuilder.parse(), but it is unable to parse the file in HDFS, instead searching for a file in local FS.

Please provide some suggestions.

EDIT 1: Instead of trying XML DOM, I tried to create a normal text file as a XML file. Below is the code

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }
        String xmlHead = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
        String xmlTransactionLogBegin = "<TransactionLog>";
        String xmlTransactionLogEnd = "</TransactionLog>";
        String xmlAttribBegin = "\t<Attributes>";
        String xmlAttribEnd = "\t</Attributes>";

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {   
            fos = fs.create(outPath);
            fos.writeBytes(xmlHead+"\n");
            fos.writeBytes(xmlTransactionLogBegin);
        }
        else 
        {
            fos= fs.append(outPath);
        }

        BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd;
                String badRcrdInputline = br.readLine();
                fos.writeBytes("\n"+xmlAttribBegin+"\n");
                fos.writeBytes("\t\t<TSD>");
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    fos.writeBytes(writeBadRcrd);
                    badRcrdInputline = br.readLine(); //Read the next line to avoid infinite loop

                }
                fos.writeBytes("</TSD>\n");
                fos.writeBytes(xmlAttribEnd+"\n");
                fos.writeBytes(xmlTransactionLogEnd);
        }
        catch(Exception e)
        {
        }
        finally
        {
              br.close();
              fos.close();
        }

        return 0;
    }
}

Problem with this code is unable to handle the </TransationLog> . Output what I am getting is

 <?xml version="1.0" encoding="UTF-8" standalone="no"?> <TransactionLog> <Attributes> <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD> </Attributes> </TransactionLog> <Attributes> <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD> </Attributes> </TransactionLog> 

The above code is just like any normal java code. Please suggest on handling the that last line ( </TransactionLog> ), before appending the new data.

Whenever each MR job calls the LogMessage() it has to append the child nodes(attritbutes in my case) in the xml file, if it already exists.

Are there multiple MR jobs running concurrently that will try to do this? If so, then the design could be problematic. HDFS has "single writer" semantics, so only one process can hold a file open for writing at a time. Other process that attempt to open the file for writing/appending would experience an error.

The problem here is I am unable the parse the XML file that is stored in HDFS to append the new child nodes.

Your current code is generating a whole new XML document for every message, and then appending that whole XML document to the end of the file. This explains the current output that you're seeing. To achieve the desired output, you would need to build the new XML, parse the entire existing document, and then call Node#appendChild(Node) to add the new XML as a child of the <TransactionLog> node.

This would be extremely inefficient though. Instead, can you consider appending just the <Attributes> nodes, and then wrap the whole file in the containing <TransactionLog> element later as a post-processing step?

I tried documentBuilder.parse(), but it is unable to parse the file in HDFS, instead searching for a file in local FS.

DocumentBuilder#parse(File) cannot accept an HDFS file. Like you saw, it will interpret this as referring to the local file system. It would be able to work if you called Hadoop's FileSystem#open(Path) and passed the resulting stream to either DocumentBuilder#parse(InputStream) or DocumentBuilder#parse(InputSource) . However, I still recommend against doing this in your design as per the performance considerations above.

Consider conditionally separating the root element build and transformer output by some indicator of the first and last MR JOB call.

In this approach, the document object keeps appending root children nodes and grows until you finally export the string to fos.writeBytes(); . Of course, you will need to pass firstjob and lastjob in as parameters per function call. Maybe you could use job numbers to indicate first and last.

    // ONLY FIRST JOB
    if (firstjob) {
       Element rootElement = document.createElement(root); // <TransactionLog>
       document.appendChild(rootElement);        
    }

    // ALL JOBS (DOCUMENT OBJECT GROWS WITH EACH FUNCTION CALL)
    Element subrootElement = document.createElement(attribute); // <Attributes>
    rootElement.appendChild(subrootElement);

    ...rest of XML build...


    // ONLY LAST JOB
    if (lastjob) {
       TransformerFactory transformerFactory = TransformerFactory.newInstance();
       Transformer transformer = transformerFactory.newTransformer();
       DOMSource source = new DOMSource(document);
       StreamResult result =  new StreamResult(new StringWriter()); //Read the generated XML and write into HDFS
       transformer.setOutputProperty(OutputKeys.INDENT, "yes");
       transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
       transformer.transform(source, result);

       try {       
            String xmlString = result.getWriter().toString();
            fos.writeBytes(xmlString+"\n");
       } catch(Exception e) {

       } finally {
            fos.close();
       }

       return 0;
   }

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM