
How to parse an XML file located in HDFS and append the child nodes

I am trying to log the counters and bad records of each MR job into an XML file, which should be stored in HDFS. I created a class with a static function called LogMessage(), so that all the MR jobs call this function. Whenever an MR job calls LogMessage(), it has to append the child nodes (attributes in my case) to the XML file if it already exists.

The problem here is that I am unable to parse the XML file stored in HDFS in order to append the new child nodes.

I am not using an XMLInputFormat reader because this logging does not need any MapReduce program.

What I have tried:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException, ParserConfigurationException, TransformerException, SAXException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }

        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
        Document document;

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {           
            fos = fs.create(outPath);
            document = documentBuilder.newDocument();
        }
        else
        {
            fos = fs.append(outPath);
            document = documentBuilder.newDocument(); // needed to compile; note a brand-new document is still built on every call
        }

        final String root = "TransactionLog";
        final String attribute = "Attributes";
        final String elementTS = "TS";
        final String elementTSD = "TSD";

        Element rootElement = document.createElement(root); // <TransactionLog>
        document.appendChild(rootElement);

        Element subrootElement = document.createElement(attribute); // <Attributes>
        rootElement.appendChild(subrootElement);

        Element ts = document.createElement(elementTS);  // <TS>
        ts.appendChild(document.createTextNode(Component));
        subrootElement.appendChild(ts);

        Element tsd = document.createElement(elementTSD);  // <TSD>
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd=null;
                String badRcrdInputline = br.readLine();
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    tsd.appendChild(document.createTextNode(writeBadRcrd));
                    badRcrdInputline = br.readLine(); // read the next line to avoid an infinite loop
                }
        }
        catch (Exception e)
        {
                e.printStackTrace(); // do not swallow the exception silently
        }
        finally
        {
              br.close();
        }
        subrootElement.appendChild(tsd);

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(new StringWriter()); // serialize the DOM to a string; written to HDFS below
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
        transformer.transform(source, result);

        try
        {
                String xmlString = result.getWriter().toString();
                fos.writeBytes(xmlString + "\n");
        }
        catch (Exception e)
        {
                e.printStackTrace(); // do not swallow the exception silently
        }
        finally
        {
             fos.close();
        }
        return 0;
    }
}

The output I am getting when the function is called 2 times:

 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <TransactionLog>
      <Attributes>
           <TS>AccumulationRule</TS>
           <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
 </TransactionLog>
 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <TransactionLog>
      <Attributes>
           <TS>ApplyMathRule</TS>
           <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
 </TransactionLog>

What I need is:

 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <TransactionLog>
      <Attributes>
           <TS>AccumulationRule</TS>
           <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
      <Attributes>
           <TS>AccumulationRule</TS>
           <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
 </TransactionLog>

I tried documentBuilder.parse(), but it is unable to parse the file in HDFS; instead it searches for a file on the local FS.

Please provide some suggestions.

EDIT 1: Instead of using the XML DOM, I tried to write the XML out as a normal text file. Below is the code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }
        String xmlHead = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
        String xmlTransactionLogBegin = "<TransactionLog>";
        String xmlTransactionLogEnd = "</TransactionLog>";
        String xmlAttribBegin = "\t<Attributes>";
        String xmlAttribEnd = "\t</Attributes>";

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {   
            fos = fs.create(outPath);
            fos.writeBytes(xmlHead+"\n");
            fos.writeBytes(xmlTransactionLogBegin);
        }
        else
        {
            fos = fs.append(outPath);
        }

        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd;
                String badRcrdInputline = br.readLine();
                fos.writeBytes("\n"+xmlAttribBegin+"\n");
                fos.writeBytes("\t\t<TSD>");
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";", "|");
                    fos.writeBytes(writeBadRcrd);
                    badRcrdInputline = br.readLine(); // read the next line to avoid an infinite loop
                }
                fos.writeBytes("</TSD>\n");
                fos.writeBytes(xmlAttribEnd + "\n");
                fos.writeBytes(xmlTransactionLogEnd);
        }
        catch (Exception e)
        {
                e.printStackTrace(); // do not swallow the exception silently
        }
        finally
        {
              br.close();
              fos.close();
        }

        return 0;
    }
}

The problem with this code is that it cannot handle the closing </TransactionLog>. The output I am getting is:

 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <TransactionLog>
     <Attributes>
         <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
     </Attributes>
 </TransactionLog>
     <Attributes>
         <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
     </Attributes>
 </TransactionLog>

The above code is just like any normal Java code. Please suggest how to handle that last line (</TransactionLog>) before appending the new data.

Whenever an MR job calls LogMessage(), it has to append the child nodes (attributes in my case) to the XML file if it already exists.

Are there multiple MR jobs running concurrently that will try to do this? If so, the design could be problematic. HDFS has "single writer" semantics, so only one process can hold a file open for writing at a time. Other processes that attempt to open the file for writing/appending will experience an error.
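A minimal sketch of what that failure mode looks like in practice (this retry loop is my own illustration, not part of your code): a concurrent append typically fails with an IOException, such as AlreadyBeingCreatedException, while another writer holds the lease.

    // Illustrative only: retry an append while another process holds the HDFS lease.
    FSDataOutputStream out = null;
    for (int attempt = 0; attempt < 3 && out == null; attempt++) {
        try {
            out = fs.append(outPath);
        } catch (IOException e) {
            // Another writer likely holds the lease; back off and retry.
            try { Thread.sleep(1000); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
        }
    }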

The problem here is I am unable to parse the XML file that is stored in HDFS to append the new child nodes.

Your current code is generating a whole new XML document for every message and then appending that whole document to the end of the file. This explains the output you're seeing. To achieve the desired output, you would need to build the new XML, parse the entire existing document, and then call Node#appendChild(Node) to add the new XML as a child of the <TransactionLog> node.
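A rough sketch of that parse-and-append flow, reusing your fs, outPath, subrootElement, and transformer variables (the rest of the names are illustrative):

    // Sketch: parse the existing log from HDFS, graft the new <Attributes>
    // node onto its root, then rewrite the whole file, because HDFS files
    // cannot be edited in place.
    Document existing;
    try (FSDataInputStream in = fs.open(outPath)) {
        existing = documentBuilder.parse(in);                  // FSDataInputStream is an InputStream
    }
    Node imported = existing.importNode(subrootElement, true); // deep copy of the new <Attributes>
    existing.getDocumentElement().appendChild(imported);       // child of <TransactionLog>
    try (FSDataOutputStream out = fs.create(outPath, true)) {  // true = overwrite
        transformer.transform(new DOMSource(existing), new StreamResult(out));
    }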

This would be extremely inefficient, though. Instead, could you consider appending just the <Attributes> nodes, and then wrapping the whole file in the containing <TransactionLog> element later as a post-processing step?
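Such a post-processing step could be as simple as the following sketch (logging-final.xml is a hypothetical output name):

    // Sketch: wrap the accumulated <Attributes> fragments in a root element
    // once all jobs have finished.
    Path fragments = new Path("hdfs://nameservice1/user/abhime01/logging.xml");
    Path wrapped   = new Path("hdfs://nameservice1/user/abhime01/logging-final.xml"); // assumed name
    try (FSDataInputStream in = fs.open(fragments);
         FSDataOutputStream out = fs.create(wrapped, true)) {
        out.writeBytes("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n<TransactionLog>\n");
        org.apache.hadoop.io.IOUtils.copyBytes(in, out, 4096, false);
        out.writeBytes("\n</TransactionLog>");
    }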

I tried documentBuilder.parse(), but it is unable to parse the file in HDFS; instead it searches for a file on the local FS.

DocumentBuilder#parse(File) cannot accept an HDFS file. As you saw, it interprets the path as referring to the local file system. It would work if you called Hadoop's FileSystem#open(Path) and passed the resulting stream to either DocumentBuilder#parse(InputStream) or DocumentBuilder#parse(InputSource). However, I still recommend against doing this in your design, per the performance considerations above.
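For reference, a minimal sketch of that call, reusing the fs and documentBuilder objects from your code:

    // Hand the parser an open HDFS stream instead of a java.io.File, so the
    // path is resolved against HDFS rather than the local file system.
    try (FSDataInputStream in = fs.open(new Path("hdfs://nameservice1/user/abhime01/logging.xml"))) {
        Document doc = documentBuilder.parse(in);
    }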

Consider conditionally separating the root-element build and the transformer output by some indicator of the first and last MR job call.

In this approach, the document object keeps accumulating root child nodes and grows until you finally export the string with fos.writeBytes(). Of course, you will need to pass firstjob and lastjob in as parameters on each function call; maybe you could use job numbers to indicate the first and last.

    // ONLY FIRST JOB: create the <TransactionLog> root once
    if (firstjob) {
       document.appendChild(document.createElement(root)); // <TransactionLog>
    }

    // ALL JOBS (DOCUMENT OBJECT GROWS WITH EACH FUNCTION CALL)
    Element rootElement = document.getDocumentElement();        // fetch <TransactionLog>
    Element subrootElement = document.createElement(attribute); // <Attributes>
    rootElement.appendChild(subrootElement);

    // ...rest of XML build...


    // ONLY LAST JOB
    if (lastjob) {
       TransformerFactory transformerFactory = TransformerFactory.newInstance();
       Transformer transformer = transformerFactory.newTransformer();
       DOMSource source = new DOMSource(document);
       StreamResult result = new StreamResult(new StringWriter()); // serialize the DOM to a string; written to HDFS below
       transformer.setOutputProperty(OutputKeys.INDENT, "yes");
       transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
       transformer.transform(source, result);

       try {       
            String xmlString = result.getWriter().toString();
            fos.writeBytes(xmlString+"\n");
       } catch (Exception e) {
            e.printStackTrace(); // do not swallow the exception silently
       } finally {
            fos.close();
       }

       return 0;
   }
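
A hypothetical call sequence, assuming the signature is extended to LogMessage(String record, String component, boolean firstjob, boolean lastjob):

    // Hypothetical driver code; the extended signature is an assumption.
    LoggingCounter.LogMessage(record, "AccumulationRule", true,  false); // first job: create the root
    LoggingCounter.LogMessage(record, "ApplyMathRule",    false, true);  // last job: serialize and write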
