How to parse an XML file located in HDFS and append child nodes

Question

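I am trying to log the counters and bad records of each MR job to an XML file that should be stored in HDFS. I created a class with a static function LogMessage() so that all MR jobs can call it. Whenever an MR job calls LogMessage(), it has to append a child node (Attributes, in my case) to the XML file if the file already exists.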

The problem is that I am unable to parse the XML file stored in HDFS in order to append the new child nodes.

I am not using the XMLInputFormat reader, because this logging does not need any MapReduce program.

What I have tried:

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException, ParserConfigurationException, TransformerException, SAXException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }

        DocumentBuilderFactory documentBuilderFactory =DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder =documentBuilderFactory.newDocumentBuilder();
        Document document;

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {           
            fos = fs.create(outPath);
            document = documentBuilder.newDocument();
        }
        else
        {   
            fos= fs.append(outPath);
        }

        final String root = "TransactionLog";
        final String attribute = "Attributes";
        final String elementTS ="TS"; 
        final String elementTSD ="TSD";

        Element rootElement = document.createElement(root); // <TransactionLog>
        document.appendChild(rootElement);

        Element subrootElement = document.createElement(attribute); // <Attributes>
        rootElement.appendChild(subrootElement);

        Element ts = document.createElement(elementTS);  // <TS>
        ts.appendChild(document.createTextNode(Component));
        subrootElement.appendChild(ts);

        Element tsd = document.createElement(elementTSD);  // <TSD>
        BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd=null;
                String badRcrdInputline = br.readLine();
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    tsd.appendChild(document.createTextNode(writeBadRcrd));
                    badRcrdInputline = br.readLine(); //Read the next line to avoid infinite loop

                }


        }
        catch(Exception e)
        {
        }
        finally
        {
              br.close();
        }
        subrootElement.appendChild(tsd);

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result =  new StreamResult(new StringWriter()); //Read the generated XML and write into HDFS
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
        transformer.transform(source, result);

        try
        {       
                String xmlString = result.getWriter().toString();
                fos.writeBytes(xmlString+"\n");
        }
        catch(Exception e)
        {
        }
        finally
        {
             fos.close();
        }
        return 0;
    }
}

The output I get when the function is called twice:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
        <Attributes>
            <TS>AccumulationRule</TS>
            <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
        </Attributes>
    </TransactionLog>
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
        <Attributes>
            <TS>ApplyMathRule</TS>
            <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
        </Attributes>
    </TransactionLog>

What I need is:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
        <Attributes>
            <TS>AccumulationRule</TS>
            <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
        </Attributes>
        <Attributes>
            <TS>AccumulationRule</TS>
            <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
        </Attributes>
    </TransactionLog>

I tried documentBuilder.parse(), but it cannot parse the file in HDFS; it looks for the file in the local FS instead.

Please provide some suggestions.

Edit 1: Instead of using the XML DOM, I tried building the XML file as a plain text file. The code is below:

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }
        String xmlHead = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
        String xmlTransactionLogBegin = "<TransactionLog>";
        String xmlTransactionLogEnd = "</TransactionLog>";
        String xmlAttribBegin = "\t<Attributes>";
        String xmlAttribEnd = "\t</Attributes>";

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {   
            fos = fs.create(outPath);
            fos.writeBytes(xmlHead+"\n");
            fos.writeBytes(xmlTransactionLogBegin);
        }
        else 
        {
            fos= fs.append(outPath);
        }

        BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd;
                String badRcrdInputline = br.readLine();
                fos.writeBytes("\n"+xmlAttribBegin+"\n");
                fos.writeBytes("\t\t<TSD>");
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    fos.writeBytes(writeBadRcrd);
                    badRcrdInputline = br.readLine(); //Read the next line to avoid infinite loop

                }
                fos.writeBytes("</TSD>\n");
                fos.writeBytes(xmlAttribEnd+"\n");
                fos.writeBytes(xmlTransactionLogEnd);
        }
        catch(Exception e)
        {
        }
        finally
        {
              br.close();
              fos.close();
        }

        return 0;
    }
}

The problem with this code is that it cannot handle the closing </TransactionLog> tag. The output I get is:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
        <Attributes>
            <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
        </Attributes>
    </TransactionLog>
        <Attributes>
            <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
        </Attributes>
    </TransactionLog>

The code above is just like any ordinary Java code. Please suggest how to handle the last line (</TransactionLog>) before appending new data.

Answer 1

"Whenever an MR job calls LogMessage(), it has to append a child node (Attributes, in my case) to the XML file if the file already exists."

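Are there multiple MR jobs running concurrently that would try to do this? If so, the design could be problematic. HDFS has "single writer" semantics, so only one process at a time can hold a file open for writing. Other processes that try to open the file for write/append will encounter an error.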

"The problem is that I am unable to parse the XML file stored in HDFS in order to append the new child nodes."

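Your current code generates a whole new XML document for each message and then appends that entire document to the end of the file. This explains the output you are seeing. To get the desired output, you would need to build the new XML, parse the entire existing document, and then call Node#appendChild(Node) to add the new XML as a child of the <TransactionLog> node.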

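However, this would be extremely inefficient. Instead, could you consider appending just the <Attributes> nodes, and then wrapping the whole file in a containing <TransactionLog> element as a post-processing step?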
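
As a rough illustration of that post-processing step, here is a minimal sketch that uses only JDK classes. The class name TransactionLogWrapper and the method wrap are hypothetical names chosen for this example. It assumes the appended file contains nothing but well-formed <Attributes> elements (no XML declaration, text already XML-escaped) and that the streams are supplied by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: wraps a stream of <Attributes> fragments in a single root element.
final class TransactionLogWrapper {

    static void wrap(InputStream fragments, OutputStream out) throws IOException {
        // Write the XML declaration and the opening root tag once.
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<TransactionLog>\n"
                .getBytes(StandardCharsets.UTF_8));

        // Copy the appended fragments through unchanged.
        byte[] buffer = new byte[8192];
        int n;
        while ((n = fragments.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }

        // Close the root element so the result is a single well-formed document.
        out.write("</TransactionLog>\n".getBytes(StandardCharsets.UTF_8));
    }
}
```

With Hadoop, the input stream would presumably come from FileSystem#open(Path) on the fragments file and the output stream from FileSystem#create(Path) on a separate result file, so each logging call remains a cheap append.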

"I tried documentBuilder.parse(), but it cannot parse the file in HDFS; it looks for the file in the local FS instead."

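DocumentBuilder#parse(File) cannot accept an HDFS file. As you saw, it interprets the argument as referring to the local file system. If you call Hadoop's FileSystem#open(Path) and pass the resulting stream to DocumentBuilder#parse(InputStream) or DocumentBuilder#parse(InputSource), it will work. However, given the performance considerations above, I still recommend against doing this in your design.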
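
For completeness, the stream-based parse could look roughly like the sketch below. It is written against plain java.io streams so that it does not depend on Hadoop being on the classpath. The class name XmlLogAppender and its two methods are hypothetical names for this example, and the element names are taken from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical helper: parse an existing log from a stream and add one entry to it.
final class XmlLogAppender {

    // Parses the whole existing document, then appends a new <Attributes> child to the root.
    static Document parseAndAppend(InputStream in, String component, String data)
            throws ParserConfigurationException, SAXException, IOException {
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Element root = document.getDocumentElement(); // <TransactionLog>

        Element attributes = document.createElement("Attributes");
        Element ts = document.createElement("TS");
        ts.appendChild(document.createTextNode(component));
        Element tsd = document.createElement("TSD");
        tsd.appendChild(document.createTextNode(data));
        attributes.appendChild(ts);
        attributes.appendChild(tsd);
        root.appendChild(attributes);
        return document;
    }

    // Serializes the full document; the caller decides where the bytes go.
    static void write(Document document, OutputStream out) throws TransformerException {
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(document), new StreamResult(out));
    }
}
```

With Hadoop, one would presumably pass the stream returned by FileSystem#open(Path) to parseAndAppend, close it, and only then pass a stream from FileSystem#create(Path, boolean) with overwrite enabled to write. The whole file is rewritten on every call, which is the inefficiency mentioned above.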

Answer 2

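Consider separating the root element creation from the transformer output, and running each conditionally based on indicators for the first and last MR job calls.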

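With this approach, the document object keeps growing as child nodes are appended to the root, until the string is finally exported with fos.writeBytes(). Of course, you need to pass firstjob and lastjob as parameters on every function call. Perhaps you can use the job number to indicate the first and the last.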

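    // ONLY FIRST JOB: create the root; later calls reuse the existing one
    Element rootElement;
    if (firstjob) {
       rootElement = document.createElement(root); // <TransactionLog>
       document.appendChild(rootElement);
    } else {
       rootElement = document.getDocumentElement();
    }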

    // ALL JOBS (DOCUMENT OBJECT GROWS WITH EACH FUNCTION CALL)
    Element subrootElement = document.createElement(attribute); // <Attributes>
    rootElement.appendChild(subrootElement);

    ...rest of XML build...


    // ONLY LAST JOB
    if (lastjob) {
       TransformerFactory transformerFactory = TransformerFactory.newInstance();
       Transformer transformer = transformerFactory.newTransformer();
       DOMSource source = new DOMSource(document);
       StreamResult result =  new StreamResult(new StringWriter()); //Read the generated XML and write into HDFS
       transformer.setOutputProperty(OutputKeys.INDENT, "yes");
       transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
       transformer.transform(source, result);

       try {       
            String xmlString = result.getWriter().toString();
            fos.writeBytes(xmlString+"\n");
       } catch(Exception e) {

       } finally {
            fos.close();
       }

       return 0;
   }
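
The same idea can be sketched as a small self-contained class, under the assumption that every logging call happens in the same JVM (for example, in a single driver program), so that the Document object survives between calls. TransactionLogBuilder, addEntry and toXml are hypothetical names for this example. Creating the root in the constructor and serializing on demand stand in for the firstjob and lastjob flags.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: accumulates entries in memory and serializes them once at the end.
final class TransactionLogBuilder {
    private final Document document;
    private final Element root;

    // Equivalent of the "first job" step: the root element is created exactly once.
    TransactionLogBuilder() throws ParserConfigurationException {
        document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        root = document.createElement("TransactionLog");
        document.appendChild(root);
    }

    // Called for every job: adds one <Attributes> entry under the shared root.
    void addEntry(String component, String data) {
        Element attributes = document.createElement("Attributes");
        Element ts = document.createElement("TS");
        ts.appendChild(document.createTextNode(component));
        Element tsd = document.createElement("TSD");
        tsd.appendChild(document.createTextNode(data));
        attributes.appendChild(ts);
        attributes.appendChild(tsd);
        root.appendChild(attributes);
    }

    // Equivalent of the "last job" step: serializes the accumulated document.
    String toXml() throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }
}
```

The string returned by toXml() would then be written to HDFS once, for instance through the FSDataOutputStream obtained in the question's code.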
