How to parse an XML file located in HDFS and append child nodes

Question

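I am trying to log the counters and bad records of each MR job to an XML file that should be stored in HDFS. I created a class with a static function LogMessage() so that all MR jobs can call it. Whenever an MR job calls LogMessage(), it has to append a child node (Attributes, in my case) to the XML file if the file already exists.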

The problem is that I am unable to parse the XML file stored in HDFS in order to append the new child nodes.

I am not using the XMLInputFormat reader, because this logging does not need any MapReduce program.

What I have tried is:

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException, ParserConfigurationException, TransformerException, SAXException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }

        DocumentBuilderFactory documentBuilderFactory =DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder =documentBuilderFactory.newDocumentBuilder();
        Document document;

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {           
            fos = fs.create(outPath);
            document = documentBuilder.newDocument();
        }
        else
        {   
            fos= fs.append(outPath);
        }

        final String root = "TransactionLog";
        final String attribute = "Attributes";
        final String elementTS ="TS"; 
        final String elementTSD ="TSD";

        Element rootElement = document.createElement(root); // <TransactionLog>
        document.appendChild(rootElement);

        Element subrootElement = document.createElement(attribute); // <Attributes>
        rootElement.appendChild(subrootElement);

        Element ts = document.createElement(elementTS);  // <TS>
        ts.appendChild(document.createTextNode(Component));
        subrootElement.appendChild(ts);

        Element tsd = document.createElement(elementTSD);  // <TSD>
        BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd=null;
                String badRcrdInputline = br.readLine();
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    tsd.appendChild(document.createTextNode(writeBadRcrd));
                    badRcrdInputline = br.readLine(); //Read the next line to avoid infinite loop

                }


        }
        catch(Exception e)
        {
        }
        finally
        {
              br.close();
        }
        subrootElement.appendChild(tsd);

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result =  new StreamResult(new StringWriter()); //Read the generated XML and write into HDFS
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
        transformer.transform(source, result);

        try
        {       
                String xmlString = result.getWriter().toString();
                fos.writeBytes(xmlString+"\n");
        }
        catch(Exception e)
        {
        }
        finally
        {
             fos.close();
        }
        return 0;
    }
}

The output I get when the function is called twice:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
      <Attributes>
        <TS>AccumulationRule</TS>
        <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
    </TransactionLog>
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
      <Attributes>
        <TS>ApplyMathRule</TS>
        <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
    </TransactionLog>

What I need is:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
      <Attributes>
        <TS>AccumulationRule</TS>
        <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
      <Attributes>
        <TS>AccumulationRule</TS>
        <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
    </TransactionLog>

I tried documentBuilder.parse(), but it cannot parse the file in HDFS; instead it looks for the file in the local FS.

Please provide some suggestions.

Edit 1: Instead of using the XML DOM, I tried creating the XML file as a plain text file. Below is the code:

public final class LoggingCounter {

    public static int  LogMessage (String Record, String Component ) throws IOException
    { 
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inPath = new Path("hdfs://nameservice1/user/abhime01/haadoop/Rules/AccumulationRule/op/BadMapper-m-00000");
        Path outPath = new Path("hdfs://nameservice1/user/abhime01/logging.xml");

        if (!fs.exists(inPath))
        {
            System.err.println("Input Path " + inPath.toString() + " does not exist.");
            return 1;
        }
        String xmlHead = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
        String xmlTransactionLogBegin = "<TransactionLog>";
        String xmlTransactionLogEnd = "</TransactionLog>";
        String xmlAttribBegin = "\t<Attributes>";
        String xmlAttribEnd = "\t</Attributes>";

        FSDataOutputStream fos;
        if (!fs.exists(outPath))
        {   
            fos = fs.create(outPath);
            fos.writeBytes(xmlHead+"\n");
            fos.writeBytes(xmlTransactionLogBegin);
        }
        else 
        {
            fos= fs.append(outPath);
        }

        BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(inPath)));
        try
        {       
                String writeBadRcrd;
                String badRcrdInputline = br.readLine();
                fos.writeBytes("\n"+xmlAttribBegin+"\n");
                fos.writeBytes("\t\t<TSD>");
                while (badRcrdInputline != null)
                {
                    writeBadRcrd = badRcrdInputline.replaceAll(";","|");
                    fos.writeBytes(writeBadRcrd);
                    badRcrdInputline = br.readLine(); //Read the next line to avoid infinite loop

                }
                fos.writeBytes("</TSD>\n");
                fos.writeBytes(xmlAttribEnd+"\n");
                fos.writeBytes(xmlTransactionLogEnd);
        }
        catch(Exception e)
        {
        }
        finally
        {
              br.close();
              fos.close();
        }

        return 0;
    }
}

The problem with this code is that it cannot handle the closing </TransactionLog> tag. The output I get is:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <TransactionLog>
      <Attributes>
        <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
    </TransactionLog>
      <Attributes>
        <TSD>113|3600024151|3|30|Watermelon|200|20151112|113|3600024151|23|100|Jujubi|201|20151113|113|3600024152|2|40|Blackberry|202|20151114|</TSD>
      </Attributes>
    </TransactionLog>

The above code is just like any ordinary Java code. Please suggest how to handle the last line (</TransactionLog>) before appending new data.

Answer 1

Whenever an MR job calls LogMessage(), it has to append a child node (Attributes, in my case) to the XML file if the file already exists.

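Are there multiple MR jobs running concurrently that will attempt to do this? If so, the design may be problematic. HDFS has "single writer" semantics, so only one process at a time can hold a file open for writing. Other processes that try to open the file for write or append will get an error.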

The problem is that I am unable to parse the XML file stored in HDFS in order to append the new child nodes.

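Your current code generates a completely new XML document for each message and then appends that whole document to the end of the file. This explains the output you are currently seeing. To get the desired output, you would need to build the new XML, parse the entire existing document, and then call Node#appendChild(Node) to add the new XML as a child of the <TransactionLog> node.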

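However, this would be extremely inefficient. Instead, could you consider appending only the <Attributes> nodes, and then wrapping the whole file in an enclosing <TransactionLog> element as a post-processing step?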
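A minimal sketch of that append-only idea, written against plain java.io readers and writers so it can be tried without a cluster. The class name FragmentLog and its methods are hypothetical, not part of Hadoop or of the question's code. On HDFS, the fragment string would presumably be written through the stream returned by fs.append(outPath), and the wrapping step would read the fragment file via fs.open(...) and write the final document to a separate file via fs.create(...).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

/** Hypothetical helper: append-only <Attributes> fragments, wrapped in a root element later. */
final class FragmentLog {

    private FragmentLog() {}

    /** Escapes the characters that must not appear literally in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds one <Attributes> fragment; this is the only thing appended per call. */
    static String attributesFragment(String component, String data) {
        return "  <Attributes>\n"
             + "    <TS>" + escape(component) + "</TS>\n"
             + "    <TSD>" + escape(data) + "</TSD>\n"
             + "  </Attributes>\n";
    }

    /** Post-processing step: copies all fragments into one well-formed document. */
    static void wrap(Reader fragments, Writer out) throws IOException {
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        out.write("<TransactionLog>\n");
        BufferedReader reader = new BufferedReader(fragments);
        String line;
        while ((line = reader.readLine()) != null) {
            out.write(line);
            out.write('\n');
        }
        out.write("</TransactionLog>\n");
        out.flush();
    }
}
```

Because the fragment file never contains the closing </TransactionLog> tag, each call only appends and nothing already written has to be parsed or rewritten.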

I tried documentBuilder.parse(), but it cannot parse the file in HDFS; instead it looks for the file in the local FS.

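DocumentBuilder#parse(File) cannot accept an HDFS file. As you have seen, it interprets the argument as referring to the local file system. If you call Hadoop's FileSystem#open(Path) and pass the resulting stream to DocumentBuilder#parse(InputStream) or DocumentBuilder#parse(InputSource), it will work. However, I still recommend against doing this in your design, because of the performance considerations above.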
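For completeness, here is a rough sketch of the stream-based parse-and-append described above. The helper XmlLogAppender is hypothetical and takes generic InputStream and OutputStream arguments; the assumption is that on a cluster the input would be the stream returned by fs.open(outPath) and the output a stream from fs.create(...) on a temporary path, since the whole document has to be rewritten rather than appended to.

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: parse an existing log from a stream and append one <Attributes> node. */
final class XmlLogAppender {

    private XmlLogAppender() {}

    static void appendAttributes(InputStream existing, OutputStream updated,
                                 String component, String data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(InputStream) reads from whatever stream it is given,
        // so it never tries to resolve a path on the local file system.
        Document document = builder.parse(existing);
        Element root = document.getDocumentElement(); // existing <TransactionLog>

        Element attributes = document.createElement("Attributes");
        Element ts = document.createElement("TS");
        ts.appendChild(document.createTextNode(component));
        attributes.appendChild(ts);
        Element tsd = document.createElement("TSD");
        tsd.appendChild(document.createTextNode(data));
        attributes.appendChild(tsd);
        root.appendChild(attributes);

        // Serialize the whole updated document to the output stream.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(new DOMSource(document), new StreamResult(updated));
    }
}
```

Every call re-reads and re-writes the entire log, which is the inefficiency mentioned earlier. On HDFS the temporary file would also have to replace the original afterwards, for example with FileSystem#delete followed by FileSystem#rename.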

Answer 2

Consider conditionally separating the root element build from the transformer output, using indicators for the first and the last MR job call.

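With this approach, the document object keeps having children appended to the root and keeps growing until the string is finally exported with fos.writeBytes(). Of course, you need to pass firstjob and lastjob as parameters in every function call. Perhaps you could use the job number to indicate the first and the last.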

    // ONLY FIRST JOB
    if (firstjob) {
       document.appendChild(document.createElement(root)); // <TransactionLog>
    }

    // ALL JOBS (DOCUMENT OBJECT GROWS WITH EACH FUNCTION CALL)
    // Look the root up here so it is in scope for every call, not only the first one.
    Element rootElement = document.getDocumentElement();
    Element subrootElement = document.createElement(attribute); // <Attributes>
    rootElement.appendChild(subrootElement);

    ...rest of XML build...


    // ONLY LAST JOB
    if (lastjob) {
       TransformerFactory transformerFactory = TransformerFactory.newInstance();
       Transformer transformer = transformerFactory.newTransformer();
       DOMSource source = new DOMSource(document);
       StreamResult result =  new StreamResult(new StringWriter()); //Read the generated XML and write into HDFS
       transformer.setOutputProperty(OutputKeys.INDENT, "yes");
       transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "5");
       transformer.transform(source, result);

       try {       
            String xmlString = result.getWriter().toString();
            fos.writeBytes(xmlString+"\n");
       } catch(Exception e) {

       } finally {
            fos.close();
       }

       return 0;
   }
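The same idea can be sketched as a small self-contained class. TransactionLogBuilder is a hypothetical name, and the sketch assumes that every call happens in the same JVM (for example from a single driver), since an in-memory Document is not shared between separate processes. The constructor plays the role of the firstjob branch and toXml() the role of the lastjob branch.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical accumulator: one Document grows across calls and is serialized once. */
final class TransactionLogBuilder {
    private final Document document;
    private final Element root;

    /** Equivalent of the "first job" step: create the single <TransactionLog> root. */
    TransactionLogBuilder() throws ParserConfigurationException {
        document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        root = document.createElement("TransactionLog");
        document.appendChild(root);
    }

    /** Called for every job: adds one <Attributes> child under the root. */
    void add(String component, String data) {
        Element attributes = document.createElement("Attributes");
        Element ts = document.createElement("TS");
        ts.appendChild(document.createTextNode(component));
        attributes.appendChild(ts);
        Element tsd = document.createElement("TSD");
        tsd.appendChild(document.createTextNode(data));
        attributes.appendChild(tsd);
        root.appendChild(attributes);
    }

    /** Equivalent of the "last job" step: serialize the whole document once. */
    String toXml() throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }
}
```

The string returned by toXml() would then be written to HDFS once, through a stream obtained from fs.create(outPath), instead of appending on every call.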

Answer 1
0 2016-01-04 18:40:27

Answer 2
0 2016-01-05 03:43:40