
SAXParseException question (how to strip any BOM characters from an xml file)

I have some data in an XML file and I am using the Processing library to parse through that file. I ran into the byte order mark (BOM) issue, which caused some errors to be thrown. I found a workaround elsewhere, but it is very slow: I'm using Apache Commons BOMInputStream to read the file as a sequence of bytes, after skipping the ones that represent the BOM.

I think that the source of my problem is actually my lack of knowledge about streams, readers and writers. There are so many different readers and writers and all kinds of "streams" (a word I barely understand) that I want to pull my hair out trying to figure out which one to use and how. I think I just picked the wrong implementation.

Question: Can someone show me why my code is so slow, and also help me improve my understanding of file I/O?

Code:

private static XML noBOM(String filename, PApplet p) throws FileNotFoundException, IOException{

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    File f = new File(filename);
    InputStream stream = new FileInputStream(f);
    BOMInputStream bomIn = new BOMInputStream(stream);

    int tmp = -1;
    while ((tmp = bomIn.read()) != -1){
        out.write(tmp);
    }

    String strXml = out.toString();
    return p.parseXML(strXml);
}

public static Map<String, Float> lifeExpectancyFromXML(String filename, PApplet p, 
        int year) throws FileNotFoundException, IOException{


    Map<String, Float> dataMap = new HashMap<>();

    XML xml = noBOM(filename, p);

    if(xml != null){

        XML[] records = xml.getChild("data").getChildren("record");

        for (XML record : records){
            XML[] fields = record.getChildren("field");

            String country = fields[0].getContent();
            int entryYear = fields[2].getIntContent();
            float lifeEx = fields[3].getFloatContent();

            if (entryYear == year){
                System.out.println("Country: " + country);
                System.out.println("Life Expectency: " + lifeEx);
                dataMap.put(country, lifeEx);
            }
        }
    } 
    else {
        System.out.println("String could not be parsed.");
    }

    return dataMap;
} 

The problem is probably that the InputStream is read one byte at a time. Try reading through a buffer to make it faster:

ByteArrayOutputStream out = new ByteArrayOutputStream();
try (BOMInputStream bis = new BOMInputStream(new FileInputStream(new File(filename)))) {
    byte[] buffer = new byte[1000];
    int n;
    // read() may fill only part of the buffer, so write just the n bytes read
    while ((n = bis.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
}

Updated:

If the whole buffer is written on each pass (out.write(buffer)) instead of only the bytes actually read, the resulting ByteArrayOutputStream may contain some empty bytes at the end. In that case, trim the resulting string to remove them:

out.toString("UTF-8").trim()
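
For readers who want to try the buffered copy without any extra libraries, here is a minimal sketch that uses only the JDK and assumes the file is UTF-8. The class name BomSkip and the method readWithoutUtf8Bom are hypothetical, and the BOM check is done by hand with a PushbackInputStream (looking for the bytes EF BB BF) rather than with Commons IO's BOMInputStream, so it does not cover UTF-16 or UTF-32 marks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

class BomSkip {
    // Hypothetical helper: reads the whole stream as UTF-8 text,
    // dropping a leading UTF-8 BOM (EF BB BF) if one is present.
    static String readWithoutUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);

        // Read up to three bytes; a single read() may return fewer.
        byte[] head = new byte[3];
        int got = 0;
        while (got < 3) {
            int r = pin.read(head, got, 3 - got);
            if (r == -1) {
                break;
            }
            got += r;
        }

        boolean hasBom = got == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && got > 0) {
            // Not a BOM: push the bytes back so they are read as content.
            pin.unread(head, 0, got);
        }

        // Buffered copy: write only the n bytes that read() returned.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = pin.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(readWithoutUtf8Bom(new ByteArrayInputStream(withBom))); // prints <a/>
    }
}
```

With a real file, the argument would be a FileInputStream, ideally opened in a try-with-resources block so it is closed even if parsing fails.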

My solution was to use BufferedReader instead of creating my own buffer. It made everything quite speedy:

private static XML noBOM(String path, PApplet p) throws 
            FileNotFoundException, UnsupportedEncodingException, IOException{

        //set default encoding
        String defaultEncoding = "UTF-8";

        //create BOMInputStream to get rid of any Byte Order Mark
        BOMInputStream bomIn = new BOMInputStream(new FileInputStream(path));

        //If BOM is present, determine encoding. If not, use UTF-8
        ByteOrderMark bom = bomIn.getBOM();
        String charSet = bom == null ? defaultEncoding : bom.getCharsetName();

        //get buffered reader for speed
        InputStreamReader reader = new InputStreamReader(bomIn, charSet);
        BufferedReader breader = new BufferedReader(reader);

        //Build string to parse into XML using Processing's PApplet.parseXML
        StringBuilder buildXML = new StringBuilder();
        int c;
        while((c = breader.read()) != -1){
            buildXML.append((char) c);
        }
        //closing the outermost reader also closes the streams it wraps
        breader.close();
        return p.parseXML(buildXML.toString());
    }
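
On the broader question of streams versus readers, the layering in the code above can be sketched roughly like this: an InputStream yields raw bytes, an InputStreamReader decodes those bytes into characters using a charset, and a BufferedReader wraps that so each read call is cheap. The example below uses only the JDK; the class name ReaderLayers and the method readAll are hypothetical, and it reads from an in-memory byte array instead of a file. It also copies characters in chunks through a char[] rather than one at a time, which is an optional variation on the loop above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderLayers {
    // Hypothetical helper: decodes all bytes from the stream into a String.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // bytes -> InputStreamReader (decoding) -> BufferedReader (buffering)
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                // Append only the n characters that were actually read.
                sb.append(chunk, 0, n);
            }
        } // try-with-resources closes the reader and the wrapped stream
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // \u00e9 is one character but two bytes in UTF-8.
        byte[] bytes = "<data>\u00e9</data>".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.println(bytes.length);  // prints 15
        System.out.println(text.length()); // prints 14
    }
}
```

The differing lengths are the point of the distinction: streams count bytes, readers count characters, and the charset passed to InputStreamReader decides how one maps to the other.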
