Illegal characters in XML - java

I'm creating a program which checks the legitimacy of a given URL. I've already created my own algorithm for this, but now I want to add PhishTank's services into my program.

They provide services where you can directly query a URL from their website, but they have set a certain quota on the number of queries you can make per day. The other option, which I'm going with, is to simply download their database and work with it locally, without restrictions.

The file you get is in XML, and I found some code to test with, but it seems their XML contains illegal characters (such as Unicode 0x07 -- the [BEL] character) inside CDATA, so the parser throws an exception.

<url><![CDATA[http://shaghaf-edu.com/sign-in/??msg=InvalidOnlineIdException&amp;id[BEL]da9ca9b23227a572d1fb5ff4ff91e3&amp;lpOlbResetErrorCounter=0l=&amp;request_locale=en-us]]></url>

I've done a bit of searching, and all I've found are solutions that seem fine for rather small XML files. The one I'm working with is close to 2.7 million lines -- I'm not sure how efficiently a regex or a char-by-char comparison would work in this case.

I should note that their database is updated hourly, and has to be redownloaded. So cleaning the file once manually isn't an option.

So I'm wondering if there is any fast and efficient way of solving this problem?

I don't have the exact code with me, but what I use is a very slight variation of this, which I found here on StackOverflow:

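import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

private void start() throws Exception
{
    // Open a connection to the XML resource.
    URL url = new URL("http://localhost:8080/AutoLogin/resource/web.xml");
    URLConnection connection = url.openConnection();

    // try-with-resources closes the stream once parsing is done.
    try (InputStream stream = connection.getInputStream())
    {
        Document doc = parseXML(stream);
        NodeList descNodes = doc.getElementsByTagName("description");

        // Print the text content of every <description> element.
        for (int i = 0; i < descNodes.getLength(); i++)
        {
            System.out.println(descNodes.item(i).getTextContent());
        }
    }
}

// Parses the whole stream into a DOM tree; parse() throws if the
// input contains characters that are illegal in XML.
private Document parseXML(InputStream stream) throws Exception
{
    DocumentBuilderFactory objDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder objDocumentBuilder = objDocumentBuilderFactory.newDocumentBuilder();

    return objDocumentBuilder.parse(stream);
}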

Answering by asking a question ...

Why not write a simple pre-processing utility?

It could read the XML file as is (line by line) and do whatever is required to turn that content into "correct" XML.

In other words: you should explicitly distinguish between the task of "preparing your input" and the task of "actually working with that XML input". This will also make fine-tuning much easier. If you find that regular expressions are too expensive, just change the pre-processor to not use them, and afterwards you can easily measure the effect on runtime ...
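As a rough illustration of such a pre-processor, here is a minimal sketch that avoids regular expressions and does a single char-by-char pass while the data is being read. The class name XmlCharFilterReader and the helper clean are made up for this example; the set of allowed characters follows the Char production of the XML 1.0 specification. It assumes that surrogate code units in the input form valid pairs and lets them through unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: a Reader that silently drops characters
// which are not allowed in XML 1.0 documents (e.g. 0x07, BEL).
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    // plus supplementary characters, which appear here as surrogates.
    // Assumption: surrogates are well-formed pairs, so they are kept.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip over illegal characters until a legal one or EOF.
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the legal characters to the front of the chunk.
            kept = 0;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[off + kept++] = buf[i];
                }
            }
            // If the whole chunk was illegal, read again instead of
            // returning 0, which callers could misread.
        } while (kept == 0);
        return kept;
    }

    // Convenience method for small inputs and for testing.
    static String clean(String s) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new XmlCharFilterReader(new StringReader(s))) {
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }
}
```

Because the filter is just a Reader, it could either be used to write a cleaned copy of the file, or be handed straight to the parser, for example with objDocumentBuilder.parse(new InputSource(new XmlCharFilterReader(new InputStreamReader(stream, StandardCharsets.UTF_8)))). The UTF-8 charset there is an assumption about the downloaded file, so check the encoding declared in its XML header. The work is one comparison per character, so the cost grows linearly with file size.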
