
Jsoup not translating ampersand in links in HTML

In Jsoup, the following test case should pass, but it does not.

@Test
public void shouldPrintHrefCorrectly(){
    String content=  "<li><a href=\"#\">Good</a><ul><li><a href=\"article.php?boid=1865&sid=53&mid=1\">" +
            "Boss</a></li><li><a href=\"article.php?boid=186&sid=53&mid=1\">" +
            "heavent</a></li><li><a href=\"article.php?boid=167&sid=53&mid=1\">" +
            "hellos</a></li><li><a href=\"article.php?boid=181&sid=53&mid=1\">" +
            "Mr.Jackson!</a></li>";

    Document document = Jsoup.parse(content, "http://www.google.co.in/");
    Elements links = document.select("a[href^=article]");
    Iterator<Element> iterator = links.iterator();
    List<String> urls = new ArrayList<String>();
    while(iterator.hasNext()){
        urls.add(iterator.next().attr("href"));
    }

    Assert.assertTrue(urls.contains("article.php?boid=181&sid=53&mid=1"));
}

Could anyone please explain why it is failing?

There are three problems:

  1. You're asserting that a bovikatanid parameter is present, while it's actually called boid .

  2. The HTML source uses & instead of &amp; to separate the query parameters. This is technically invalid.

  3. Jsoup is somehow parsing &mid as | , apparently treating it as the named entity &mid; even though the terminating ; is missing. It should have scanned until ; .
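
Point 3 can be illustrated with a minimal sketch of a decoder that only translates a named entity when it is terminated by ; . This is not Jsoup's implementation: the class name StrictEntityDecoder and its tiny entity table are hypothetical, chosen only to show the behaviour described above.

```java
import java.util.Map;

// Hypothetical illustration, not Jsoup code.
final class StrictEntityDecoder {
    // Tiny illustrative table; a real parser knows many more entity names.
    private static final Map<String, String> ENTITIES = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"",
            "mid", "\u2223");

    /** Decodes a known named entity only when it is terminated by ';'. */
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                // Scan the candidate entity name following the ampersand.
                int end = i + 1;
                while (end < input.length()
                        && Character.isLetterOrDigit(input.charAt(end))) {
                    end++;
                }
                boolean terminated = end < input.length() && input.charAt(end) == ';';
                String name = input.substring(i + 1, end);
                if (terminated && ENTITIES.containsKey(name)) {
                    out.append(ENTITIES.get(name));
                    i = end + 1; // skip past the ';'
                    continue;
                }
            }
            // Not a terminated, known entity: keep the character literally.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Bare ampersands are left alone, so &mid=1 survives intact.
        System.out.println(decode("article.php?boid=181&sid=53&mid=1"));
        // Properly escaped input decodes to the same URL.
        System.out.println(decode("article.php?boid=181&amp;sid=53&amp;mid=1"));
    }
}
```

With this rule, both the unescaped and the correctly escaped href decode to article.php?boid=181&sid=53&mid=1 .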

To fix #1, you have to correct the assertion yourself. To fix #2, you have to report the issue to the server admin in question. It's their fault; however, since the average browser is forgiving about this, I'd imagine that Google is doing it to save bandwidth. As for #3, I've reported an issue to the Jsoup guy to see what he thinks about it.
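
If you generate such links yourself, problem #2 can be avoided by escaping the URL before placing it in the href attribute. The following is a minimal sketch; HrefEscaper and escapeAttribute are hypothetical names, and it assumes the input is the raw, not yet escaped URL (escaping twice would produce &amp;amp; ).

```java
// Hypothetical helper, shown only to illustrate valid attribute escaping.
final class HrefEscaper {
    /** Escapes characters that are significant inside a double-quoted HTML attribute. */
    static String escapeAttribute(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "article.php?boid=181&sid=53&mid=1";
        // Prints: <a href="article.php?boid=181&amp;sid=53&amp;mid=1">Mr.Jackson!</a>
        System.out.println("<a href=\"" + escapeAttribute(url) + "\">Mr.Jackson!</a>");
    }
}
```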


Update: Jonathan (the Jsoup guy) has fixed it. It'll be there in the next release.

