JSoup not translating ampersand in links in html

Question

In JSoup, the following test case should pass, but it does not.

@Test
public void shouldPrintHrefCorrectly() {
    String content = "<li><a href=\"#\">Good</a><ul><li><a href=\"article.php?boid=1865&sid=53&mid=1\">" +
            "Boss</a></li><li><a href=\"article.php?boid=186&sid=53&mid=1\">" +
            "heavent</a></li><li><a href=\"article.php?boid=167&sid=53&mid=1\">" +
            "hellos</a></li><li><a href=\"article.php?boid=181&sid=53&mid=1\">" +
            "Mr.Jackson!</a></li>";

    Document document = Jsoup.parse(content, "http://www.google.co.in/");
    Elements links = document.select("a[href^=article]");

    // Collect the href attribute of every matching link.
    List<String> urls = new ArrayList<String>();
    for (Element link : links) {
        urls.add(link.attr("href"));
    }

    Assert.assertTrue(urls.contains("article.php?boid=181&sid=53&mid=1"));
}

Could any of you please give me the reason as to why it is failing?

Answer 1

There are three problems:

1. You're asserting that a bovikatanid parameter is present, while it's actually called boid.
2. The HTML source is using a bare & instead of &amp; in the href values. This is technically invalid.
3. Jsoup is somehow parsing &mid as |. It should have scanned until a ; before treating it as an entity.

To fix #1, you have to do it yourself. To fix #2, you have to report the issue to the server admin in question. It's their fault; however, since the average browser is forgiving about this, I'd imagine that Google is doing it to save bandwidth. To fix #3, I've reported an issue to the Jsoup guy to see what he thinks about this.
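
For #2, if you control the markup yourself, the fix is to write every bare & in an href as &amp;. Below is a minimal sketch in plain Java of what that could look like. The class AmpersandEscaper and its method escapeBareAmpersands are hypothetical names made up for this illustration; they are not part of Jsoup. The sketch assumes that anything shaped like &name; or &#123; is already an intended entity and leaves it alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Jsoup.
class AmpersandEscaper {

    // Matches an '&' that does NOT start a complete, semicolon-terminated
    // named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) reference.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Returns the attribute value with every bare '&' written as "&amp;".
    static String escapeBareAmpersands(String attributeValue) {
        return BARE_AMPERSAND.matcher(attributeValue).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String href = "article.php?boid=181&sid=53&mid=1";
        // Prints: <a href="article.php?boid=181&amp;sid=53&amp;mid=1">Mr.Jackson!</a>
        System.out.println("<a href=\"" + escapeBareAmpersands(href) + "\">Mr.Jackson!</a>");
    }
}
```

With the href written this way, &mid is no longer ambiguous, so a parser has no reason to read it as the start of an entity.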

Update: Jonathan (the Jsoup guy) has fixed it. It'll be there in the next release.

solution1
1 2011-01-25 12:42:30
