How to read a text from a web page with Java?

Question

I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");       

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s)
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

but this code gives me the HTML code of the web page. I want to get all of the text on the page, without the markup. How can I do this with Java?

Answer 1

You may want to have a look at jsoup for this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is an extract from one on the jsoup site.

Answer 2

Use jsoup.

You will be able to parse the content using CSS-style selectors.

For the page in the question, you can try:

    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    // first() returns null if nothing matches ".newsText"
    String textContents = doc.select(".newsText").first().text();

Answer 3

You can also use the HtmlCleaner library. Below is the code:

    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode node = cleaner.clean(url); // url is the java.net.URL of the page

    System.out.println(node.getText().toString());

Answer 4

    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }

Add at least e.printStackTrace() to these empty catch blocks. It will save you many days of your life.
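To make that concrete, here is a minimal sketch of the question's reading loop with the exceptions reported instead of swallowed. The class name PageReader and the methods read and readOrNull are made up for illustration, and decoding the response as UTF-8 is an assumption about the page's encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of any library.
final class PageReader {

    // Returns the raw content behind the URL (for a web page this is still HTML).
    static String read(URL url) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws.
        // UTF-8 is an assumption; a real page may declare another charset.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    // Reports failures instead of silently ignoring them; returns null on error.
    static String readOrNull(String address) {
        try {
            return read(new URL(address));
        } catch (MalformedURLException e) {
            e.printStackTrace(); // the address is not a usable URL
            return null;
        } catch (IOException e) {
            e.printStackTrace(); // connecting or reading failed
            return null;
        }
    }
}
```

With the stack trace printed, a mistyped address or a failed connection shows up on stderr right away instead of looking like an empty page.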

Answer 5

You would have to take the content you get with your current code, then parse it and look for the tags that contain the text you want. A SAX parser would be well suited to this job.
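One caveat: a strict SAX parser expects well-formed XML, which many web pages are not. As a rough sketch of the same callback-based idea, the block below swaps in the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) rather than SAX itself. The class name CallbackTextExtractor and the method extractText are made up for illustration. It collects every text run it is handed; to keep only a particular piece you would also override handleStartTag and handleEndTag and track which element you are in. How this parser reports script and style content is something to check against your own pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class for illustration; not part of any library.
final class CallbackTextExtractor {

    // Feeds the HTML to the JDK's HTML parser and gathers the text it reports.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text found between tags.
                text.append(data).append(' ');
            }
        };
        try {
            // true = ignore any charset directive; the input is already decoded.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse the whitespace left over from joining the text runs.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```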

Or, if you are not after a particular piece of text, simply remove all the tags so that you're left with only the text. I guess you could use a regular expression for that.
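A minimal sketch of the regular-expression route, using only java.util.regex. The class name TagStripper and the method stripTags are made up for illustration. It is a rough approach rather than a real HTML parser: it drops script and style blocks and comments, replaces every remaining tag with a space, and leaves entities such as &amp;amp; undecoded. Because each tag becomes a space, a word split by inline markup (ex<b>am</b>ple) comes out as separate pieces.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
final class TagStripper {

    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Remove script/style blocks first so their contents do not survive as text.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Replace each remaining tag with a space so neighbouring blocks stay apart.
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse runs of whitespace and trim the ends.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

For example, calling TagStripper.stripTags on the HTML string from Answer 1 gives "An example link."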

5 answers

solution1
15 ACCEPTED 2012-03-22 15:59:55

solution2
4 2012-03-22 15:59:22

solution3
0 2013-05-07 08:59:45

solution4
0 2022-01-11 07:52:08

solution5
0 2012-03-22 15:51:55
