Trouble ripping HTML page source in Java

Question

I'm trying to rip the HTML page source of a website to get an email. When I run the ripper (or dumper, or whatever you want to call it), it retrieves the source code but stops at line 160, even though the entire source is a little over 200 lines. I can get the full source manually by going to the webpage, right-clicking, choosing "View page source", and then parsing the text. The only problem is that there are over 100k pages, so doing that by hand is going to take a while.

Here's the code I'm using to get the page source:

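    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    // Class name is illustrative; the original snippet showed only main().
    public class PageRipper {
        public static void main(String[] args) throws IOException, InterruptedException {
            URL url = new URL("http://www.runelocus.com/forums/member.php?102786-wapetdxzdk&tab=aboutme#aboutme");
            URLConnection connection = url.openConnection();

            connection.setDoInput(true);
            InputStream inStream = connection.getInputStream();
            BufferedReader input = new BufferedReader(new InputStreamReader(inStream));

            // Read line by line; readLine() strips line terminators,
            // so html ends up as one long line.
            String html = "";
            String line = "";
            while ((line = input.readLine()) != null)
                html += line;
            System.out.println(html);
        }
    }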

Answer 1

If you are trying to scrape the content of an HTML page, you shouldn't be using raw connections like that. Use an existing library; HtmlUnit is a very common choice.

You pass in the URL and it gives you an object representing the page, with all the HTML markup available as objects (e.g. you get a div object for <div> elements, an HtmlAnchor object for <a> elements, etc.). Using an existing framework like HtmlUnit to read the page content will make your life a lot easier.

You can also search the document (e.g. by element ID, by tag name, by attribute, etc.), which makes jumping around easier when the page markup is known in advance.

You can also simulate clicks and other interactions as you need to.

Answer 2

I ran your code and it seems to be getting all the HTML, including the closing </html> tag.

Have you considered that you might have to be logged in to the website to see more? In that case, a library like the one user tsOverflow suggests might be helpful.
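
One quick way to tell a download that was really cut short from a complete page that simply has less content (for example, because the server hides details from visitors who aren't logged in) is to check for the closing tag, as I did above. This is only a minimal sketch: the class and method names are made up for illustration, and the sample strings stand in for real downloads.

```java
import java.util.Locale;

class PageCheck {
    /** Returns true if the text contains a closing html tag, in any letter case. */
    static boolean hasClosingHtmlTag(String html) {
        return html.toLowerCase(Locale.ROOT).contains("</html>");
    }

    public static void main(String[] args) {
        // Stand-ins for downloaded page source (hypothetical content).
        String complete = "<html><body><p>profile</p></body></HTML>";
        String cutShort = "<html><body><p>prof";

        System.out.println(hasClosingHtmlTag(complete)); // true
        System.out.println(hasClosingHtmlTag(cutShort)); // false
    }
}
```

Keep in mind that the code in the question appends lines without a separator, so the printed result is one long line and can't be compared by line number with the browser's source view.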

Answer 3

Looking at this, my best guess is that your while loop condition is bad. I'm unfamiliar with the syntax you're using. Mind you, I have not used Java in a while, but I feel like it should read...

String line = input.readLine();
while(line != null)
{
    html += line; //should use a StringBuilder here for optimization
    line = input.readLine();
}

As the comment notes, a StringBuilder would be more efficient here. I also think this would be easier using the Scanner class.
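
For what it's worth, here is a sketch of the StringBuilder version. As far as I can tell, the assign-and-compare condition from the question is a common Java idiom and reads the same lines as the expanded loop above, so the loop shape alone may not explain the missing lines. The class and method names are invented for this example, and an in-memory string stands in for the real page so it runs without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class SourceReader {
    /** Reads everything from the reader, putting one '\n' after each line. */
    static String readAll(Reader source) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader input = new BufferedReader(source)) {
            String line;
            // Assign, then compare against null (null means end of stream).
            while ((line = input.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical page content used in place of a real download.
        String page = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>";
        String html = readAll(new StringReader(page));
        System.out.println(html.split("\n").length + " lines read"); // prints "5 lines read"
    }
}
```

Keeping the newlines makes it possible to count lines in the result and compare them with what the browser shows.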

Answer 4

Maybe it would help to open the InputStreamReader with an explicit charset? Looking at the page you mention, its charset is ISO-8859-1:

BufferedReader input = 
    new BufferedReader(new InputStreamReader(inStream, "ISO-8859-1"));
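
To show what the charset argument changes, here is a small sketch that decodes the same bytes twice. The class name, method name, and sample text are made up for the example. In my understanding a mismatched charset usually garbles characters rather than cutting the text short, so treat this as something to rule out rather than a confirmed fix.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    /** Reads the first line of the stream, decoding bytes with the given charset. */
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader input = new BufferedReader(new InputStreamReader(in, charset))) {
            return input.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as ISO-8859-1: the é becomes the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String right = readFirstLine(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        String wrong = readFirstLine(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\u00e9")); // false: 0xE9 alone is not valid UTF-8
    }
}
```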

4 answers

solution1
1 2012-07-09 14:55:01

solution2
0 2012-07-09 15:30:06

solution3
0 2012-07-09 16:13:49

solution4
0 2012-07-09 17:23:06
