Ripping HTML page source trouble in Java

I'm trying to rip the HTML page source of a website to get an email. When I run the ripper (or dumper, or whatever you want to call it), it fetches the source but stops at line 160, even though the entire source is a little over 200 lines. I can get the full source manually by going to the webpage, right-clicking, choosing "view page source", and then parsing the text. The only problem with doing that for each page is that there are over 100k pages, so it would take a long time.

Here's the code I'm using to get the page source:

    public static void main(String[] args) throws IOException, InterruptedException {
        URL url = new URL("http://www.runelocus.com/forums/member.php?102786-wapetdxzdk&tab=aboutme#aboutme");
        URLConnection connection = url.openConnection();

        connection.setDoInput(true);
        InputStream inStream = connection.getInputStream();
        BufferedReader input = new BufferedReader(new InputStreamReader(inStream));

        String html = "";
        String line = "";
        while ((line = input.readLine()) != null)
            html += line;
        System.out.println(html);
    }

If you are trying to scrape the content of an HTML page, you shouldn't be using raw connections like that. Use an existing library: HtmlUnit is a very common one.

You pass in the URL and it gives you an object representing the page, with all the HTML markup exposed as objects (e.g. you get a Div object for <div> elements, an HTMLAnchor object for <a> elements, etc.). Using an existing framework like HtmlUnit to read off the content of the page will make your life a lot easier.

You can also search the document (e.g. elementById, elementByTagName, by attribute, etc.), which makes jumping around it easier when the page markup is known in advance.

You can also simulate clicks and similar interactions as you need to.

I ran your code and it seems to be getting all the HTML, including the closing HTML tag.

Have you considered that you might have to be logged in to the website to see more? In that case a library like the one user tsOverflow suggests might be helpful.

Looking at this, my best guess is that your while loop condition is bad. I'm unfamiliar with the syntax you're using. Mind you, I have not used Java in a while. But I feel like it should read:

    String line = input.readLine();
    while (line != null)
    {
        html += line; // should use a StringBuilder here for optimization
        line = input.readLine();
    }

Note the StringBuilder optimization mentioned in the comment. Also, I think this would be easier using the Scanner class.
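As a rough sketch of those two suggestions, the snippet below reads a whole stream once with a StringBuilder and once with a Scanner. The class name `PageReader` and the method names `readAll` and `readAllWithScanner` are made up for illustration, and the sample input is a string rather than the real page. Unlike the code in the question, `readAll` keeps a line break after each line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical helper class; the names are illustrative only.
final class PageReader {

    // Reads every line from the source into a StringBuilder,
    // appending '\n' after each line so line breaks are preserved.
    static String readAll(Reader source) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader input = new BufferedReader(source)) {
            String line = input.readLine();
            while (line != null) {
                html.append(line).append('\n');
                line = input.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    // Scanner variant: the "\\A" delimiter only matches the start of the
    // input, so a single next() call returns the entire remaining content.
    static String readAllWithScanner(Readable source) {
        try (Scanner scanner = new Scanner(source).useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        String sample = "<html>\n<body>hi</body>\n</html>";
        System.out.println(readAll(new StringReader(sample)));
        System.out.println(readAllWithScanner(new StringReader(sample)));
    }
}
```

With the code from the question, either method could be called with `new InputStreamReader(inStream)` as the argument.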

Maybe it helps to open the InputStreamReader with a different charset? Looking at the page you mention, the charset is ISO-8859-1:

    BufferedReader input =
        new BufferedReader(new InputStreamReader(inStream, "ISO-8859-1"));
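Rather than hard-coding the charset, one option is to read it from the Content-Type header the server sends. The sketch below is only an illustration: the class name `CharsetSniffer` and the method `charsetFrom` are invented here, and it assumes the header has the usual `text/html; charset=...` shape. It falls back to a caller-supplied default when the charset is missing or unsupported.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class CharsetSniffer {

    // Extracts the charset parameter from a Content-Type header value,
    // returning the fallback if the header is null, has no charset
    // parameter, or names a charset this JVM does not support.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String prefix = "charset=";
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.regionMatches(true, 0, prefix, 0, prefix.length())) {
                String name = param.substring(prefix.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetFrom("text/html", StandardCharsets.UTF_8));
    }
}
```

In the question's code this might be used as `new InputStreamReader(inStream, CharsetSniffer.charsetFrom(connection.getContentType(), StandardCharsets.ISO_8859_1))`.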
