
Creating a Web Crawler in Java EE

I am creating a web crawler using Java EE technologies. I have created a crawler service which stores the results of the crawl as CrawlerElement objects, each of which holds the information of interest to me.

Currently I am using the JSoup library to do this, but it is not reliable: I attempt the connection three times and use a 10-second timeout, and it still fails.

By unreliable I mean that even pages which can be accessed publicly cannot be accessed by the crawler program. I know this could be due to robots.txt exclusion, but it is unreliable even on pages where crawling is allowed.
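For reference, here is a minimal sketch of the retry behaviour described above (the helper class, URL parameter, and user agent string are illustrative assumptions, not the actual crawler code):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetcher {

    // Fetch a page with up to maxRetries attempts and a 10-second timeout,
    // mirroring the retry behaviour described above.
    public static Document fetch(String url, int maxRetries) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return Jsoup.connect(url)
                        .timeout(10_000)          // 10 seconds
                        .userAgent("Mozilla/5.0") // some sites reject empty user agents
                        .get();
            } catch (IOException e) {
                lastFailure = e; // remember the failure and retry
            }
        }
        throw lastFailure; // all attempts failed
    }
}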

So I decided to go with the URLConnection object, which is obtained via openConnection() and then connected with its connect() method.

I have one more requirement which is bugging me, and that is: I have to get the response time in milliseconds for a CrawlerElement, meaning how long it took to load page B from page A. I checked the methods of URLConnection and there is no way to do that directly.

Any ideas on this topic? Can anyone help me?

I was thinking of taking the current time in milliseconds before the content-fetching code runs and again after it, subtracting the two, and saving the difference in the database, but I wondered whether that would be accurate.
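That before/after approach is essentially sound. Here is a minimal sketch of it, assuming the whole response body is read so the measurement covers the full page load rather than just the connection handshake (System.nanoTime() is used because it is monotonic, which makes it safer for measuring intervals than System.currentTimeMillis()):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class ResponseTimer {

    // Returns the elapsed time in milliseconds to connect and read the
    // entire response body, which is what "time to load page B" means here.
    public static long timeFetch(String pageUrl) throws IOException {
        long start = System.nanoTime(); // monotonic clock, safe for intervals
        URLConnection conn = new URL(pageUrl).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // drain the stream so the whole body is downloaded
            }
        }
        return (System.nanoTime() - start) / 1_000_000; // ns -> ms
    }
}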

Thanks in advance.

EDIT: CURRENT IMPLEMENTATION

My current implementation, which gives me the statusCode, contentType, etc.:

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class GetContent {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.javacoffeebreak.com/faq/faq0079.html");

        long startTime = System.currentTimeMillis();
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("Authorization", "Basic bG9hbnNkZXY6bG9AbnNkM3Y=");
        uc.setRequestProperty("User-Agent", "");
        uc.connect();
        long endTime = System.currentTimeMillis();

        // Note: this measures only the time to establish the connection,
        // not the time to download the response body.
        System.out.println(endTime - startTime);

        String contentType = uc.getContentType();
        System.out.println(contentType);

        // The status line (e.g. "HTTP/1.1 200 OK") is header field 0.
        String statusCode = uc.getHeaderField(0);
        System.out.println(statusCode);
    }
}

What do you say: is it okay to do it this way, or should I use heavyweight APIs like Apache HttpClient or Apache Nutch?

It's better to use proven frameworks than to reinvent the wheel. Try Apache Nutch (I recommend the 1.x branch; 2.x seems to be too raw). It will be a lot of pain to implement your own crawler with support for parallelism, robots.txt / "noindex" metatags, redirects, reliability... There are so many issues to solve.

OK, so you have already done the work and are running into problems with that API/library. I know it is daunting to build something and then throw away all that code and shift to something else, but if it is possible for you: JSoup is just a parser library and may cause you more problems in the future, so I suggest you use a more stable API. You can also use crawler4j for that purpose.
Here is the list of some open source crawler APIs; by doing some R&D you can find a good solution for this :)
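As a rough illustration, a minimal crawler4j skeleton might look like the following (written against the crawler4j 4.x API; the MyCrawler class name, seed URL, and storage folder are assumptions, and method signatures differ slightly in older versions):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Restrict the crawl to one site; adjust to your needs.
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Build your CrawlerElement from the fetched page here.
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // working directory
        config.setPolitenessDelay(1000);            // be polite: 1 request/second

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 4);       // 4 crawler threads
    }
}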

Try the Apache HttpClient library. I've had good results with it. It seems to be quite a bit better for HTTP-specific communication.
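As a sketch (assuming HttpClient 4.x; the timeout values and user agent string are assumptions), HttpClient can also cover the status code, content type, and response-time requirements from the question in one place:

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10_000) // 10 s to establish the connection
                .setSocketTimeout(10_000)  // 10 s max between data packets
                .build();

        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .setUserAgent("MyCrawler/1.0")
                .build()) {

            long start = System.nanoTime();
            HttpGet get = new HttpGet("http://www.javacoffeebreak.com/faq/faq0079.html");
            try (CloseableHttpResponse response = client.execute(get)) {
                HttpEntity entity = response.getEntity();
                // Consuming the entity downloads the whole body, so the elapsed
                // time below covers the full page load, not just the handshake.
                String body = EntityUtils.toString(entity);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;

                System.out.println(response.getStatusLine().getStatusCode());
                System.out.println(entity.getContentType()); // may be null if absent
                System.out.println(elapsedMs + " ms, " + body.length() + " chars");
            }
        }
    }
}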
