
Crawling and extracting info using crawler4j

I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all, go through each port, extract the name and coordinates, and write them to a file. The main class looks as follows:

import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;


public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
         String crawlStorageFolder = "data";
         int numberOfCrawlers = 5;

         CrawlConfig config = new CrawlConfig();
         config.setCrawlStorageFolder(crawlStorageFolder);
         config.setMaxDepthOfCrawling(2);
         config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
         //config.setPolitenessDelay(20);
         /*
          * Instantiate the controller for this crawl.
          */
         PageFetcher pageFetcher = new PageFetcher(config);
         RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
         RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

         /*
          * For each crawl, you need to add some seed urls. These are the first
          * URLs that are fetched and then the crawler starts following links
          * which are found in these pages
          */
         controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

         /*
          * Start the crawl. This is a blocking operation, meaning that your code
          * will reach the line after this only when crawling is finished.
          */
         controller.start(PortExtractor.class, numberOfCrawlers);    

         System.out.println("finished reading");
         System.out.println("Ports: " + PortExtractor.portList.size());
         FileWriter writer = new FileWriter("PortInfo2.txt");

         System.out.println("Writing to file...");
         for(Port p : PortExtractor.portList){
            writer.append(p.print() + "\n");
            writer.flush();
         }
         writer.close();
        System.out.println("File written");
        }
}

While the PortExtractor looks like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
        );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }

}

How do I go about writing the HTML parser, and how can I specify that the program should not crawl anything other than the port info links? I'm having difficulty with this: even with the code running, it breaks every time I try to work with the HTML parsing. Any help would be much appreciated.
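The main class above iterates over `PortExtractor.portList` and calls `p.print()`, but the `Port` class itself is not shown in the question. A minimal sketch of what it might look like (the field names and `print()` format here are assumptions, not the original code):

```java
// Hypothetical sketch of the Port class referenced above: it holds a port
// name plus coordinates, and exposes the print() method used when writing
// each entry to the output file.
public class Port {
    private final String name;
    private final double latitude;
    private final double longitude;

    public Port(String name, double latitude, double longitude) {
        this.name = name;
        this.latitude = latitude;
        this.longitude = longitude;
    }

    // Produces one output line, e.g. "SINGAPORE: 1.2593655, 103.75445"
    public String print() {
        return name + ": " + latitude + ", " + longitude;
    }
}
```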

The first task is to check the site's robots.txt in order to see whether crawler4j will actually be allowed to crawl this website. Investigating this file, we find that this will be no problem:

User-agent: *
Allow: /
Disallow: /mob/
Disallow: /upload/
Disallow: /users/
Disallow: /wiki/

Second, we need to figure out which links are of particular interest for your purpose. This needs some manual investigation. I only checked a few entries of the link mentioned above, but I found that every port contains the keyword ports in its link, e.g.

http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50
http://www.marinetraffic.com/en/ais/details/ports/18853/China_port:YANGZHOU
http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN

With this information, we are able to modify the shouldVisit method in a whitelisting manner.

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.contains("www.marinetraffic.com")
            && href.contains("ports");
}

This is a very simple implementation, which could be enhanced with regular expressions.
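A sketch of such a regex-based whitelist, built from the URL patterns observed above (the exact pattern is an assumption based only on the few sample links listed; the real site may use additional URL shapes):

```java
import java.util.regex.Pattern;

public class PortUrlFilter {
    // Hypothetical whitelist: accept the paginated port index pages and the
    // individual port detail pages, and reject everything else.
    private static final Pattern PORT_PAGES = Pattern.compile(
            "https?://www\\.marinetraffic\\.com/en/ais/"
            + "(index/ports/all(/per_page:\\d+)?"   // index pages, optionally paginated
            + "|details/ports/\\d+.*)"              // detail pages, e.g. /details/ports/793/...
    );

    public static boolean isPortPage(String url) {
        return PORT_PAGES.matcher(url.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        // The first two are sample links from above; the last is a map link.
        System.out.println(isPortPage("http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"));   // true
        System.out.println(isPortPage("http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN")); // true
        System.out.println(isPortPage("http://www.marinetraffic.com/en/ais/home/zoom:14/centerx:103.75445"));     // false
    }
}
```

Inside shouldVisit you would then simply return `!FILTERS.matcher(href).matches() && isPortPage(href)`.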

Third, we need to parse the data out of the HTML. The information you are looking for is contained in the following <div> section:

<div class="bg-info bg-light padding-10 radius-4 text-left">
    <div>
        <span>Latitude / Longitude: </span>
        <b>1.2593655° / 103.75445°</b>
        <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655" title="Show on Map"><img class="loaded" src="/img/icons/show_on_map_magnify.png" data-original="/img/icons/show_on_map_magnify.png" alt="Show on Map" title="Show on Map"></a>
        <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655/showports:1" title="Show on Map">Show on Map</a>
    </div>

    <div>
        <span>Local Time:</span>
                <b><time>2016-12-11 19:20</time>&nbsp;[UTC +8]</b>
    </div>

            <div>
            <span>Un/locode: </span>
            <b>SGSIN</b>
        </div>

            <div>
            <span>Vessels in Port: </span>
            <b><a href="/en/ais/index/ships/range/port_id:290/port_name:SINGAPORE">1021</a></b>
        </div>

            <div>
            <span>Expected Arrivals: </span>
            <b><a href="/en/ais/index/eta/all/port:290/portname:SINGAPORE">1059</a></b>
        </div>

</div>

Basically, I would use an HTML parser (e.g. Jericho) for this task. Then you are able to extract exactly the correct <div> section and obtain the attributes you are looking for.
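If you want to avoid a parser dependency for a first attempt, the latitude/longitude pair in the snippet above can also be pulled out with a plain regular expression keyed to the `<b>lat° / lon°</b>` element. This is only a sketch for exactly that markup shape; a real parser such as Jericho or jsoup is more robust against layout changes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoordinateExtractor {
    // Matches the "<b>1.2593655° / 103.75445°</b>" element from the
    // coordinates <div> shown above and captures the two numbers.
    private static final Pattern LAT_LON = Pattern.compile(
            "<b>\\s*(-?\\d+(?:\\.\\d+)?)°\\s*/\\s*(-?\\d+(?:\\.\\d+)?)°\\s*</b>");

    // Returns {latitude, longitude}, or null if the pattern is not found.
    public static double[] extract(String html) {
        Matcher m = LAT_LON.matcher(html);
        if (m.find()) {
            return new double[] {
                Double.parseDouble(m.group(1)),   // latitude
                Double.parseDouble(m.group(2))    // longitude
            };
        }
        return null;
    }

    public static void main(String[] args) {
        String snippet = "<span>Latitude / Longitude: </span>"
                + "<b>1.2593655° / 103.75445°</b>";
        double[] coords = extract(snippet);
        System.out.println(coords[0] + ", " + coords[1]);  // 1.2593655, 103.75445
    }
}
```

In the crawler you would call `extract(...)` from `visit(Page page)` on the fetched page's HTML and build a `Port` from the result.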
