
Crawling and extracting info using crawler4j

I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all , go through every port, extract the name and coordinates, and write them out to a file. The main class looks like this:

import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;


public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
         String crawlStorageFolder = "data";
         int numberOfCrawlers = 5;

         CrawlConfig config = new CrawlConfig();
         config.setCrawlStorageFolder(crawlStorageFolder);
         config.setMaxDepthOfCrawling(2);
         config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
         //config.setPolitenessDelay(20);
         /*
          * Instantiate the controller for this crawl.
          */
         PageFetcher pageFetcher = new PageFetcher(config);
         RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
         RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

         /*
          * For each crawl, you need to add some seed urls. These are the first
          * URLs that are fetched and then the crawler starts following links
          * which are found in these pages
          */
         controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

         /*
          * Start the crawl. This is a blocking operation, meaning that your code
          * will reach the line after this only when crawling is finished.
          */
         controller.start(PortExtractor.class, numberOfCrawlers);    

         System.out.println("finished reading");
         System.out.println("Ports: " + PortExtractor.portList.size());
         FileWriter writer = new FileWriter("PortInfo2.txt");

         System.out.println("Writing to file...");
         for(Port p : PortExtractor.portList){
            writer.append(p.print() + "\n");
            writer.flush();
         }
         writer.close();
        System.out.println("File written");
        }
}

while the PortExtractor looks like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
        );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        //return !FILTERS.matcher(href).matches() && href.startsWith("http://www.worldportsource.com/countries.php") && !href.contains("/shipping/") && !href.contains("/cruising/") && !href.contains("/Today's Port of Call/") && !href.contains("/portcall/") && !href.contains("/localviews/") && !href.contains("/commerce/") && !href.contains("/maps/") && !href.contains("/waterways/");
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }

}
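
The Port class itself is not shown in the question; for the code above to compile, a minimal placeholder consistent with the portList and p.print() usage could look like the following (the field names and constructor are assumptions):

public class Port {

    // Hypothetical fields, inferred from the usage above.
    private final String name;
    private final String coordinates;

    public Port(String name, String coordinates) {
        this.name = name;
        this.coordinates = coordinates;
    }

    public String print() {
        return name + " | " + coordinates;
    }
}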

How do I go about writing the HTML parser, and how should I tell the program not to crawl anything other than the port info links? I am having difficulty with this; even when the code runs, it breaks every time I try to work with the HTML parsing. Any help would be greatly appreciated.

The first task is to check the website's robots.txt in order to see whether crawler4j is allowed to crawl this website at all. Investigating this file, we find that it will be no problem:

User-agent: *
Allow: /
Disallow: /mob/
Disallow: /upload/
Disallow: /users/
Disallow: /wiki/
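
crawler4j checks these rules automatically through the RobotstxtServer that the main class already wires up. As a side note, the robots.txt handling can be tuned through RobotstxtConfig; a minimal sketch (the agent name below is just an example):

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(true);               // honour robots.txt rules (the default)
robotstxtConfig.setUserAgentName("crawler4j");  // name matched against "User-agent:" lines
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);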

Secondly, we need to figure out which links are of particular interest for your purpose. This requires some manual investigation. I only checked a few entries of the link mentioned above, but found that every port contains the keyword ports in its link, e.g.

http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50
http://www.marinetraffic.com/en/ais/details/ports/18853/China_port:YANGZHOU
http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN

Given this information, we can modify the shouldVisit method in a whitelist manner.

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.contains("www.marinetraffic.com")
            && href.contains("ports");
}

This is a very basic implementation, which could be enhanced with regular expressions.
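
For instance, a sketch of such a regex-based whitelist, with a pattern derived from the three example URLs above (the pattern itself is an assumption and may need widening):

// Hypothetical stricter whitelist: only port index and port detail pages.
private static final Pattern PORT_PAGES = Pattern.compile(
        "^http://www\\.marinetraffic\\.com/en/ais/(index|details)/ports/.*$");

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && PORT_PAGES.matcher(href).matches();
}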

Thirdly, we need to parse the data out of the HTML. The information you are looking for is contained in the following <div> section:

<div class="bg-info bg-light padding-10 radius-4 text-left">
    <div>
        <span>Latitude / Longitude: </span>
        <b>1.2593655° / 103.75445°</b>
        <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655" title="Show on Map"><img class="loaded" src="/img/icons/show_on_map_magnify.png" data-original="/img/icons/show_on_map_magnify.png" alt="Show on Map" title="Show on Map"></a>
        <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655/showports:1" title="Show on Map">Show on Map</a>
    </div>

    <div>
        <span>Local Time:</span>
                <b><time>2016-12-11 19:20</time>&nbsp;[UTC +8]</b>
    </div>

            <div>
            <span>Un/locode: </span>
            <b>SGSIN</b>
        </div>

            <div>
            <span>Vessels in Port: </span>
            <b><a href="/en/ais/index/ships/range/port_id:290/port_name:SINGAPORE">1021</a></b>
        </div>

            <div>
            <span>Expected Arrivals: </span>
            <b><a href="/en/ais/index/eta/all/port:290/portname:SINGAPORE">1059</a></b>
        </div>

</div>

Basically, I would use an HTML parser (e.g. Jericho) for this task. Then you can extract exactly the correct <div> section and obtain the attributes you need.
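
A sketch of what that could look like in a revised visit() of PortExtractor, assuming the Jericho parser is on the classpath and reusing the hypothetical Port(name, coordinates) constructor from above (extra imports shown at the top):

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

import edu.uci.ics.crawler4j.parser.HtmlParseData;

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    // Only port detail pages carry the coordinate block (pattern taken from the example URLs).
    if (!url.contains("/details/ports/") || !(page.getParseData() instanceof HtmlParseData)) {
        return;
    }
    String html = ((HtmlParseData) page.getParseData()).getHtml();
    Source source = new Source(html);

    for (Element div : source.getAllElements(HTMLElementName.DIV)) {
        String cssClass = div.getAttributeValue("class");
        if (cssClass == null || !cssClass.contains("bg-info")) {
            continue;
        }
        // Each child <div> pairs a <span> label with a <b> value.
        for (Element row : div.getChildElements()) {
            Element label = row.getFirstElement(HTMLElementName.SPAN);
            Element value = row.getFirstElement(HTMLElementName.B);
            if (label != null && value != null
                    && label.getTextExtractor().toString().startsWith("Latitude")) {
                String coordinates = value.getTextExtractor().toString(); // e.g. "1.2593655° / 103.75445°"
                portList.add(new Port(url, coordinates)); // hypothetical Port constructor
            }
        }
    }
}

The coordinate string could then be split on "/" and parsed with Double.parseDouble after stripping the degree signs.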
