
How to send crawler4j data to CrawlerManager?

I'm working on a project where a user can search some websites and look for pictures that have a unique identifier.

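import java.util.Base64;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;
import org.springframework.security.authentication.UsernamePasswordAuthenticationToken;
import org.springframework.security.core.context.SecurityContextHolder;

public class ImageCrawler extends WebCrawler {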

private static final Pattern filters = Pattern.compile(
        ".*(\\.(css|js|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
                "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

private static final Pattern imgPatterns = Pattern.compile(".*(\\.(bmp|gif|jpe?g|png|tiff?))$");

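// Note: visit() uses urlScan, decodePictureService, pictureRepository and
// urlScanRepository; their declarations are not shown in the question.
public ImageCrawler() {
}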

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    if (filters.matcher(href).matches()) {
        return false;
    }

    if (imgPatterns.matcher(href).matches()) {
        return true;
    }

    return false;
}

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();

    byte[] imageBytes = page.getContentData();
    String imageBase64 = Base64.getEncoder().encodeToString(imageBytes);
    try {
        SecurityContextHolder.getContext().setAuthentication(new UsernamePasswordAuthenticationToken(urlScan.getOwner(), null));
        DecodePictureResponse decodePictureResponse = decodePictureService.decodePicture(imageBase64);
        URLScanResult urlScanResult = new URLScanResult();
        urlScanResult.setPicture(pictureRepository.findByUuid(decodePictureResponse.getPictureDTO().getUuid()).get());
        urlScanResult.setIntegrity(decodePictureResponse.isIntegrity());
        urlScanResult.setPictureUrl(url);
        urlScanResult.setUrlScan(urlScan);
        urlScan.getResults().add(urlScanResult);
        urlScanRepository.save(urlScan);
    } catch (ResourceNotFoundException ex) {
        // Picture is not in our database
    }
}
}

Crawlers will be run independently. The ImageCrawlerManager class, which is a singleton, runs the crawlers.

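import javax.persistence.PersistenceContext;
import javax.persistence.PersistenceContextType;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

public class ImageCrawlerManager {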

private static ImageCrawlerManager instance = null;


private ImageCrawlerManager(){
}

public synchronized static ImageCrawlerManager getInstance()
{
    if (instance == null)
    {
        instance = new ImageCrawlerManager();
    }
    return instance;
}

@Transactional(propagation=Propagation.REQUIRED)
@PersistenceContext(type = PersistenceContextType.EXTENDED)
public void startCrawler(URLScan urlScan, DecodePictureService decodePictureService, URLScanRepository urlScanRepository, PictureRepository pictureRepository){

    try {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp");
        config.setIncludeBinaryContentInCrawling(true);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed(urlScan.getUrl());

        controller.start(ImageCrawler.class, 1);
        urlScan.setStatus(URLScanStatus.FINISHED);
        urlScanRepository.save(urlScan);
    } catch (Exception e) {
        e.printStackTrace();
        urlScan.setStatus(URLScanStatus.FAILED);
        urlScan.setFailedReason(e.getMessage());
        urlScanRepository.save(urlScan);
    }
}
}

How can I send every image's data to the manager, which decodes the image, gets the initiator of the search, and saves the results to the database? With the code above I can run multiple crawlers and save their results to the database. Unfortunately, when I run two crawlers simultaneously, both search results are stored, but all of them are connected to the crawler that was started first.

You should inject your database service into your WebCrawler instances and not use a singleton to manage the results of your web crawl.

crawler4j supports a custom CrawlController.WebCrawlerFactory (see the crawler4j documentation for reference), which can be used with Spring to inject your database service into an ImageCrawler instance.
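A minimal sketch of the factory idea follows. To keep it compilable without crawler4j or Spring on the classpath, WebCrawler and WebCrawlerFactory are declared here as small stand-ins that only mirror the shape of the library types; ScanContext and ImageCrawlerFactory are hypothetical names introduced for illustration and are not part of the question's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for crawler4j's WebCrawler so the sketch compiles without the library.
abstract class WebCrawler {
    abstract void visit(String pageUrl);
}

// Same shape as CrawlController.WebCrawlerFactory<T>: a single newInstance() method.
interface WebCrawlerFactory<T extends WebCrawler> {
    T newInstance() throws Exception;
}

// Hypothetical per-scan state (the role URLScan plays in the question).
final class ScanContext {
    final String owner;
    // Synchronized because several crawler threads of one scan may add results.
    final List<String> results = Collections.synchronizedList(new ArrayList<>());

    ScanContext(String owner) {
        this.owner = owner;
    }
}

// The crawler receives its scan through the constructor instead of shared static state.
final class ImageCrawler extends WebCrawler {
    private final ScanContext scan;

    ImageCrawler(ScanContext scan) {
        this.scan = scan;
    }

    @Override
    void visit(String pageUrl) {
        scan.results.add(pageUrl);
    }
}

// One factory per scan: every crawler it creates is bound to that scan only.
final class ImageCrawlerFactory implements WebCrawlerFactory<ImageCrawler> {
    private final ScanContext scan;

    ImageCrawlerFactory(ScanContext scan) {
        this.scan = scan;
    }

    @Override
    public ImageCrawler newInstance() {
        return new ImageCrawler(scan);
    }
}

class FactoryDemo {
    public static void main(String[] args) {
        ScanContext first = new ScanContext("alice");
        ScanContext second = new ScanContext("bob");

        // Two independent scans, each with its own factory and crawler.
        new ImageCrawlerFactory(first).newInstance().visit("http://a.example/1.png");
        new ImageCrawlerFactory(second).newInstance().visit("http://b.example/2.png");

        System.out.println(first.owner + " -> " + first.results);
        System.out.println(second.owner + " -> " + second.results);
    }
}
```

With the real library, such a factory would be passed to the controller in place of ImageCrawler.class, assuming your crawler4j version offers the start overload that accepts a WebCrawlerFactory.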

Every single crawler thread should be responsible for the whole process you described (e.g. by using dedicated services for it):

"decode this image, get the initiator of search and save results to database"
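The three steps quoted above could be handled inside each crawler roughly as sketched here. PictureDecoder, ResultStore, ScanResult and ScanWorker are hypothetical stand-ins for the question's DecodePictureService, repositories and URLScanResult; the real service signatures are not shown in the question, so treat this as an outline of the flow rather than a drop-in implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Optional;

// Hypothetical decoder: returns the picture UUID, or empty if the picture is unknown.
interface PictureDecoder {
    Optional<String> decode(String imageBase64);
}

// Hypothetical result row: who started the scan, which picture, found where.
final class ScanResult {
    final String owner;
    final String pictureUuid;
    final String pictureUrl;

    ScanResult(String owner, String pictureUuid, String pictureUrl) {
        this.owner = owner;
        this.pictureUuid = pictureUuid;
        this.pictureUrl = pictureUrl;
    }
}

// Hypothetical persistence boundary (a repository in the question's code).
interface ResultStore {
    void save(ScanResult result);
}

// Everything one crawler needs for one scan; nothing is shared through a singleton.
final class ScanWorker {
    private final String owner;
    private final PictureDecoder decoder;
    private final ResultStore store;

    ScanWorker(String owner, PictureDecoder decoder, ResultStore store) {
        this.owner = owner;
        this.decoder = decoder;
        this.store = store;
    }

    // Decode the image, attach the initiator, save the result.
    // Returns false when the picture is not known to the decoder.
    boolean handleImage(String url, byte[] imageBytes) {
        String imageBase64 = Base64.getEncoder().encodeToString(imageBytes);
        Optional<String> uuid = decoder.decode(imageBase64);
        if (!uuid.isPresent()) {
            return false;
        }
        store.save(new ScanResult(owner, uuid.get(), url));
        return true;
    }
}

class ScanWorkerDemo {
    public static void main(String[] args) {
        byte[] knownImage = "known".getBytes(StandardCharsets.UTF_8);
        String knownBase64 = Base64.getEncoder().encodeToString(knownImage);

        // Toy decoder that recognizes exactly one picture.
        PictureDecoder decoder =
                b64 -> b64.equals(knownBase64) ? Optional.of("uuid-1") : Optional.empty();
        List<ScanResult> saved = new ArrayList<>();
        ScanWorker worker = new ScanWorker("alice", decoder, saved::add);

        worker.handleImage("http://a.example/1.png", knownImage);
        worker.handleImage("http://a.example/2.png", "other".getBytes(StandardCharsets.UTF_8));

        System.out.println("saved results: " + saved.size());
    }
}
```

In the question's setup, a crawler's visit method would delegate to something like handleImage with page.getWebURL().getURL() and page.getContentData().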

With this setup, your database is the only source of truth, and you do not have to synchronize crawler state between different instances or user sessions.
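To illustrate the point, here is a small sketch of two scans running at the same time and writing into one shared store keyed by scan. ResultTable is an in-memory stand-in for the database, and ScanJob and the scan ids are hypothetical; no crawler4j or Spring types are involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// In-memory stand-in for the database: rows are keyed by the scan they belong to.
final class ResultTable {
    private final Map<String, List<String>> rows = new ConcurrentHashMap<>();

    void insert(String scanId, String pictureUrl) {
        rows.computeIfAbsent(scanId, id -> new CopyOnWriteArrayList<>()).add(pictureUrl);
    }

    List<String> findByScan(String scanId) {
        return rows.getOrDefault(scanId, Collections.emptyList());
    }
}

// One job per scan; it carries its own scan id and never reads another job's state.
final class ScanJob implements Runnable {
    private final String scanId;
    private final List<String> foundUrls;
    private final ResultTable table;

    ScanJob(String scanId, List<String> foundUrls, ResultTable table) {
        this.scanId = scanId;
        this.foundUrls = foundUrls;
        this.table = table;
    }

    @Override
    public void run() {
        for (String url : foundUrls) {
            table.insert(scanId, url);
        }
    }
}

class ConcurrentScansDemo {
    // Runs two scans in parallel and returns the shared table once both are done.
    static ResultTable runBoth() {
        ResultTable table = new ResultTable();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(new ScanJob("scan-1", Arrays.asList("http://a.example/1.png", "http://a.example/2.png"), table));
        pool.submit(new ScanJob("scan-2", Arrays.asList("http://b.example/1.png"), table));
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("scans did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for scans", e);
        }
        return table;
    }

    public static void main(String[] args) {
        ResultTable table = runBoth();
        System.out.println("scan-1: " + table.findByScan("scan-1"));
        System.out.println("scan-2: " + table.findByScan("scan-2"));
    }
}
```

Because each job only writes rows tagged with its own scan id, the results of simultaneous scans cannot end up attached to whichever scan started first.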

