How to send crawler4j data to CrawlerManager?
I am working on a project where users can scan websites and look for pictures that carry a unique identifier.
    public class ImageCrawler extends WebCrawler {

        // File types that are never worth fetching.
        private static final Pattern filters = Pattern.compile(
                ".*(\\.(css|js|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
                "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        // Image file types the crawler should download.
        private static final Pattern imgPatterns = Pattern.compile(".*(\\.(bmp|gif|jpe?g|png|tiff?))$");

        // The declarations of urlScan, decodePictureService, pictureRepository
        // and urlScanRepository are omitted from this snippet.

        public ImageCrawler() {
        }

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            if (filters.matcher(href).matches()) {
                return false;
            }
            if (imgPatterns.matcher(href).matches()) {
                return true;
            }
            return false;
        }

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            byte[] imageBytes = page.getContentData();
            String imageBase64 = Base64.getEncoder().encodeToString(imageBytes);
            try {
                // Run the decoding on behalf of the user who started the scan.
                SecurityContextHolder.getContext().setAuthentication(new UsernamePasswordAuthenticationToken(urlScan.getOwner(), null));
                DecodePictureResponse decodePictureResponse = decodePictureService.decodePicture(imageBase64);
                URLScanResult urlScanResult = new URLScanResult();
                urlScanResult.setPicture(pictureRepository.findByUuid(decodePictureResponse.getPictureDTO().getUuid()).get());
                urlScanResult.setIntegrity(decodePictureResponse.isIntegrity());
                urlScanResult.setPictureUrl(url);
                urlScanResult.setUrlScan(urlScan);
                urlScan.getResults().add(urlScanResult);
                urlScanRepository.save(urlScan);
            } catch (ResourceNotFoundException ex) {
                // Picture is not in our database
            }
        }
    }
The crawlers run independently. The ImageCrawlerManager class (a singleton) runs the crawlers.
    public class ImageCrawlerManager {

        private static ImageCrawlerManager instance = null;

        private ImageCrawlerManager() {
        }

        // Lazily creates the single shared instance.
        public synchronized static ImageCrawlerManager getInstance() {
            if (instance == null) {
                instance = new ImageCrawlerManager();
            }
            return instance;
        }

        @Transactional(propagation=Propagation.REQUIRED)
        @PersistenceContext(type = PersistenceContextType.EXTENDED)
        public void startCrawler(URLScan urlScan, DecodePictureService decodePictureService, URLScanRepository urlScanRepository, PictureRepository pictureRepository) {
            try {
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder("/tmp");
                config.setIncludeBinaryContentInCrawling(true);
                PageFetcher pageFetcher = new PageFetcher(config);
                RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
                RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
                controller.addSeed(urlScan.getUrl());
                // Blocks until the crawl has finished.
                controller.start(ImageCrawler.class, 1);
                urlScan.setStatus(URLScanStatus.FINISHED);
                urlScanRepository.save(urlScan);
            } catch (Exception e) {
                e.printStackTrace();
                urlScan.setStatus(URLScanStatus.FAILED);
                urlScan.setFailedReason(e.getMessage());
                urlScanRepository.save(urlScan);
            }
        }
    }
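How can I send the data of each image to the manager, which decodes the image, gets the initiator of the search, and saves the results to the database? With the code above I can run multiple crawlers and save their results to the database. Unfortunately, when I run two crawlers at the same time, both sets of search results are stored, but all of them are attached to the scan whose crawler was started first.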
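You should inject your database services into your WebCrawler instances rather than use a singleton to manage the results of your web crawl.
crawler4j supports a custom CrawlController.WebCrawlerFactory, which can be used with Spring to inject your database services into your ImageCrawler instances.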
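The factory idea can be sketched in plain Java as follows. This is a minimal illustration, not the crawler4j API itself: CrawlerFactory is a stand-in reduced to the single newInstance() method that crawler4j's CrawlController.WebCrawlerFactory declares, and ScanResultService, InjectedImageCrawler, InjectedImageCrawlerFactory and the numeric scan id are hypothetical names introduced here so the example compiles without the library or Spring.

```java
// Stand-in for crawler4j's CrawlController.WebCrawlerFactory<T>, reduced to
// the one method this sketch needs (assumption: used here only so the
// example compiles without the library).
interface CrawlerFactory<T> {
    T newInstance() throws Exception;
}

// Hypothetical service; in the question this role is played by
// DecodePictureService together with the Spring Data repositories.
interface ScanResultService {
    void saveResult(long scanId, String pictureUrl);
}

// Simplified crawler: everything it needs arrives through the constructor,
// so two crawlers for two different scans never share state.
class InjectedImageCrawler {
    private final long scanId;
    private final ScanResultService service;

    InjectedImageCrawler(long scanId, ScanResultService service) {
        this.scanId = scanId;
        this.service = service;
    }

    // Corresponds to WebCrawler.visit(Page) in the real crawler.
    void visit(String pictureUrl) {
        service.saveResult(scanId, pictureUrl);
    }
}

// One factory per scan: every crawler it creates is bound to that scan.
class InjectedImageCrawlerFactory implements CrawlerFactory<InjectedImageCrawler> {
    private final long scanId;
    private final ScanResultService service;

    InjectedImageCrawlerFactory(long scanId, ScanResultService service) {
        this.scanId = scanId;
        this.service = service;
    }

    @Override
    public InjectedImageCrawler newInstance() {
        return new InjectedImageCrawler(scanId, service);
    }
}
```

In a real application the factory would implement CrawlController.WebCrawlerFactory<ImageCrawler> and be passed to controller.start(factory, 1) in place of ImageCrawler.class, with the services supplied by Spring to whichever bean creates the factory.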
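Every crawler thread should be responsible for the whole process you described (for example, by using some specific services for it): decoding the image, getting the initiator of the search, and saving the results to the database.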
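Set up like this, your database will be the only source of truth, and you will not have to deal with synchronizing crawler state between different instances or user sessions.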
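To illustrate why this removes the need for state synchronization, here is a small sketch of two scans running concurrently. It rests on several assumptions: InMemoryScanStore is an in-memory stand-in for the database, ScanJob stands in for one crawl, and the scan ids and picture URLs are made up for the example. Each job carries its own scan id, so results cannot end up attached to another scan.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// In-memory stand-in for the database: results are keyed by scan id.
class InMemoryScanStore {
    private final Map<Long, List<String>> resultsByScan = new ConcurrentHashMap<>();

    void save(long scanId, String pictureUrl) {
        resultsByScan.computeIfAbsent(scanId, id -> new CopyOnWriteArrayList<>()).add(pictureUrl);
    }

    List<String> resultsFor(long scanId) {
        return resultsByScan.getOrDefault(scanId, Collections.emptyList());
    }
}

// One job per scan; the scan id travels with the job instead of living in
// shared or static state.
class ScanJob implements Runnable {
    private final long scanId;
    private final List<String> pictureUrls; // stands in for the pages a crawler visits
    private final InMemoryScanStore store;

    ScanJob(long scanId, List<String> pictureUrls, InMemoryScanStore store) {
        this.scanId = scanId;
        this.pictureUrls = pictureUrls;
        this.store = store;
    }

    @Override
    public void run() {
        for (String url : pictureUrls) {
            store.save(scanId, url);
        }
    }
}

class ConcurrentScanDemo {
    // Runs two scans in parallel and returns the store holding their results.
    static InMemoryScanStore runTwoScans() {
        InMemoryScanStore store = new InMemoryScanStore();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(new ScanJob(1L, Arrays.asList("a.png", "b.png"), store));
        pool.submit(new ScanJob(2L, Arrays.asList("c.png"), store));
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("scans did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for scans", e);
        }
        return store;
    }
}
```

Calling ConcurrentScanDemo.runTwoScans() and then resultsFor(1L) and resultsFor(2L) on the returned store yields each scan's own pictures, regardless of which job started first.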