Handling redirected domains in StormCrawler

Question

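I am working on a project based on StormCrawler. One of our requirements is to find domains that redirect to another domain. In StormCrawler, each redirected URL counts as one level of crawl depth. For example, for a domain with two redirect steps, we need to crawl with depth=2. How can we resolve all redirected domains without taking crawl depth into account?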

Answer 1

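The filters do not distinguish between URLs that come from redirects and URLs that come from links in a page. You could simply deactivate the depth-based filter and, if necessary, use a custom parse filter to restrict the outlinks.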
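To make the idea concrete, here is a minimal stand-alone sketch of the kind of decision such a custom filter could make once the depth filter is off: always keep redirect targets, and keep ordinary outlinks only when they stay on the same host. This is not StormCrawler code. `OutlinkPolicy`, its `filter` method, and the plain `Map` standing in for the metadata are hypothetical names for illustration, and the `_redirTo` key is simply the one used in the code from Answer 2 of this thread.

```java
import java.net.URI;
import java.util.Map;

/** Hypothetical stand-alone sketch; not a StormCrawler class. */
public class OutlinkPolicy {

    /** Metadata key assumed to mark a source URL that issued a redirect. */
    public static final String REDIRECT_KEY = "_redirTo";

    /**
     * Returns the target URL if it should be kept, or null to drop it.
     * Redirect targets are always kept; ordinary outlinks are kept only
     * when they stay on the same host as the source page.
     */
    public static String filter(String sourceUrl,
            Map<String, String> sourceMetadata, String targetUrl) {
        String redirect = sourceMetadata.get(REDIRECT_KEY);
        if (redirect != null && !redirect.trim().isEmpty()) {
            // the source redirected: follow it, whatever the host
            return targetUrl;
        }
        // URI.create throws IllegalArgumentException on malformed input
        String sourceHost = URI.create(sourceUrl).getHost();
        String targetHost = URI.create(targetUrl).getHost();
        if (sourceHost != null && sourceHost.equalsIgnoreCase(targetHost)) {
            return targetUrl;
        }
        // ordinary outlink to another host: drop it
        return null;
    }
}
```

With a rule like this, redirect chains can be followed to any length while links found inside pages are still restricted, so the crawl does not grow without bound just because the depth limit is gone.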

Answer 2

I modified MaxDepthFilter as follows:

public class MaxDepthFilter implements URLFilter {

    private static final Logger LOG = LoggerFactory
            .getLogger(MaxDepthFilter.class);

    private int maxDepth;
    
    @Override
    public void configure(Map stormConf, JsonNode paramNode) {
        JsonNode node = paramNode.get("maxDepth");
        if (node != null && node.isInt()) {
            maxDepth = node.intValue();
        } else {
            maxDepth = -1;
            LOG.warn("maxDepth parameter not found");
        }
        
    }

    @Override
    public String filter(URL pageUrl, Metadata sourceMetadata, String url) {
        int depth = getDepth(sourceMetadata, MetadataTransfer.depthKeyName);
        
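        // a non-blank "_redirTo" value in the source metadata means the
        // URL being filtered is the target of a redirect
        boolean containsRedir = containsRedirect(sourceMetadata, "_redirTo");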
        
        // is there a custom value set for this particular URL?
        int customMax = getDepth(sourceMetadata,
                MetadataTransfer.maxDepthKeyName);
        if (customMax >= 0) {
            return filter(depth, customMax, url);
        }
        // rely on the default max otherwise; URLs whose source metadata
        // marks a redirect bypass the depth check entirely
        else if (maxDepth >= 0) {
            if (containsRedir) {
                return url;
            }
            return filter(depth, maxDepth, url);
        }
        return url;
    }

    private String filter(int depth, int max, String url) {
        // deactivate the outlink no matter what the depth is
        if (max == 0) {
            return null;
        }
        if (depth >= max) {
            return null;
        }
        return url;
    }
    

    private int getDepth(Metadata sourceMetadata, String key) {
        if (sourceMetadata == null) {
            return -1;
        }
        String depth = sourceMetadata.getFirstValue(key);
        if (StringUtils.isNumeric(depth)) {
            return Integer.parseInt(depth);
        } else {
            return -1;
        }
    }
    
    private boolean containsRedirect(Metadata sourceMetadata, String key) {
        if (sourceMetadata == null) {
            return false;
        }
        // any non-blank value for the key counts as a redirect marker
        return StringUtils.isNotBlank(sourceMetadata.getFirstValue(key));
    }
}

Does it work correctly, or can it get stuck in an infinite loop?
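On the loop question: in the filter above, when no custom max is set, a URL whose source metadata carries `_redirTo` is returned before the depth check, so the depth limit no longer bounds a redirect chain. Whether a real crawl would actually loop also depends on how already-seen URLs are handled elsewhere, which this thread does not cover. One possible safeguard is to cap the number of consecutive redirect hops. The following is only a sketch under assumptions: `RedirectHopGuard`, the `redirect.hops` key, and the plain `Map` used for metadata are hypothetical and not part of StormCrawler.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone sketch; not a StormCrawler class. */
public class RedirectHopGuard {

    /** Hypothetical metadata key holding the number of hops so far. */
    public static final String HOPS_KEY = "redirect.hops";

    /**
     * Returns the metadata to attach to the redirect target, with the hop
     * counter incremented, or null once maxHops would be exceeded.
     */
    public static Map<String, String> follow(
            Map<String, String> sourceMetadata, int maxHops) {
        int hops = parseHops(sourceMetadata.get(HOPS_KEY)) + 1;
        if (hops > maxHops) {
            return null; // chain too long: stop following
        }
        Map<String, String> next = new HashMap<>(sourceMetadata);
        next.put(HOPS_KEY, Integer.toString(hops));
        return next;
    }

    /** Missing, malformed or negative counters are treated as zero hops. */
    static int parseHops(String value) {
        if (value == null) {
            return 0;
        }
        try {
            return Math.max(0, Integer.parseInt(value.trim()));
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

A filter could call `follow` for redirect targets only and return null for the URL when it gets null back, which keeps redirects exempt from the normal depth limit while still bounding a chain such as A -> B -> A.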

Answer 1
1, accepted, 2021-02-09 07:48:14

Answer 2
0, 2021-02-14 11:12:47
