
Using Stormcrawler for crawling specific subdirectories

I would like to be able to crawl very specific sub-directories of a given website.

For example: on the website www.world.com there may be multiple sub-directories, /world or /bye. These in turn may contain multiple pages, such as /world/new. Let's assume that these pages themselves contain links to other pages which may not be in the same sub-directory (/world/new has a link to /bye/new).

What I would like to accomplish is to crawl the contents of every page under /world/, and only those pages.

Would it be a good idea to ignore any outgoing link unless it also belongs to the same sub-directory? I feel like a lot of the pages would not be reached because they are not linked directly. For example, /world/new/ has a link to /bye/new, which in turn has a link to /world/next. This would cause the crawler to never reach /world/next (if I am understanding it correctly).

The alternative would be to crawl the entire website and then filter out the content based on URL after the crawl, which would make the job itself significantly larger than it needs to be.

Does StormCrawler have any configuration which could be used to make this simpler? Or maybe there is a better approach to this problem?

Thank you.

You've described the two possible approaches in your question. The easiest would be to use the URL filters and restrict the crawl to the area of the site that you are interested in, but as you pointed out, you might miss some content. The alternative is indeed more expensive, as you'd have to crawl the whole site and then filter as part of the indexing step; for this, you could add a simple parse filter to create a key/value in the metadata for URLs which are in the section of interest, and use it as a value of indexer.md.filter.
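As a rough sketch of the first approach: StormCrawler's regex-based URL filter reads an ordered list of +/- patterns where the first match wins, so a pattern file along the following lines should keep the crawl inside /world/. The host name is taken from the example above, and the exact file name depends on how your urlfilters.json is wired up, so treat this as an illustration rather than a drop-in config.

    # keep anything under /world/ on the example host
    +^https?://www\.world\.com/world/
    # reject everything else
    -.

For the second approach, once a parse filter has tagged in-scope pages with a key/value of your choosing (inSection=true is a made-up name for this example), the indexing bolt can be told to index only documents carrying that pair via crawler-conf.yaml:

    indexer.md.filter: "inSection=true"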

Of course, if the site provides sitemaps, you'd know about all the URLs it contains in advance, and in that case you'd be able to rely on the URL filter alone.
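If the site does expose a sitemap, a minimal way to take advantage of it (assuming a fairly standard StormCrawler setup; sitemap.discovery and the isSitemap seed metadata are the usual mechanisms, but worth double-checking against your version) is to enable sitemap discovery in crawler-conf.yaml, or to seed the sitemap URL directly:

    # crawler-conf.yaml: detect and process sitemap files automatically
    sitemap.discovery: true

    # seeds file: point the crawler at the sitemap explicitly (metadata is tab-separated)
    https://www.world.com/sitemap.xml	isSitemap=true

Note that if you combine this with a restrictive URL filter like the one sketched above, the sitemap URL itself needs a matching + rule so that it is not filtered out before it can be fetched.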
