
Scrapy / crawling - detecting spider traps or infinite websites

Having read "Why Johnny Can't Pentest: An Analysis of Black-box Web Vulnerability Scanners", I understand that there are websites, such as calendar applications, that crawlers have difficulty dealing with. They are seemingly "infinite" websites that may simply contain links to the next day/month/year, etc.

Also, some websites set up spider traps or may inadvertently create a similar system (where the page links are never-ending).

If I a) have the permission of the site owner to crawl freely through their website and b) wish to use Scrapy, what sort of technique can I use to determine whether I have indeed encountered an "infinite" website, independent of any specific example?

Note: I'm not talking about "infinite" scrolling, but rather when there are endless pages.

An example of an infinite website (though pointless and trivial) could be:

<?php
if(isset($_GET['count'])){
    $count = intval($_GET['count']);
    $previous = $count - 1;
    $next = $count + 1;
    ?>
    <a href="?count=<?php echo $previous;?>">< Previous</a>

    Current: <?php echo $count;?>

    <a href="?count=<?php echo $next;?>">Next ></a>
    <?php
}

?>

where you just keep clicking Next and Previous to reveal more pages.

Even when pagination is endless, content usually is not. So, when the issue is endless pagination, you can prevent endless looping by fetching the next page only if the current page has content or, to be optimal, only when the current page contains the known number of items per page.
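A minimal sketch of that second check in a Scrapy spider (the ITEMS_PER_PAGE value, the start URL, and the CSS selectors are illustrative assumptions, not taken from any real site):

import scrapy

ITEMS_PER_PAGE = 20  # hypothetical: the known number of items on a full page

class PaginatedSpider(scrapy.Spider):
    name = "paginated"
    start_urls = ["https://example.com/list?page=1"]  # hypothetical URL

    def parse(self, response):
        items = response.css("li.item")  # hypothetical item selector
        for item in items:
            yield {"title": item.css("::text").get()}

        # Follow the next page only when the current page is full; a short or
        # empty page means the real content has run out, even if "Next" links
        # continue forever.
        if len(items) == ITEMS_PER_PAGE:
            next_url = response.css("a.next::attr(href)").get()  # hypothetical selector
            if next_url:
                yield response.follow(next_url, callback=self.parse)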

In other cases, such as browsing a calendar where some dates may have values while others do not, you can hardcode a limit in your spider (if the date covered by the next URL is X or older, do not parse further).
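A minimal sketch of that date cutoff, for a spider walking backwards through a calendar (the /YYYY/MM/DD URL scheme, the selectors, and the cutoff date X are illustrative assumptions):

from datetime import date, timedelta

import scrapy

CUTOFF = date(2015, 1, 1)  # hypothetical X: dates this old or older are not parsed

class CalendarSpider(scrapy.Spider):
    name = "calendar"
    start_urls = ["https://example.com/calendar/2020/06/15"]  # hypothetical URL

    def parse(self, response):
        for event in response.css("li.event::text").getall():  # hypothetical selector
            yield {"event": event}

        # The URL path is assumed to end in /YYYY/MM/DD.
        year, month, day = map(int, response.url.rstrip("/").split("/")[-3:])
        prev_day = date(year, month, day) - timedelta(days=1)

        # Hardcoded limit: if the date covered by the next URL is X or older,
        # do not parse further.
        if prev_day > CUTOFF:
            prev_url = f"https://example.com/calendar/{prev_day:%Y/%m/%d}"
            yield scrapy.Request(prev_url, callback=self.parse)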

One thing I can think of is to pass all item IDs from the current page to the next page you are scraping, and then check whether the next page has the same items. If it does, pagination has ended and there are no new records:

from scrapy import Request

def parse(self, response):
    # Collect a unique identifier for every item on this page.
    this_page_items = []
    for item in response.css("li .items"):
        this_page_items.append(item.css("any unique thing here").extract_first())

    # If this page holds exactly the same items as the previous page,
    # pagination has ended and there are no new records.
    if "prev_page_items" in response.meta:
        prev_page_items = response.meta["prev_page_items"]
        if sorted(prev_page_items) == sorted(this_page_items):
            return  # terminate next-page calls

    # Go to the next page, passing this page's item IDs along.
    url = response.css("a.next::attr(href)").extract_first()  # hypothetical next-page selector
    if url:
        yield Request(response.urljoin(url), callback=self.parse,
                      meta={"prev_page_items": this_page_items})
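Note that this approach costs one extra request (the repeated page must be fetched before the loop is detected), and it relies on the chosen selector returning a stable, unique value for each item.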
