
How do I prevent Selenium from downloading certain “Sources” from a web-page?

I am using Selenium for web scraping, and I need to limit data consumption by preventing specific file types or filenames from being downloaded. I would like to block them with regex-style filters, such as:

  • *.MP4
  • *.css
  • *ads.google.com*

So far I have not found a solution. I would prefer a JavaScript one, if possible.

I found that this can be done by using a Chrome extension as middleware.

Specifically, in the extension's background script, you can use chrome.webRequest.onBeforeRequest to inspect each request and cancel the ones you do not want:

// Manifest V2 background script. The blocking listener needs the "webRequest"
// and "webRequestBlocking" permissions, plus host permissions for the matched URLs.
// Substrings are written in lowercase because they are compared against the
// lowercased URL (a mixed-case entry such as 'allHeader.js?' would never match).
const BLOCKED_SUBSTRINGS = [
    '.css', '.gif', '.png', '.jpg', '.jpeg', '.webm', '.webp', '.mp4', '.svg',
    'allheadernonblocking.js', 'allheader.js?', '/analytics.js',
    'googletagmanager', 'calleo-livechat'
];

chrome.webRequest.onBeforeRequest.addListener(
    function(info) {
        const url = info.url.toLowerCase();
        // Cancel the request if its URL contains any blocked substring.
        return {cancel: BLOCKED_SUBSTRINGS.some(s => url.includes(s))};
    },
    {urls: ["<all_urls>"]},
    ["blocking"]
);
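
The substring checks above can also be driven by wildcard patterns like the ones listed in the question. The following is only a sketch under stated assumptions: `globToRegExp`, `BLOCKED_PATTERNS`, and `shouldBlock` are hypothetical names that do not come from the original answer, and `*` is assumed to mean "any run of characters". The listener registration is guarded so that the matching logic can also be run outside an extension.

```javascript
// Hypothetical helper: turn a wildcard pattern such as "*.MP4" into a
// case-insensitive RegExp in which "*" matches any run of characters.
function globToRegExp(glob) {
    const escaped = glob
        .split('*')
        // Escape regex metacharacters in the literal parts of the pattern.
        .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*');
    return new RegExp('^' + escaped + '$', 'i');
}

// The three patterns from the question.
const BLOCKED_PATTERNS = ['*.MP4', '*.css', '*ads.google.com*'].map(globToRegExp);

function shouldBlock(url) {
    // Also test the URL without its query string or fragment, so that
    // "style.css?v=3" is still caught by "*.css".
    const bare = url.split(/[?#]/)[0];
    return BLOCKED_PATTERNS.some(re => re.test(url) || re.test(bare));
}

// Register the listener only when running inside an extension background script.
if (typeof chrome !== 'undefined' && chrome.webRequest) {
    chrome.webRequest.onBeforeRequest.addListener(
        info => ({cancel: shouldBlock(info.url)}),
        {urls: ['<all_urls>']},
        ['blocking']
    );
}
```

With these patterns, `shouldBlock('https://example.com/video.mp4')` and `shouldBlock('https://ads.google.com/pixel')` return `true`, while `shouldBlock('https://example.com/index.html')` returns `false`. Keeping the patterns in one array makes the filter easier to extend than a long chain of `includes` calls.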

