
WGET - Download specific files (by extension or mime-type) from third party websites

I need to download ALL files with a ".js" extension from a website using wget, including those served from third-party domains, but it does not always work.

I use the following command:

wget -H -p -A "*.js" -e robots=off --no-check-certificate https://www.quantcast.com

For example, if I run wget against "https://www.stackoverflow.com", I want to get all "*.js" files from stackoverflow.com and also those from third-party sites such as "scorecardresearch.com", "secure.quantserve.com" and others.

Is something missing in my code?

Thanks in advance!

Wget with the -p flag only downloads simple page requisites that are referenced directly in the HTML, such as scripts with a src, links with an href, or images with a src.
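To make that concrete, here is a rough Python sketch of what a static fetcher can see. It is not how Wget is implemented; it only parses the raw HTML for script tags that carry a src attribute. The sample page, the example.* hosts, and the names StaticScriptCollector and static_script_urls are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StaticScriptCollector(HTMLParser):
    """Collects the URLs of <script src="..."> tags present in the raw HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.script_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative URLs against the page URL.
            self.script_urls.append(urljoin(self.base_url, src))


# Hypothetical page: two scripts referenced statically, one injected at runtime.
SAMPLE_HTML = """
<html><head>
<script src="/static/app.js"></script>
<script src="https://cdn.example.net/lib.js"></script>
<script>
  var s = document.createElement('script');
  s.src = 'https://tracker.example.org/tag.js';
  document.head.appendChild(s);
</script>
</head><body></body></html>
"""


def static_script_urls(html, base_url):
    """Return the script URLs a purely static parse of the HTML would find."""
    collector = StaticScriptCollector(base_url)
    collector.feed(html)
    return collector.script_urls


if __name__ == "__main__":
    for url in static_script_urls(SAMPLE_HTML, "https://www.example.com/"):
        print(url)
```

The static parse finds app.js and lib.js, but never tag.js, because that URL only exists as a string inside inline JavaScript that nothing has executed.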

Third-party scripts are often loaded dynamically by script snippets (such as Google Tag Manager, https://developers.google.com/tag-manager/quickstart). These dynamically loaded scripts will not be downloaded by Wget, because Wget does not execute JavaScript and the scripts are only requested once that JavaScript runs. To get absolutely everything, you would likely need something like Puppeteer or Selenium to load the page in a real browser and scrape the contents.
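As a starting point, a Selenium version could look something like the sketch below. It assumes the selenium Python package and a local Chrome install, and it asks the browser's Resource Timing API which resources were actually fetched. The function names, the fixed 5-second wait, and the ".js" suffix filter are my own choices for illustration, not a tested recipe for any particular site.

```python
from urllib.parse import urlparse


def filter_js_urls(resource_urls):
    """Keep URLs whose path ends in .js (query strings ignored), without duplicates."""
    kept = []
    for url in resource_urls:
        if urlparse(url).path.lower().endswith(".js") and url not in kept:
            kept.append(url)
    return kept


def loaded_js_urls(page_url, wait_seconds=5):
    """Load page_url in headless Chrome and return the .js URLs the browser fetched."""
    # Imported here so filter_js_urls stays usable without Selenium installed.
    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(page_url)
        # Crude fixed wait so late-loading tags get a chance to fire (assumption).
        time.sleep(wait_seconds)
        names = driver.execute_script(
            "return performance.getEntriesByType('resource').map(e => e.name);"
        )
    finally:
        driver.quit()
    return filter_js_urls(names)
```

Calling loaded_js_urls("https://www.quantcast.com") should give you a list of script URLs, first-party and third-party, which you could write to a file and hand to wget -i. Scripts served from URLs that do not end in .js would be missed by this filter, and the browser's resource timing buffer is limited in size, so very heavy pages may need a different approach.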


 