
How to save a website made with JavaScript to a file

A little info:

When 'inspected' (Google Chrome), the website displays the information I need (namely, a simple link to a .pdf).

When I cURL the website, only part of it gets saved. This, coupled with the fact that there are functions and <script> tags, leads me to believe that JavaScript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).

I need to pull this link periodically, and it changes each time.

The question:

Is there a way for me, in bash, to run this JavaScript and save the new HTML code it generates to a file?

Not trivially.

Typically, for that approach, you need to:

  • Construct a DOM from the HTML
  • Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from

There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.
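
For illustration, here is a minimal Puppeteer sketch in Node.js that loads a page, lets its scripts run, saves the rendered HTML, and pulls out the first PDF link. The URL and the a[href$=".pdf"] selector are placeholders, not details from the question, so swap in whatever your page actually uses:

    // Minimal sketch: fetch a page, let its JavaScript run, then save the
    // rendered HTML and grab the first PDF link. URL and selector are placeholders.
    const fs = require("fs/promises");
    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // "networkidle0" waits until the page's scripts have stopped making requests.
      await page.goto("https://example.com/page-with-the-link", { waitUntil: "networkidle0" });

      // This is the DOM as it looks in Chrome's "Inspect", not the raw cURL output.
      await fs.writeFile("rendered.html", await page.content());

      // Optionally pull the link straight out of the live DOM.
      const pdfLink = await page.$eval('a[href$=".pdf"]', (a) => a.href);
      console.log(pdfLink);

      await browser.close();
    })();

You can still handle the periodic part from bash, for example by running the script with node from a cron job and capturing its output.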

As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
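
If the Network tab shows the page fetching its data from an API, you can skip the browser entirely. The endpoint and field name below are purely hypothetical; substitute the actual request and response shape you see:

    // Hypothetical example: the page is assumed to load its data from a JSON
    // endpoint discovered in the Network tab. Neither the URL nor the "pdfUrl"
    // field is real; copy the actual request the page makes.
    const https = require("https");

    https.get("https://example.com/api/latest-document", (res) => {
      let body = "";
      res.on("data", (chunk) => (body += chunk));
      res.on("end", () => {
        const data = JSON.parse(body);
        console.log(data.pdfUrl);
      });
    });

From bash, the same request is usually just a cURL call against that endpoint, which is why this route tends to be much lighter than running a full browser.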

If you want to download a web page that generates its content with JavaScript, you'll need to execute that JavaScript in order to render the page. To achieve this you can use a library such as Puppeteer with NodeJS. There are plenty of other libraries, but that's the most popular.

If you're wondering why this happens, it's because web developers often use frameworks like React, Vue or Angular (to name the most popular ones), which ship the page as a JavaScript bundle that common HTTP clients such as cURL download but never execute.
