How to save a website made with JavaScript to a file

A little info:

When I inspect the website in Google Chrome, it displays the information I need (namely, a simple link to a .pdf).

When I cURL the website, only part of it gets saved. This, coupled with the fact that the page contains functions and <script> tags, leads me to believe that JavaScript is the culprit (I'm honestly not 100% sure, as I'm pretty new at this).

I need to pull this link periodically, and it changes each time.

The question:

Is there a way for me, in bash, to run this JavaScript and save the new HTML code it generates to a file?

Not trivially.

Typically, for that approach, you need to:

  • Construct a DOM from the HTML
  • Execute the JavaScript in the context of that DOM while resolving URLs relative to the URL you fetched the HTML from
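
To give a feel for the URL-resolution part only, here is a minimal Python sketch. It does not build a real DOM or execute any JavaScript; it just finds external <script> references and resolves them against the URL the HTML was fetched from. The names ScriptSrcCollector and script_urls, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)


def script_urls(html, page_url):
    """Return absolute URLs of the external scripts referenced in html."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    # urljoin resolves each reference relative to the page's own URL
    return [urljoin(page_url, src) for src in collector.sources]
```

For example, calling script_urls(html, "https://example.com/docs/page.html") on a page that references "lib/util.js" would resolve it under https://example.com/docs/. Actually running those scripts against a DOM is the hard part, which is where the tools below come in.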

There are tools which can help with this, such as Puppeteer, PhantomJS, and Selenium, but they generally lend themselves to being driven with beefier programming languages than bash.

As an alternative, you can look at reverse engineering the page. It gets the data from somewhere. You can probably work out the URLs (the Network tab of a browser's developer tools is helpful there) and access them directly.
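
As a rough sketch of that direct-access idea in Python: once the Network tab has shown which URL actually delivers the data, you can fetch it and pull out anything that looks like a .pdf link. The function names, the regular expression, and the assumption that the link appears as a quoted string ending in .pdf are all guesses for illustration; the real response may be shaped differently.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Assumption: the link shows up as a quoted string ending in ".pdf".
PDF_LINK = re.compile(r"""["']([^"']+\.pdf)["']""", re.IGNORECASE)


def fetch_text(url, timeout=30):
    """Download url and return the body decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def find_pdf_links(body, base_url):
    """Return absolute, de-duplicated .pdf URLs found in body, in order."""
    links = []
    for match in PDF_LINK.findall(body):
        url = urljoin(base_url, match)
        if url not in links:
            links.append(url)
    return links
```

Usage would be something like find_pdf_links(fetch_text(data_url), data_url), where data_url is whatever endpoint you found in the developer tools. Since the link changes each time, running this periodically (for instance from cron) covers the "pull it periodically" requirement without a browser.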

If you want to download a web page that generates itself with JavaScript, you'll need to execute this JavaScript in order to load the page. To achieve this, you can use a library that does it for you, such as Puppeteer with Node.js. There are many other libraries, but that's one of the most popular.
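
As a lighter-weight alternative to a full Puppeteer script (this is a different technique, not Puppeteer itself), headless Chrome can print the DOM after the page's scripts have run via its --dump-dom flag. Below is a minimal Python sketch that wraps that command. The binary name "google-chrome" is an assumption (on your system it may be chromium, chromium-browser, or a full path), and the function names are made up for illustration.

```python
import subprocess


def dump_dom_command(url, chrome_binary="google-chrome"):
    """Build the headless Chrome command that prints the rendered DOM."""
    # chrome_binary is an assumption; adjust it to your installation.
    return [chrome_binary, "--headless", "--dump-dom", url]


def save_rendered_html(url, output_path, chrome_binary="google-chrome", timeout=60):
    """Run headless Chrome on url and write the rendered HTML to output_path."""
    result = subprocess.run(
        dump_dom_command(url, chrome_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if Chrome exits with an error
    )
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return len(result.stdout)
```

The same command also works straight from bash by redirecting its output to a file. One caveat: content that the page loads some time after the initial load may not be in the dump, in which case a scripted tool that can wait for a specific element is the safer choice.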

If you're wondering why this happens, it's because web developers often use frameworks such as React, Vue, or Angular (to name the most popular ones). With these, the server mostly sends JavaScript that builds the page in the browser, and common HTTP request libraries do not execute that JavaScript.
