
How to run scripts from an HTML file using Python?

I am building a scraper in Python and want to run the scripts embedded in a page's HTML, the way a web browser does when it loads a website. Is it possible to run those scripts against the HTML from Python?

The goal is to get the same HTML that the web browser's 'DOM Inspector' shows, not the code you can simply download with

requests.get("https://example.com")

or see in 'View Source' mode.

Let's say I get this code:

<!DOCTYPE html>
<html>
  <body>
    <h1>The script element</h1>
    <p id="demo"></p>

    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
  </body>
</html>

If the script runs, it changes the content of the <p> element, and the result is:

<!DOCTYPE html>
<html>
  <body>
    <h1>The script element</h1>
    <p id="demo">Hello JavaScript!</p>
  </body>
</html>
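To make the gap concrete, here is a minimal sketch using only the standard library's html.parser. The class name DemoInspector and the PAGE constant are made up for illustration. It shows that a static parser sees the script only as text, so the paragraph stays empty:

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
  <body>
    <h1>The script element</h1>
    <p id="demo"></p>

    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
  </body>
</html>"""


class DemoInspector(HTMLParser):
    """Collects the text inside <p id="demo"> and the source of each <script>."""

    def __init__(self):
        super().__init__()
        self._current = None  # which element's text is being collected
        self.demo_text = ""
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("id") == "demo":
            self._current = "demo"
        elif tag == "script":
            self._current = "script"
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag in ("p", "script"):
            self._current = None

    def handle_data(self, data):
        if self._current == "demo":
            self.demo_text += data
        elif self._current == "script":
            self.scripts[-1] += data


inspector = DemoInspector()
inspector.feed(PAGE)
inspector.close()

print(repr(inspector.demo_text))  # the paragraph is still empty: ''
print(len(inspector.scripts))     # the script is present, but only as text: 1
```

Nothing here executes JavaScript. Getting "Hello JavaScript!" into the paragraph requires something that actually runs the script against a DOM.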

So how can I run the scripts and get the HTML as it looks after every script on the page has been evaluated?

EDIT 1: I know the "Selenium" library, but I am trying to solve my problem without browser simulators, using just Python and JavaScript...

Thanks in advance!

You should use Brython; only then can you combine HTML and Python directly. Otherwise you have to learn a framework like Django. Please let me know if this was helpful.

You can use the "selenium" library to load the page; it also lets you click elements on the site and otherwise interact with it. To then scrape the page, you might want to use the "beautifulsoup4" library. Install both from the command line using

pip install selenium

and

pip install beautifulsoup4

Maybe you can get the response and write it to an HTML file.
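A minimal sketch of that idea, assuming the response body is already available as a string. The helper name save_html and the file name page.html are made up for illustration:

```python
import tempfile
from pathlib import Path


def save_html(html_text, path):
    """Write HTML text to `path` as UTF-8 and return the resolved path."""
    target = Path(path)
    target.write_text(html_text, encoding="utf-8")
    return target.resolve()


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_html('<p id="demo"></p>', Path(tmp) / "page.html")
    round_trip = saved.read_text(encoding="utf-8")
```

With requests, the call would look like save_html(requests.get("https://example.com").text, "page.html"). The saved file is still the unevaluated source: its scripts only run once something such as a browser opens the file and executes them.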

You are looking for Selenium, a cross-platform testing framework.

You can launch pages through a webdriver in browsers such as Firefox or Chrome, and interact with the page programmatically.

Docs
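As a rough sketch of the webdriver approach, assuming Selenium 4 and a local Chrome installation: the helper name fetch_rendered_html and its driver parameter are made up for illustration and are not part of Selenium. The Selenium calls used are webdriver.ChromeOptions, webdriver.Chrome, driver.get, driver.page_source and driver.quit.

```python
def fetch_rendered_html(url, driver=None):
    """Load `url` in a browser and return the page source after loading.

    If no driver is supplied, a headless Chrome session is started
    (assumption: Selenium 4 and Chrome are installed) and closed afterwards.
    """
    owns_driver = driver is None
    if owns_driver:
        from selenium import webdriver  # imported lazily: only needed here

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # run without opening a window
        driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)            # the browser loads the page and runs its scripts
        return driver.page_source  # serialized page as the browser currently sees it
    finally:
        if owns_driver:
            driver.quit()
```

Calling fetch_rendered_html("https://example.com") should return markup much closer to what the DOM Inspector shows than requests.get does. Pages that fill in content asynchronously may need an explicit wait before page_source is read.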

This is partially answered by the question on removing script and style tags from HTML with lxml.

It is unclear what exactly you need. From the example, you are just looking to remove <script... /script> elements...

From "The purpose is to get the same HTML code as in 'DOM Inspector' in web browser", it seems you want to parse the HTML using Python, i.e. parse, display and show details of all the elements.

I personally prefer low-level tools, and lxml seems to do the trick for HTML...

Sample code:

from lxml import etree
from lxml import html
# Note: since lxml 5.2 this module is provided by the separate lxml_html_clean package
from lxml.html.clean import Cleaner

h = '''<!DOCTYPE html>
<html>
  <body>
    <h1>The script element</h1>
    <p id="demo"></p>

    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
  </body>
</html>'''

# Using etree (XML parser; this works because the snippet is well-formed XML)
root = etree.fromstring(h)
print(etree.tostring(root).decode())

# Code came from previous reference link above: https://stackoverflow.com/questions/8554035/remove-all-javascript-tags-and-style-tags-from-html-with-python-and-the-lxml-mod
cleaner = Cleaner()
cleaner.javascript = True # This is True because we want to activate the javascript filter
cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

# using html parser
print("\nWITH JAVASCRIPT & STYLES")
print(html.tostring(html.fromstring(h)).decode())
print("\nWITHOUT JAVASCRIPT & STYLES")
print(html.tostring(cleaner.clean_html(html.fromstring(h))).decode())

Output:

<html>
  <body>
    <h1>The script element</h1>
    <p id="demo"/>

    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
  </body>
</html>

WITH JAVASCRIPT & STYLES
<html>
  <body>
    <h1>The script element</h1>
    <p id="demo"></p>

    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
  </body>
</html>

WITHOUT JAVASCRIPT & STYLES
<div>
  <body>
    <h1>The script element</h1>
    <p id="demo"></p>


  </body>
</div>
