python html extract tags

How would it be possible to do the following:

  1. Scan through an HTML page (preferably through a whole domain, such as www.python.org) and extract all

h1 h2 ... hn tags

and write all headings to a file, in the correct order:

Start with h1, then h2,

until we reach the next h1.

Use BeautifulSoup or PyQuery.
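As a rough illustration of the BeautifulSoup route, the sketch below pulls the heading tags out of one HTML string in document order and writes them to a file. It assumes the `bs4` package is installed and that only h1 through h6 matter; the function names `extract_headings` and `write_headings` and the indented output format are made up for this example, not part of any library.

```python
import re

from bs4 import BeautifulSoup

# Matches the tag names h1 .. h6 (HTML defines no deeper heading levels).
HEADING_RE = re.compile(r"^h[1-6]$")


def extract_headings(html):
    """Return (level, text) pairs for every heading, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    headings = []
    # find_all walks the tree in document order, so an h2 that follows
    # an h1 in the page also follows it in the result.
    for tag in soup.find_all(HEADING_RE):
        level = int(tag.name[1])
        text = tag.get_text(" ", strip=True)
        headings.append((level, text))
    return headings


def write_headings(headings, path):
    """Write one heading per line, indented by its level (hypothetical format)."""
    with open(path, "w", encoding="utf-8") as f:
        for level, text in headings:
            f.write("  " * (level - 1) + f"h{level}: {text}\n")
```

Calling `write_headings(extract_headings(page_html), "headings.txt")` for each downloaded page would give one outline per page; merging several pages into one file is left open here.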

Given the requirement to scan a whole website, you might want to look into pycurl to grab the files to scrape. Be careful not to hit the site with the equivalent of a DoS attack though.
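One way to stay gentle on the server is to crawl a single host breadth-first with a pause between requests and a cap on the number of pages. The sketch below is only an illustration of that idea: it swaps in the standard library's `urllib` instead of pycurl, and the names `crawl`, `fetch_url`, `LinkCollector`, `delay` and `max_pages` are hypothetical. It does not check robots.txt, which a real crawler probably should.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_url(url):
    """Default fetcher (urllib, not pycurl): download a page as text."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch_url, delay=1.0, max_pages=50):
    """Breadth-first crawl limited to start_url's host; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    attempts = 0
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if attempts and delay:
            time.sleep(delay)  # pause before every request after the first
        attempts += 1
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `fetch` parameter exists so that a different downloader (a pycurl-based one, for instance) can be passed in without changing the crawl loop.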

