How to save "complete webpage" not just basic html using Python
I am using the following code to save a webpage using Python:
import urllib
import sys
from bs4 import BeautifulSoup
url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
f = urllib.urlretrieve(url,'test.html')
Problem: This code saves the page as basic html, without javascript, images etc. I want to save the webpage as complete (like the "save as" option we have in a browser).
Update: I am now using the following code to save all the js/images/css files of the webpage so that it can be saved as a complete webpage, but my output html is still getting saved as basic html:
import pycurl
import StringIO
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
html = b.getvalue()
#print html
fh = open("file.html", "w")
fh.write(html)
fh.close()
Try emulating your browser with selenium. This script will pop up the "save as" dialog for the webpage. You will still have to figure out how to emulate pressing enter for the download to start, as the file dialog is out of selenium's reach (how you do it is also OS dependent).
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
br = webdriver.Firefox()
br.get('http://www.google.com/')
save_me = ActionChains(br).key_down(Keys.CONTROL)\
.key_down('s').key_up(Keys.CONTROL).key_up('s')
save_me.perform()
Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think using selenium is a good starting point, as br.page_source will get you the entire DOM along with the dynamic content generated by javascript.
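As a minimal sketch of the br.page_source idea: the helper below writes the rendered DOM to disk. The name save_rendered_html and the output path are assumptions made for illustration, not part of the answer above, and the helper saves only the HTML, not the images, CSS or scripts it links to.

```python
from pathlib import Path


def save_rendered_html(driver, path):
    """Write driver.page_source (the DOM after JavaScript ran) to `path`.

    `driver` is any object exposing a `page_source` string, such as a
    selenium WebDriver. Hypothetical helper, not from the original answer.
    """
    target = Path(path)
    target.write_text(driver.page_source, encoding="utf-8")
    return target


# Usage with selenium (assumes selenium and a Firefox driver are installed):
#   from selenium import webdriver
#   br = webdriver.Firefox()
#   br.get('http://www.google.com/')
#   save_rendered_html(br, 'rendered.html')
#   br.quit()
```

The saved file will still point at the remote resources, so it is a starting point for the "grab the linked resources" approach rather than a replacement for it.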
You can easily do that with the simple python library pywebcopy.
For current version: 5.0.1
from pywebcopy import save_webpage
url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'
kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}
save_webpage(url, download_folder, **kwargs)
You will have the html, css and js all in your download_folder, working completely like the original site.
To get the script above by @rajatomar788 to run, I had to install all of the following packages first:
pip install pywebcopy
pip install pyquery
pip install w3lib
pip install parse
pip install lxml
After that it worked with a few errors, but I did get the folder filled with the files that make up the webpage.
webpage - INFO - Starting save_assets Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
webpage - Level 100 - Queueing download of <89> asset files.
Exception in thread <Element(LinkTag, file:///++resource++images/favicon2.ico)>:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 312, in run
super(LinkTag, self).run()
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 58, in run
self.download_file()
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 107, in download_file
req = SESSION.get(url, stream=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\configs.py", line 244, in get
return super(AccessAwareSession, self).get(url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 640, in send
adapter = self.get_adapter(url=request.url)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 731, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///++resource++images/favicon2.ico'
webpage - INFO - Starting save_html Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
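The traceback above comes from requests being asked to fetch a file:// URL, for which it has no connection adapter. A minimal sketch of how such asset URLs could be filtered out before downloading is shown below; is_downloadable is a hypothetical helper and not part of pywebcopy.

```python
from urllib.parse import urlparse


def is_downloadable(url):
    """Return True only for URLs that requests can fetch over HTTP(S)."""
    return urlparse(url).scheme in ("http", "https")


# The two URLs are taken from the log output above.
assets = [
    "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html",
    "file:///++resource++images/favicon2.ico",
]
fetchable = [u for u in assets if is_downloadable(u)]
print(fetchable)
```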
Use saveFullHtmlPage below or adapt it. It will save a modified *.html and save javascripts, css and images based on the tags script, link and img (tags_inner dict keys) to a folder _files.
import os, sys, re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def saveFullHtmlPage(url, pagepath='page', session=requests.Session(), html=None):
    """Save web page html and supported contents
        * pagepath : path-to-page
        It will create a file `'path-to-page'.html` and a folder `'path-to-page'_files`
    """
    def savenRename(soup, pagefolder, session, url, tag, inner):
        if not os.path.exists(pagefolder):  # create only once
            os.mkdir(pagefolder)
        for res in soup.findAll(tag):  # images, css, etc..
            if res.has_attr(inner):  # the attribute holding the file reference MUST exist
                try:
                    filename, ext = os.path.splitext(os.path.basename(res[inner]))  # get name and extension
                    filename = re.sub(r'\W+', '', filename) + ext  # clean special chars from name
                    fileurl = urljoin(url, res.get(inner))
                    filepath = os.path.join(pagefolder, filename)
                    # rename html ref so can move html and folder of files anywhere
                    res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                    if not os.path.isfile(filepath):  # was not downloaded
                        with open(filepath, 'wb') as file:
                            filebin = session.get(fileurl)
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)

    if not html:
        html = session.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    path, _ = os.path.splitext(pagepath)
    pagefolder = path+'_files'  # page contents folder
    tags_inner = {'img': 'src', 'link': 'href', 'script': 'src'}  # tags and the attributes to grab
    for tag, inner in tags_inner.items():  # saves resource files and rename refs
        savenRename(soup, pagefolder, session, url, tag, inner)
    with open(path+'.html', 'wb') as file:  # saves modified html doc
        file.write(soup.prettify('utf-8'))
Example saving google.com as google.html and contents on google_files folder (current folder).
saveFullHtmlPage('https://www.google.com', 'google')
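To make the renaming step inside savenRename easier to follow, here is a small sketch that isolates how a resource reference is turned into a download URL and a local file name. local_name_for is a hypothetical helper that mirrors those lines, and the example.com URLs are made up for illustration.

```python
import os
import re
from urllib.parse import urljoin


def local_name_for(page_url, ref):
    """Return (absolute download URL, cleaned local file name) for a reference."""
    filename, ext = os.path.splitext(os.path.basename(ref))
    filename = re.sub(r"\W+", "", filename) + ext  # drop non-word characters from the name
    return urljoin(page_url, ref), filename


print(local_name_for("https://example.com/docs/index.html", "../img/my-logo.png"))
print(local_name_for("https://example.com/", "style.css?v=3"))
```

Note that a query string such as ?v=3 ends up in the extension part, which is kept unchanged, so the local file name can still contain characters that some file systems reject.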