
How to save "complete webpage" not just basic html using Python

I am using the following code to save a webpage with Python:

import urllib
import sys
from bs4 import BeautifulSoup

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
f = urllib.urlretrieve(url,'test.html')

Problem: this code saves the HTML as basic HTML only, without the JavaScript, images, etc. I want to save the webpage as complete, like the "save complete webpage" option in a browser.

Update: I am now using the following code to save all the js/images/css files of the webpage so that it can be saved as a complete webpage, but my output HTML is still being saved as basic HTML:

import pycurl
import StringIO

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")

b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
html = b.getvalue()
#print html
fh = open("file.html", "w")
fh.write(html)
fh.close()

Try emulating your browser with selenium. This script will pop up the save as dialog for the webpage. You will still have to figure out how to emulate pressing Enter for the download to start, as the file dialog is out of selenium's reach (how you do it is also OS-dependent).

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Firefox()
br.get('http://www.google.com/')

save_me = ActionChains(br).key_down(Keys.CONTROL)\
         .key_down('s').key_up(Keys.CONTROL).key_up('s')
save_me.perform()

Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think using selenium is a good starting point, as br.page_source will get you the entire DOM along with the dynamic content generated by JavaScript.
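(A sketch of that resource-grabbing idea, using only the stdlib `html.parser`: feed it `br.page_source` after the page has rendered and it collects the absolute URLs of every script, stylesheet and image. The example HTML string below is a stand-in, not real page source:)

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceCollector(HTMLParser):
    """Collects absolute URLs of scripts, stylesheets and images."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('script', 'img') and 'src' in attrs:
            self.resources.append(urljoin(self.base_url, attrs['src']))
        elif tag == 'link' and 'href' in attrs:
            self.resources.append(urljoin(self.base_url, attrs['href']))

# e.g. collector.feed(br.page_source) in the selenium script above;
# here a static stand-in keeps the sketch runnable:
html = '<img src="/logo.png"><script src="app.js"></script>'
collector = ResourceCollector('http://example.com/page.html')
collector.feed(html)
```

Each collected URL could then be downloaded and the reference in the HTML rewritten to point at the local copy.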

You can do that easily with the simple Python library pywebcopy.

For the current version (5.0.1):

from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'    

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

save_webpage(url, download_folder, **kwargs)

You will have the html, css and js all in your download_folder, working completely like the original site.

To get the script above by @rajatomar788 to run, I had to do all of the following first.

To run pywebcopy you will need to install the following packages:

pip install pywebcopy 
pip install pyquery
pip install w3lib
pip install parse 
pip install lxml

After that it ran with a few errors, but I did get the folder filled with the files that make up the webpage.

webpage    - INFO     - Starting save_assets Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
webpage    - Level 100 - Queueing download of <89> asset files.
Exception in thread <Element(LinkTag, file:///++resource++images/favicon2.ico)>:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 312, in run
    super(LinkTag, self).run()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 58, in run
    self.download_file()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 107, in download_file
    req = SESSION.get(url, stream=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\configs.py", line 244, in get
    return super(AccessAwareSession, self).get(url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///++resource++images/favicon2.ico'

webpage    - INFO     - Starting save_html Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
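(The InvalidSchema traceback comes from a `file:///` URL in the page being handed to requests, which only has adapters for HTTP(S). A defensive filter along these lines — my own sketch, not part of pywebcopy — would skip such references before attempting a download:)

```python
from urllib.parse import urlparse

def is_downloadable(url):
    """requests only has connection adapters for http/https;
    skip file://, data:, mailto: and other schemes."""
    return urlparse(url).scheme in ('http', 'https')

ok = is_downloadable('http://www.gatsby.ucl.ac.uk/favicon.ico')        # True
bad = is_downloadable('file:///++resource++images/favicon2.ico')       # False
```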

Try saveFullHtmlPage below, or adapt it.

It will save a modified *.html and save the javascripts, css and images, based on the tags script, link and img (the tags_inner dict keys), into a folder _files.

import os, sys, re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def saveFullHtmlPage(url, pagepath='page', session=requests.Session(), html=None):
    """Save web page html and supported contents        
        * pagepath : path-to-page   
        It will create a file  `'path-to-page'.html` and a folder `'path-to-page'_files`
    """
    def savenRename(soup, pagefolder, session, url, tag, inner):
        if not os.path.exists(pagefolder): # create only once
            os.mkdir(pagefolder)
        for res in soup.findAll(tag):   # images, css, etc..
            if res.has_attr(inner): # check inner tag (file object) MUST exists  
                try:
                    filename, ext = os.path.splitext(os.path.basename(res[inner])) # get name and extension
                    filename = re.sub(r'\W+', '', filename) + ext # clean special chars from name
                    fileurl = urljoin(url, res.get(inner))
                    filepath = os.path.join(pagefolder, filename)
                    # rename html ref so can move html and folder of files anywhere
                    res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                    if not os.path.isfile(filepath): # was not downloaded
                        with open(filepath, 'wb') as file:
                            filebin = session.get(fileurl)
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)
    if not html:
        html = session.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    path, _ = os.path.splitext(pagepath)
    pagefolder = path+'_files' # page contents folder
    tags_inner = {'img': 'src', 'link': 'href', 'script': 'src'} # tag&inner tags to grab
    for tag, inner in tags_inner.items(): # saves resource files and rename refs
        savenRename(soup, pagefolder, session, url, tag, inner)
    with open(path+'.html', 'wb') as file: # saves modified html doc
        file.write(soup.prettify('utf-8'))

Example: saving google.com as google.html (in the current folder) with its contents in the google_files folder.

saveFullHtmlPage('https://www.google.com', 'google')
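(The resource-naming step inside savenRename can be exercised on its own. This sketch repeats just that logic — basename, splitext, the `\W` scrub, and urljoin for resolving relative references — outside the function:)

```python
import os
import re
from urllib.parse import urljoin

def local_name(ref):
    """Mirror of savenRename's naming: strip the path, remove any
    non-word characters from the stem, and keep the extension."""
    filename, ext = os.path.splitext(os.path.basename(ref))
    return re.sub(r'\W+', '', filename) + ext

name = local_name('/images/logo-v2.png')                        # 'logov2.png'
full = urljoin('https://www.google.com', '/images/logo.png')    # absolute URL
```

This is why the saved page's references can be relocated freely: each `res[inner]` is rewritten to `google_files/<cleaned-name>`, a path relative to the html file itself.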
