Save complete web page (incl css, images) using python/selenium

I am using Python/Selenium to submit genetic sequences to an online database, and I want to save the full page of results I get back. Below is the code that gets me to the results I want:

import time

from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)

At that point I have a page on which I can manually click "save as" and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content that is generated dynamically when scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code below saves only the html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.

content = driver.page_source
with open('webpage.html', 'w') as f:
    f.write(content)

I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box and does not provide a way to click it (as two commenters point out).

Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium, since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify in code all of the images/tables I want to download, and there is no shortcut to emulating the right-click 'save as' functionality?

UPDATE - Follow-up question for James' answer

I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on the right).

This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which is the same in both files, yet the text at issue only appears when the browser renders the page saved by James' script:

<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST. 
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>

Any idea why this is happening?

As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so you could instead use an external automation library like pyautogui.

pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')

This code opens the Save as... window through its keyboard shortcut CTRL+S, types a file name, and then presses Enter to save the webpage and its assets into the default downloads location. It names the file after the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could also change the download location with some extra work using the tab and arrow keys.

Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
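Since the key combination depends on the OS, one option is to pick it at runtime. The sketch below is only an illustration: the helper name `save_hotkey` is hypothetical, and it assumes that macOS browsers use Command+S while Linux and Windows use Ctrl+S. The `pyautogui` call is left as a comment because it needs a desktop session.

```python
import sys


def save_hotkey(platform=sys.platform):
    """Return the key names for the browser's 'Save as...' shortcut.

    Assumption: macOS uses Command+S; every other platform uses Ctrl+S.
    """
    modifier = 'command' if platform == 'darwin' else 'ctrl'
    return (modifier, 's')


# Usage with pyautogui (requires a display, so not executed here):
# import pyautogui
# pyautogui.hotkey(*save_hotkey())
```

With this in place, the `pyautogui.hotkey('ctrl', 's')` line would become `pyautogui.hotkey(*save_hotkey())`.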


Full code, in which I also added conditional waits to improve speed:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search
# find_element(By.ID, ...) also works on Selenium 4, where find_element_by_id was removed
seq_query_field = driver.find_element(By.ID, "seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element(By.ID, "b1")
blast_button.click()

# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')
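The script above uses the whole sequence as the file name, which gets long. If a shorter name is preferable, a hash of the sequence still gives each query its own file. This is a minimal sketch, not part of the original answer: the helper `short_filename`, the `blast_` prefix, and the 12-character digest length are all arbitrary choices made for illustration.

```python
import hashlib


def short_filename(sequence, prefix='blast_', length=12):
    """Build a short, repeatable file name from a sequence.

    Hypothetical helper: hashes the sequence with SHA-1 and keeps the
    first `length` hex characters of the digest.
    """
    digest = hashlib.sha1(sequence.encode('ascii')).hexdigest()
    return f'{prefix}{digest[:length]}.html'
```

The call `pyautogui.typewrite(SEQUENCE + '.html')` would then become `pyautogui.typewrite(short_filename(SEQUENCE))`.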

This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to the same relative paths.

Most of the javascript won't work due to cross-origin request blocking, but the content will look (mostly) the same.

This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.

from selenium import webdriver
import chromedriver_binary
from lxml import html
import requests
import os

driver = webdriver.Chrome()
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' 
base = 'https://blast.ncbi.nlm.nih.gov/'

driver.get(URL)
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()

content = driver.page_source
# write the page content
os.makedirs('page', exist_ok=True)  # do not fail if the folder already exists
with open('page/page.html', 'w') as fp:
    fp.write(content)

# download the referenced files to the same path as in the html
sess = requests.Session()
sess.get(base)            # sets cookies

# parse html
h = html.fromstring(content)
# get css/js files loaded in the head.  skip anything loaded from outside sources
for hr in h.xpath('head//@href'):
    if not hr or hr.startswith(('http', '//', 'data:')):
        continue
    local_path = 'page/' + hr.lstrip('/')
    res = sess.get(base + hr.lstrip('/'))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, 'wb') as fp:
        fp.write(res.content)

# get image/js files from the body.  skip anything loaded from outside sources
for src in h.xpath('//@src'):
    if not src or src.startswith(('http', '//', 'data:')):
        continue
    local_path = 'page/' + src.lstrip('/')
    print(local_path)
    res = sess.get(base + src.lstrip('/'))  # fetch the asset itself
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, 'wb') as fp:
        fp.write(res.content)

You should now have a folder called page with a file called page.html in it containing the content you are after.
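The path handling in the script above relies on string concatenation, which is fragile around leading slashes and query strings. A possible alternative, sketched here with only the standard library, is to resolve each reference against the page URL and derive the local path from the result. The function `resolve_asset` is hypothetical, and treating "outside source" as "different host than the page" is an assumption.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_asset(page_url, ref, out_dir='page'):
    """Map an href/src value to (absolute_url, local_path).

    Returns None for empty references, non-http(s) references such as
    data: URIs, and assets hosted on a different host than the page.
    """
    if not ref:
        return None
    absolute = urljoin(page_url, ref)
    parsed = urlparse(absolute)
    if parsed.scheme not in ('http', 'https'):
        return None
    if parsed.netloc != urlparse(page_url).netloc:
        return None
    # Use only the URL path (no query string) for the file on disk
    relative = parsed.path.lstrip('/')
    if not relative or relative.endswith('/'):
        return None
    return absolute, posixpath.join(out_dir, relative)
```

Each loop body could then call `resolve_asset(URL, hr)`, skip the reference when the result is None, and otherwise download the first element to the second.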

Inspired by FThompson's answer above, I came up with the following tool that can download the full/complete html for a given page url (see: https://github.com/markfront/SinglePageFullHtml).

UPDATE - following up on Max's suggestion, below are the steps to use the tool:

  1. Clone the project, then run maven to build:

$> git clone https://github.com/markfront/SinglePageFullHtml.git

$> cd ~/git/SinglePageFullHtml
$> mvn clean compile package

  2. Find the generated jar file in the target folder: SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar

  3. Run the jar on the command line like:

$> java -jar ./target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar <page_url>

  4. The result file name has the prefix "FP", followed by the hash code of the page url, with the file extension ".html". It will be in the folder "/tmp" (which you can get via System.getProperty("java.io.tmpdir") in Java). If it is not there, look in your home directory (System.getProperty("user.home") in Java).

  5. The result file will be a big fat self-contained html file that includes everything (css, javascript, images, etc.) referred to by the original html source.
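Since the question is about a Python workflow, the jar can also be called from a Python script. The sketch below only wraps the `java -jar` command shown in the steps above. The helper names `build_command` and `save_page` are hypothetical, and the jar path assumes the script runs from the project directory.

```python
import subprocess

# Jar name taken from the build step; adjust the path for your checkout.
JAR = 'target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar'


def build_command(page_url, jar_path=JAR):
    """Return the argument list for `java -jar <jar> <page_url>`."""
    return ['java', '-jar', jar_path, page_url]


def save_page(page_url, jar_path=JAR):
    """Run the tool; raises CalledProcessError if it exits non-zero."""
    return subprocess.run(build_command(page_url, jar_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with urls that contain `&` or `?`.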

I'd advise you to try sikulix, an image-based automation tool that can operate any widget on a desktop OS. It supports Python syntax and can be run from the command line, so it may be the simplest way to solve your problem. All you need to do is give it a screenshot and call the sikulix script from your Python automation script (with os.system("xxxx") or subprocess...).
