How do I scrape SVG elements from a website using Beautiful Soup?

Question

from bs4 import BeautifulSoup
import requests
import random

id_url = "https://codeforces.com/profile/akash77"
id_headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
id_page = requests.get(id_url, headers=id_headers)
id_soup = BeautifulSoup(id_page.content, 'html.parser')

id_soup = id_soup.find('svg')
print(id_soup)

I get None as the output.

If I parse the <div> element that contains this <svg> tag, the contents of the <div> are not printed either. find() works for every other HTML tag, just not for the SVG tag.

Answer 1

The svg tag is not in the page source; it is rendered by JavaScript, so the HTML that requests downloads never contains it.
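
One way to confirm this without a browser is to list the tags that actually appear in the raw markup. The sketch below uses only the standard library and two made-up HTML strings (STATIC_HTML and RENDERED_HTML are illustrative assumptions, not the real Codeforces markup); the same check could be run on id_page.text from the question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the name of every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_in(html):
    """Return the set of tag names present in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Hypothetical page source: the graph container is empty and a script
# would create the svg element later, inside the browser.
STATIC_HTML = """
<div id="userActivityGraph"></div>
<script>
  var ns = "http://www.w3.org/2000/svg";
  document.getElementById("userActivityGraph")
          .appendChild(document.createElementNS(ns, "svg"));
</script>
"""

# Hypothetical markup after the script has run in a browser.
RENDERED_HTML = (
    '<div id="userActivityGraph">'
    '<svg><rect width="10" height="10"></rect></svg>'
    '</div>'
)

print("svg" in tags_in(STATIC_HTML))    # the raw source has no svg tag
print("svg" in tags_in(RENDERED_HTML))  # the rendered DOM does
```

If the tag is missing from the raw response, no parser will find it, which is why find('svg') returns None here while working for tags that are present in the source.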

Answer 2

The page is rendered dynamically with JavaScript, so you will need Selenium to get the rendered page.

First, install the libraries:

pip install selenium
pip install webdriver-manager

Then you can use it to load the full page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get('https://codeforces.com/profile/akash77')
elements = driver.find_elements(By.XPATH, '//*[@id="userActivityGraph"]')

elements is a list of Selenium WebElement objects, so we need to pull the HTML out of each one.

svg = [element.get_attribute('innerHTML') for element in elements]

This gives you a list of HTML strings, each containing the svg and all the elements inside it.
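
Once the rendered HTML is available as a string, it can be handed back to Beautiful Soup, and find('svg') then behaves like it does for any other tag. The following is a minimal sketch, assuming a made-up SVG string in place of one entry of the svg list; the rect attributes and the rect_attributes helper are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one string from the `svg` list.
sample_inner_html = (
    '<svg width="40" height="20">'
    '<rect x="0" y="0" width="10" height="10" fill="#ebedf0"></rect>'
    '<rect x="12" y="0" width="10" height="10" fill="#40c463"></rect>'
    '</svg>'
)


def rect_attributes(inner_html):
    """Return the attribute dict of every <rect> inside the first <svg>."""
    soup = BeautifulSoup(inner_html, "html.parser")
    svg_tag = soup.find("svg")
    if svg_tag is None:
        # No svg in this fragment, e.g. the page had not rendered yet.
        return []
    return [dict(rect.attrs) for rect in svg_tag.find_all("rect")]


rects = rect_attributes(sample_inner_html)
for rect in rects:
    print(rect["x"], rect["fill"])
```

Returning an empty list when no svg is found keeps the caller from hitting the same None problem as in the question.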

Sometimes you need to run the browser in headless mode (without opening a Chrome window). For that, pass the 'headless' option to the driver.

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('headless')

# then pass options to the driver

driver = webdriver.Chrome(service=s, options=options)

Answer 3

If you just want the data, it is already in the HTML. This isn't pretty, but it works and is much quicker and easier than browser automation:

import requests
import json

url = 'https://codeforces.com/profile/akash77'

resp = requests.get(url)

start = "$('#userActivityGraph').empty().calendar_yearview_blocks("
end = "start_monday: false"

s = resp.text
svg_data = s[s.find(start)+len(start):s.rfind(end)].strip()[:-1].replace('items','"items"').replace('data','"data"').replace('\n','').replace('\t','').replace(' ','') #get the token out the html
broken = svg_data+'}'

json_data = json.loads(broken)
print(json_data)
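
The chained replace() calls above depend on the exact spelling and order of the keys, so a slightly more general variant may be easier to maintain. The sketch below is one possible approach, not the method used in the answer: it cuts out the balanced {...} literal that follows a marker and quotes bare keys with a regular expression before calling json.loads. The helper names and the sample text are hypothetical, the real page may be laid out differently, and the approach assumes no braces or key-like "name:" patterns inside string values and no trailing commas.

```python
import json
import re


def extract_js_object(page_text, marker):
    """Return the balanced {...} literal that follows marker, or None."""
    start = page_text.find(marker)
    if start == -1:
        return None
    open_idx = page_text.find("{", start)
    if open_idx == -1:
        return None
    depth = 0
    for i in range(open_idx, len(page_text)):
        if page_text[i] == "{":
            depth += 1
        elif page_text[i] == "}":
            depth -= 1
            if depth == 0:
                return page_text[open_idx:i + 1]
    return None  # braces never balanced


def quote_bare_keys(js_literal):
    """Wrap unquoted identifiers used as keys in double quotes."""
    return re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', js_literal)


# Hypothetical stand-in for resp.text.
sample_text = """
$('#userActivityGraph').empty().calendar_yearview_blocks({
    data: {"2022-01-10": {items: [1, 2]}, "2022-01-11": {items: [3]}},
    start_monday: false
});
"""

literal = extract_js_object(sample_text, "calendar_yearview_blocks(")
parsed = json.loads(quote_bare_keys(literal))
print(parsed["data"]["2022-01-10"]["items"])
```

Matching braces instead of searching for a fixed end string means the extraction does not break if other options are added after start_monday.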

Answer 1
1 2022-01-11 06:40:03

Answer 2
1 Accepted 2022-01-11 07:03:56

Answer 3
1 2022-01-11 14:03:25
