
How to scrape an SVG element from a website using Beautiful Soup?

from bs4 import BeautifulSoup
import requests

id_url = "https://codeforces.com/profile/akash77"
id_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
id_page = requests.get(id_url, headers=id_headers)
id_soup = BeautifulSoup(id_page.content, 'html.parser')

id_soup = id_soup.find('svg')
print(id_soup)

I'm getting None as the output for this.

If I parse the <div> element that contains this <svg> tag, the contents of the <div> are not printed either. find() works for every HTML tag except the SVG tag.

The svg tag is not included in the page source; it is rendered by JavaScript.
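A quick way to see this is to compare what Beautiful Soup finds in markup with and without the tag. The sketch below uses two small hand-written HTML strings (hypothetical stand-ins, not the actual Codeforces markup): one as a server might send it, with an empty container, and one as it might look after a script has filled the container in. The helper name find_svg is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the page; the real Codeforces markup will differ.
raw_html = '<div id="userActivityGraph"></div><script>/* draws the graph */</script>'
rendered_html = (
    '<div id="userActivityGraph">'
    '<svg width="10" height="10"><rect x="0" y="0"></rect></svg>'
    '</div>'
)

def find_svg(html):
    """Return the first <svg> tag in the markup, or None if there is none."""
    return BeautifulSoup(html, 'html.parser').find('svg')

print(find_svg(raw_html))       # None: the tag is absent from the served HTML
print(find_svg(rendered_html))  # the <svg> tag, once it exists in the markup
```

So find() handles SVG like any other tag; it returns None here only because requests receives the HTML before any script has run.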

The webpage is rendered dynamically with JavaScript, so you will need Selenium to get the rendered page.

First, install the libraries:

pip install selenium
pip install webdriver-manager

Then you can use it to load the full page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get('https://codeforces.com/profile/akash77')
elements = driver.find_elements(By.XPATH, '//*[@id="userActivityGraph"]')

find_elements returns a list of Selenium WebElement objects, so we need to get the HTML out of each one.

svg = [element.get_attribute('innerHTML') for element in elements]

This gives you the svg and all elements inside it.
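If you want to keep working with Beautiful Soup, each string in that list can be handed to it like any other HTML. This is a minimal sketch that uses a hand-written string in place of a real innerHTML value; the rect elements and the data-count attribute are invented for illustration and may not match what the page actually produces.

```python
from bs4 import BeautifulSoup

# Hypothetical innerHTML string; on the real page it would be one entry of the svg list.
inner_html = (
    '<svg width="30" height="10">'
    '<rect x="0" y="0" data-count="2"></rect>'
    '<rect x="10" y="0" data-count="0"></rect>'
    '<rect x="20" y="0" data-count="5"></rect>'
    '</svg>'
)

svg_soup = BeautifulSoup(inner_html, 'html.parser')
svg_tag = svg_soup.find('svg')

# Collect one attribute from every <rect> inside the <svg>
counts = [int(rect['data-count']) for rect in svg_tag.find_all('rect')]
print(counts)
```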


Sometimes you need to run the browser in headless mode (without opening a Chrome window). To do that, pass a 'headless' option to the driver.

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('headless')

# then pass options to the driver

driver = webdriver.Chrome(service=s, options=options) 

If you just want the data, it is already there in the HTML. This isn't pretty, but it works and is much quicker and easier than browser automation:

import requests
import json

url = 'https://codeforces.com/profile/akash77'

resp = requests.get(url)

start = "$('#userActivityGraph').empty().calendar_yearview_blocks("
end = "start_monday: false"

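s = resp.text
# Slice out the argument passed to calendar_yearview_blocks(...), up to the start_monday option;
# strip()[:-1] then drops the last character of that slice
raw = s[s.find(start) + len(start):s.rfind(end)].strip()[:-1]
# Quote the bare keys and remove whitespace so the text can be parsed as JSON
svg_data = raw.replace('items', '"items"').replace('data', '"data"').replace('\n', '').replace('\t', '').replace(' ', '')
# Re-add the closing brace that the slice cut off
broken = svg_data + '}'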

json_data = json.loads(broken)
print(json_data)
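The chained replace calls above depend on the exact layout of the inline script, and they would also rewrite the words "items" and "data" if they appeared inside a value. A slightly more defensive variant of the same idea is sketched below: a regular expression quotes only bare keys that follow "{" or ",", and the function raises an error if the markers are missing. It runs against a small hand-written string, because the real script's exact shape is an assumption here; extract_calendar_data and sample_page are illustrative names, not part of any library.

```python
import json
import re

START = "$('#userActivityGraph').empty().calendar_yearview_blocks("
END = "start_monday: false"

def extract_calendar_data(page_text):
    """Return the object passed to calendar_yearview_blocks as a dict."""
    begin = page_text.find(START)
    stop = page_text.rfind(END)
    if begin == -1 or stop == -1:
        raise ValueError("activity graph script not found in page")
    # Everything between the opening parenthesis and the start_monday option
    literal = page_text[begin + len(START):stop].strip().rstrip(',')
    # Quote bare keys such as  data:  or  items:  that follow '{' or ','
    literal = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', literal)
    # The slice dropped the closing brace of the options object
    return json.loads(literal + '}')

# Hypothetical sample; the real inline script may be laid out differently.
sample_page = """
<script>
$('#userActivityGraph').empty().calendar_yearview_blocks({
    data: {"2021-01-05": {items: ["a", "b"]}, "2021-01-06": {items: ["c"]}},
    start_monday: false
});
</script>
"""

activity = extract_calendar_data(sample_page)
print(activity["data"]["2021-01-05"]["items"])
```

The regular expression is still a heuristic: a string value that happens to contain a comma followed by a word and a colon would be altered too, so check the output against the page before relying on it.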
