How to extract or scrape data from an HTML page when it is stored in the element itself

Question

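I currently use lxml to parse HTML documents and pull data out of HTML elements, but I have run into a new challenge: one piece of data, the rating, is stored inside the HTML element itself.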

https://i.stack.imgur.com/bwGle.png

<p data-rating="3">
    <span class="glyphicon glyphicon-star xh-highlight"></span>
    <span class="glyphicon glyphicon-star xh-highlight"></span>
    <span class="glyphicon glyphicon-star xh-highlight"></span>
</p>

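Extracting the text between tags is easy, but I have no idea how to get a value from inside the tag itself. Do you have any suggestions?

Challenge: I want to extract the "3". URL: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops

Best regards, Gabriel.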

Answer 1

Try the script below:

import re

from bs4 import BeautifulSoup
import requests

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("div", {"class": "ratings"}):
    # iterate over the direct children of each ratings <div>
    for h in tag.children:
        # serialize the child node back to an HTML string
        s = str(h)

        # capture the quoted value that follows data-rating=
        m = re.search(r'data-rating="([^"]*)"', s)

        # skip children that carry no data-rating attribute
        if m:
            # print only the captured value, without the surrounding quotes
            print(m.group(1))
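
As an alternative to the regular expression, BeautifulSoup can read the attribute directly. The following is only a minimal sketch, not the answerer's method: it runs on an inline copy of the snippet from the question rather than the live page, and the `SAMPLE_HTML` name, the `extract_ratings` helper, the wrapping `div` and the "14 reviews" paragraph are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the snippet in the question (assumption:
# the wrapping <div class="ratings"> and the reviews <p> are illustrative).
SAMPLE_HTML = """
<div class="ratings">
    <p class="pull-right">14 reviews</p>
    <p data-rating="3">
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
    </p>
</div>
"""


def extract_ratings(html):
    """Return the data-rating value of every <p> that has that attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={"data-rating": True} matches any <p> carrying the attribute,
    # whatever its value; tags without it are skipped.
    tags = soup.find_all("p", attrs={"data-rating": True})
    # Attributes are read with dictionary-style access on the tag.
    return [tag["data-rating"] for tag in tags]


ratings = extract_ratings(SAMPLE_HTML)
print(ratings)
```

Attribute values come back as strings, so wrap them in `int()` if a number is needed.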

Answer 2

If I understand your question and comments correctly, the following should extract all the ratings on that page:

import lxml.html
import requests

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')

For example:

targets[0]

Output:

3
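
The same idea can be tried without a network request. The sketch below is an illustration under stated assumptions, not part of the original answer: it parses an inline copy of the snippet from the question, selects on the `data-rating` attribute itself rather than on the child `span`, and converts each value to an integer. `SAMPLE_HTML` and `extract_ratings` are hypothetical names.

```python
import lxml.html

# Inline sample modelled on the snippet in the question (assumption:
# the wrapping <div class="ratings"> is illustrative).
SAMPLE_HTML = """
<div class="ratings">
    <p data-rating="3">
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
    </p>
</div>
"""


def extract_ratings(html):
    """Return every data-rating value in the markup as an integer."""
    root = lxml.html.fromstring(html)
    # [@data-rating] keeps only <p> elements that have the attribute.
    elements = root.xpath(".//p[@data-rating]")
    # Element.get() reads an attribute value as a string.
    return [int(el.get("data-rating")) for el in elements]


ratings = extract_ratings(SAMPLE_HTML)
print(ratings)
```

For a single element already in hand, `element.get("data-rating")` or `element.attrib["data-rating"]` returns the raw string value.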

Answer 1: 0, 2019-11-15 20:10:11
Answer 2: 0, accepted, 2019-11-18 20:36:11
