How do I remove HTML tags from a list of strings that contain the same HTML tags?

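I have found a dozen ways to strip HTML and clean up scraped data when it is in string format. My problem is that the data I scraped is in list format.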

The code below returns a list of the data with the HTML tags still included.

price = soup.findAll("span", {"class": "s-item__price"})

I tried appending .get_text() to remove the HTML tags, but because the result is a list rather than a string, I get an AttributeError.

price = soup.findAll("span", {"class": "s-item__price"}).get_text()

Here is the full script.

import requests
import re
from bs4 import BeautifulSoup 
from html.parser import HTMLParser

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

price = soup.findAll("span", {"class": "s-item__price"}).get_text()

print(price)

input('Press ENTER to exit')

I would like to do this without using an API.

You can write a for loop and call .get_text() on each element:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

for price in soup.findAll("span", {"class": "s-item__price"}):
    print(price.get_text(strip=True))

Prints:

$449.99
$449.99
$414.46
$399.00
$399.95
$349.99
$449.00
$585.00
... and so on.

EDIT: To print both the titles and the prices, you can do, for example:

for tag in soup.select('li.s-item:has(.s-item__title):has(.s-item__price)'):
    print('{: <10} {}'.format(tag.select_one('.s-item__price').get_text(strip=True),
                              tag.select_one('.s-item__title').get_text(strip=True, separator=' ')))

Prints:

$449.99    SPONSORED OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$449.99    OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$414.46    Oneplus 6t dual sim 256gb midnight black black 6.41" unlocked ram 8gb a6010
$399.00    SPONSORED OnePlus 6T A6013, Clean ESN, Unknown Carrier, Coffee
$399.95    SPONSORED OnePlus 6T 4G LTE 6.41" 128GB ROM 8GB RAM A6013 (T-Mobile)  - Mirror Black
$349.99    ONEPLUS 6T - BLACK - 128GB - (T-MOBILE) ~3841
$449.00    OnePlus 6t McLaren Edition Unlocked 256GB 10GB RAM Original Accessories Included
$434.83    OnePlus 6T 8 GB RAM 128 GB UK SIM-Free Smartphone (ML3658)
$265.74    Oneplus 6t
$241.58    New Listing OnePlus 6T 8GB 128GB UNLOCKED
$419.95    NEW IN BOX Oneplus 6T  128GB  Mirror Black (T-mobile/Metro PCS/Mint) 8gb RAM
$435.99    OnePlus 6T - 128GB 6GB RAM - Mirror Black (Unlocked) Global Version

... and so on.
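If the pairs are needed for further processing rather than just printing, the same selector can feed a list. The following is a minimal sketch, not part of the original answer: SAMPLE_HTML is hypothetical markup standing in for the eBay page, and the `listings` name is an assumption. The `:has()` selector skips any item that lacks a title or a price, so `select_one` never returns None inside the loop.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page (assumption, not real eBay HTML).
SAMPLE_HTML = """
<ul>
  <li class="s-item">
    <h3 class="s-item__title"><span>SPONSORED</span>OnePlus 6T 128GB</h3>
    <span class="s-item__price">$449.99</span>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Oneplus 6t</h3>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">ONEPLUS 6T - BLACK</h3>
    <span class="s-item__price">$349.99</span>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect (price, title) tuples; the second <li> has no price and is filtered out.
listings = []
for tag in soup.select("li.s-item:has(.s-item__title):has(.s-item__price)"):
    price = tag.select_one(".s-item__price").get_text(strip=True)
    # separator=" " keeps nested text such as "SPONSORED" apart from the title.
    title = tag.select_one(".s-item__title").get_text(strip=True, separator=" ")
    listings.append((price, title))

for price, title in listings:
    print("{: <10} {}".format(price, title))
```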

You cannot call get_text() on the list itself, but you can call it on the individual elements.

price_elems = soup.findAll("span", {"class": "s-item__price"})
prices = [elem.get_text() for elem in price_elems]

This gives you a list of the actual text between the tags, which you can then print. Hope that helps! :)
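To see both the failure and the fix without hitting the network, here is a small self-contained sketch. SAMPLE_HTML is hypothetical markup, not the real eBay page, and `list_call_failed` is just an illustrative flag. It uses `find_all`, the current spelling of `findAll` in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page (assumption).
SAMPLE_HTML = """
<ul>
  <li><span class="s-item__price">$449.99</span></li>
  <li><span class="s-item__price"> $414.46 </span></li>
  <li><span class="s-item__price">$399.00</span></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
price_elems = soup.find_all("span", class_="s-item__price")

# find_all returns a list-like result set, which has no get_text() of its own.
try:
    price_elems.get_text()
    list_call_failed = False
except AttributeError:
    list_call_failed = True

# Calling get_text() per element works; strip=True trims surrounding whitespace.
prices = [elem.get_text(strip=True) for elem in price_elems]
print(prices)
```

Each entry in `prices` is a plain Python string, so the usual string-cleaning approaches apply to it from here on.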

