How do I remove HTML tags from a list of strings that all contain the same HTML tags?

Question

I have found a dozen ways to remove HTML and clean up scraped data when it is a string. My problem is that the data I scraped is a list.

The code below prints a list of elements that still contain their HTML tags.

price = soup.findAll("span", {"class": "s-item__price"})

I tried appending .get_text() to strip the HTML tags, but because the result is a list rather than a string, I get an AttributeError:

price = soup.findAll("span", {"class": "s-item__price"}).get_text()

Here is the full script.

import requests
import re
from bs4 import BeautifulSoup 
from html.parser import HTMLParser

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

price = soup.findAll("span", {"class": "s-item__price"}).get_text()

print(price)

input('Press ENTER to exit')

I would like to do this without using an API.

Answer 1

You can write a for loop and call .get_text() on each element:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

for price in soup.findAll("span", {"class": "s-item__price"}):
    print(price.get_text(strip=True))

Prints:

$449.99
$449.99
$414.46
$399.00
$399.95
$349.99
$449.00
$585.00
...and so on.
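Since the live results page may have changed since this was posted, here is a minimal offline sketch of the same loop. The SAMPLE_HTML string is made up for illustration and only mimics the s-item__price spans; find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the eBay results page.
SAMPLE_HTML = """
<ul>
  <li class="s-item"><span class="s-item__price"> $449.99 </span></li>
  <li class="s-item"><span class="s-item__price">$414.46</span></li>
  <li class="s-item"><span class="s-item__price">$399.00</span></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like ResultSet, so get_text() is called on each Tag in turn.
for price in soup.find_all("span", {"class": "s-item__price"}):
    # strip=True removes the surrounding whitespace from the extracted text.
    print(price.get_text(strip=True))
```

This prints one price per line with the span tags and padding removed.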

EDIT: To print both the title and the price, you can do for example:

for tag in soup.select('li.s-item:has(.s-item__title):has(.s-item__price)'):
    print('{: <10} {}'.format(tag.select_one('.s-item__price').get_text(strip=True),
                              tag.select_one('.s-item__title').get_text(strip=True, separator=' ')))

Prints:

$449.99    SPONSORED OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$449.99    OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$414.46    Oneplus 6t dual sim 256gb midnight black black 6.41" unlocked ram 8gb a6010
$399.00    SPONSORED OnePlus 6T A6013, Clean ESN, Unknown Carrier, Coffee
$399.95    SPONSORED OnePlus 6T 4G LTE 6.41" 128GB ROM 8GB RAM A6013 (T-Mobile)  - Mirror Black
$349.99    ONEPLUS 6T - BLACK - 128GB - (T-MOBILE) ~3841
$449.00    OnePlus 6t McLaren Edition Unlocked 256GB 10GB RAM Original Accessories Included
$434.83    OnePlus 6T 8 GB RAM 128 GB UK SIM-Free Smartphone (ML3658)
$265.74    Oneplus 6t
$241.58    New Listing OnePlus 6T 8GB 128GB UNLOCKED
$419.95    NEW IN BOX Oneplus 6T  128GB  Mirror Black (T-mobile/Metro PCS/Mint) 8gb RAM
$435.99    OnePlus 6T - 128GB 6GB RAM - Mirror Black (Unlocked) Global Version

... and so on.
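If you would rather collect the pairs than print them, the same selector can feed a list. This is only a sketch: extract_listings and SAMPLE_HTML are hypothetical names, the markup is invented to mimic the page, and the :has() selector assumes a Beautiful Soup version that ships with soupsieve (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the third item has no price and should be skipped.
SAMPLE_HTML = """
<ul>
  <li class="s-item">
    <h3 class="s-item__title"><span>SPONSORED</span>OnePlus 6T 128GB</h3>
    <span class="s-item__price">$449.99</span>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Oneplus 6t</h3>
    <span class="s-item__price">$265.74</span>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Shop on eBay</h3>
  </li>
</ul>
"""


def extract_listings(html):
    """Return (price, title) pairs for items that have both fields."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    # :has() keeps only the <li> items containing both a title and a price.
    for item in soup.select("li.s-item:has(.s-item__title):has(.s-item__price)"):
        price = item.select_one(".s-item__price").get_text(strip=True)
        # separator=" " keeps nested pieces such as "SPONSORED" apart from the title.
        title = item.select_one(".s-item__title").get_text(strip=True, separator=" ")
        listings.append((price, title))
    return listings


for price, title in extract_listings(SAMPLE_HTML):
    print("{: <10} {}".format(price, title))
```

Filtering with :has() up front means select_one never returns None inside the loop, so no extra checks are needed before calling get_text().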

Answer 2

You can't call get_text() on the list itself, but you can call it on each individual element.

price_elems = soup.findAll("span", {"class": "s-item__price"})
prices = [elem.get_text() for elem in price_elems]

This gives you a list of the actual text between the tags, which you can then print. Hope that helps! :)
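To see both halves of that statement in one place, here is a small offline sketch. SAMPLE_HTML and the list_call_failed flag are invented for illustration; the rest uses the same calls as above, with find_all as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mimicking two price spans.
SAMPLE_HTML = """
<span class="s-item__price">$449.99</span>
<span class="s-item__price">$399.95</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
price_elems = soup.find_all("span", {"class": "s-item__price"})

# The collection itself has no get_text(); asking for it raises AttributeError.
try:
    price_elems.get_text()
    list_call_failed = False
except AttributeError:
    list_call_failed = True

# Each element is a Tag, which does have get_text().
prices = [elem.get_text(strip=True) for elem in price_elems]

print(list_call_failed)
print(prices)
```

The list comprehension leaves you with plain Python strings, so the result can be printed, joined, or written to a file like any other list.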

Solution 1
0 Accepted 2019-08-12 04:47:08

Solution 2
0 2019-08-12 04:47:32
