How do I remove HTML tags from a list of strings that contain the same HTML tags?
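I have found a dozen ways to strip HTML and clean up scraped data when it is a string. My problem is that the data I scraped is a list.
The code below returns a list of elements that still include the HTML tags.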
price = soup.findAll("span", {"class": "s-item__price"})
I tried appending .get_text() to strip the HTML tags, but because the result is a list rather than a string, it raises an AttributeError:
price = soup.findAll("span", {"class": "s-item__price"}).get_text()
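To make the failure concrete, here is a minimal offline sketch. It assumes a small hand-written HTML snippet in place of the live eBay page (the markup is hypothetical) and uses find_all, the current spelling of findAll. It shows that the result is a list-like ResultSet of Tag objects, so get_text() exists on each element but not on the collection.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the eBay results page.
html = """
<ul>
  <li><span class="s-item__price">$449.99</span></li>
  <li><span class="s-item__price">$414.46</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("span", {"class": "s-item__price"})

# find_all returns a ResultSet, which is a list subclass holding Tag objects.
is_list = isinstance(matches, list)

# Calling get_text() on the collection itself fails.
try:
    matches.get_text()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Calling it on a single element works.
first_text = matches[0].get_text()

print(is_list, error_name, first_text)
```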
Here is the full script.
import requests
import re
from bs4 import BeautifulSoup
from html.parser import HTMLParser
URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
price = soup.findAll("span", {"class": "s-item__price"}).get_text()
print(price)
input('Press ENTER to exit')
I would like to do this without using an API.
You can use a for loop and call .get_text() on each element:
import requests
from bs4 import BeautifulSoup
URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
for price in soup.findAll("span", {"class": "s-item__price"}):
    print(price.get_text(strip=True))
Prints:
$449.99
$449.99
$414.46
$399.00
$399.95
$349.99
$449.00
$585.00
... and so on.
EDIT: To print both the title and the price, you can do, for example:
for tag in soup.select('li.s-item:has(.s-item__title):has(.s-item__price)'):
    print('{: <10} {}'.format(tag.select_one('.s-item__price').get_text(strip=True),
                              tag.select_one('.s-item__title').get_text(strip=True, separator=' ')))
Prints:
$449.99 SPONSORED OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$449.99 OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$414.46 Oneplus 6t dual sim 256gb midnight black black 6.41" unlocked ram 8gb a6010
$399.00 SPONSORED OnePlus 6T A6013, Clean ESN, Unknown Carrier, Coffee
$399.95 SPONSORED OnePlus 6T 4G LTE 6.41" 128GB ROM 8GB RAM A6013 (T-Mobile) - Mirror Black
$349.99 ONEPLUS 6T - BLACK - 128GB - (T-MOBILE) ~3841
$449.00 OnePlus 6t McLaren Edition Unlocked 256GB 10GB RAM Original Accessories Included
$434.83 OnePlus 6T 8 GB RAM 128 GB UK SIM-Free Smartphone (ML3658)
$265.74 Oneplus 6t
$241.58 New Listing OnePlus 6T 8GB 128GB UNLOCKED
$419.95 NEW IN BOX Oneplus 6T 128GB Mirror Black (T-mobile/Metro PCS/Mint) 8gb RAM
$435.99 OnePlus 6T - 128GB 6GB RAM - Mirror Black (Unlocked) Global Version
... and so on.
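The selector approach can also be tried offline. The sketch below assumes a hand-written snippet whose class names mirror the ones used above (the markup itself is hypothetical, not copied from eBay). It illustrates two points: the :has() filter skips list items that lack a price, and separator=' ' keeps a nested label such as "New Listing" from running into the title text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle item has a title but no price.
html = """
<ul>
  <li class="s-item">
    <h3 class="s-item__title"><span>New Listing</span>OnePlus 6T 8GB 128GB UNLOCKED</h3>
    <span class="s-item__price">$241.58</span>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Placeholder without a price</h3>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Oneplus 6t</h3>
    <span class="s-item__price">$265.74</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Keep only items that contain both a title and a price.
for tag in soup.select("li.s-item:has(.s-item__title):has(.s-item__price)"):
    price = tag.select_one(".s-item__price").get_text(strip=True)
    title = tag.select_one(".s-item__title").get_text(strip=True, separator=" ")
    rows.append((price, title))

for price, title in rows:
    # Left-align the price in a 10-character column, as in the answer above.
    print("{: <10} {}".format(price, title))
```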
You can't call get_text() on the list itself, but you can call it on each individual element:
price_elems = soup.findAll("span", {"class": "s-item__price"})
prices = [elem.get_text() for elem in price_elems]
This gives you a list of the actual text between the tags, which you can then print. Hope that helps! :)
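As a small self-contained illustration of the list comprehension, the sketch below assumes a hypothetical snippet in which the price text is padded with whitespace. It compares plain get_text() with get_text(strip=True); the latter is often the more convenient choice for scraped values.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with whitespace around the price text.
html = """
<span class="s-item__price"> $399.00 </span>
<span class="s-item__price">
  $349.99
</span>
"""

soup = BeautifulSoup(html, "html.parser")
price_elems = soup.find_all("span", {"class": "s-item__price"})

# Plain get_text() keeps the surrounding whitespace.
raw_prices = [elem.get_text() for elem in price_elems]

# strip=True trims each text fragment.
prices = [elem.get_text(strip=True) for elem in price_elems]

print(raw_prices)
print(prices)
```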