How do I remove HTML tags from a list of strings that contain the same HTML tags?
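I have found a dozen ways to remove HTML and clean up scraped data when it is in string format. My problem is that the data I scraped is in list format.
The code below prints a list of data that still contains the HTML tags.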
price = soup.findAll("span", {"class": "s-item__price"})
I tried appending .get_text() to strip the HTML tags, but because the result is a list rather than a string, this raises an AttributeError:
price = soup.findAll("span", {"class": "s-item__price"}).get_text()
The full script is below.
import requests
import re
from bs4 import BeautifulSoup
from html.parser import HTMLParser
URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
price = soup.findAll("span", {"class": "s-item__price"}).get_text()
print(price)
input('Press ENTER to exit')
I would like to do this without using the API.
You can create a for loop and call .get_text() on each element:
import requests
from bs4 import BeautifulSoup
URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
for price in soup.findAll("span", {"class": "s-item__price"}):
    print(price.get_text(strip=True))
Prints:
$449.99
$449.99
$414.46
$399.00
$399.95
$349.99
$449.00
$585.00
...and so on.
EDIT: To print the title and price, you can do, for example:
for tag in soup.select('li.s-item:has(.s-item__title):has(.s-item__price)'):
    print('{: <10} {}'.format(tag.select_one('.s-item__price').get_text(strip=True),
                              tag.select_one('.s-item__title').get_text(strip=True, separator=' ')))
Prints:
$449.99 SPONSORED OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$449.99 OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$414.46 Oneplus 6t dual sim 256gb midnight black black 6.41" unlocked ram 8gb a6010
$399.00 SPONSORED OnePlus 6T A6013, Clean ESN, Unknown Carrier, Coffee
$399.95 SPONSORED OnePlus 6T 4G LTE 6.41" 128GB ROM 8GB RAM A6013 (T-Mobile) - Mirror Black
$349.99 ONEPLUS 6T - BLACK - 128GB - (T-MOBILE) ~3841
$449.00 OnePlus 6t McLaren Edition Unlocked 256GB 10GB RAM Original Accessories Included
$434.83 OnePlus 6T 8 GB RAM 128 GB UK SIM-Free Smartphone (ML3658)
$265.74 Oneplus 6t
$241.58 New Listing OnePlus 6T 8GB 128GB UNLOCKED
$419.95 NEW IN BOX Oneplus 6T 128GB Mirror Black (T-mobile/Metro PCS/Mint) 8gb RAM
$435.99 OnePlus 6T - 128GB 6GB RAM - Mirror Black (Unlocked) Global Version
... and so on.
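For anyone who wants to try the title-and-price loop without hitting eBay, here is a minimal offline sketch. The SAMPLE_HTML snippet and the extract_listings helper are made up for illustration (the real page markup may differ), and instead of the :has() selector this variant skips incomplete items by checking whether select_one() returned None.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the search results page; real eBay markup may differ.
SAMPLE_HTML = """
<ul>
  <li class="s-item">
    <h3 class="s-item__title"><span>SPONSORED</span>OnePlus 6T 128GB</h3>
    <span class="s-item__price">$449.99</span>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Listing without a price</h3>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Oneplus 6t</h3>
    <span class="s-item__price">$265.74</span>
  </li>
</ul>
"""


def extract_listings(html):
    """Return (price, title) text pairs for items that have both fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.select("li.s-item"):
        price = tag.select_one(".s-item__price")
        title = tag.select_one(".s-item__title")
        # select_one() returns None when nothing matches, so skip incomplete items.
        if price is None or title is None:
            continue
        rows.append((price.get_text(strip=True),
                     title.get_text(strip=True, separator=" ")))
    return rows


listings = extract_listings(SAMPLE_HTML)
for price, title in listings:
    print("{: <10} {}".format(price, title))
```

Passing separator=" " matters for titles that contain nested tags: without it, the text of the inner span would be glued directly onto the rest of the title.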
You cannot call get_text() on the list itself, but you can call it on the individual elements.
price_elems = soup.findAll("span", {"class": "s-item__price"})
prices = [elem.get_text() for elem in price_elems]
This gives you a list of the actual text between the tags, which you can then print. Hope that helps! :)
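To see both the error and the fix side by side without a network request, something like the sketch below should work. SAMPLE_HTML is a made-up snippet standing in for the scraped page, and find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; real eBay markup may differ.
SAMPLE_HTML = """
<ul>
  <li class="s-item"><span class="s-item__price">$449.99</span></li>
  <li class="s-item"><span class="s-item__price"> $414.46 </span></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
price_elems = soup.find_all("span", {"class": "s-item__price"})

# Calling get_text() on the whole result set fails, as described in the question.
try:
    price_elems.get_text()
    list_call_failed = False
except AttributeError:
    list_call_failed = True

# Calling it on each element works; strip=True trims surrounding whitespace.
prices = [elem.get_text(strip=True) for elem in price_elems]

print(list_call_failed)
print(prices)
```

The list comprehension keeps the results in a plain list of strings, which is handy if you want to process the prices further instead of printing them straight away.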