How do I remove HTML tags from a list of strings that contain the same HTML tags?

I have found a dozen ways to strip HTML and clean up scraped data when it is in string format. My problem is that the data I scraped is in a list.

The code below prints a list of the data, HTML tags included.

price = soup.findAll("span", {"class": "s-item__price"})

I tried appending .get_text() to strip the HTML tags, but because the result is a list rather than a string, I get an AttributeError.

price = soup.findAll("span", {"class": "s-item__price"}).get_text()

Here is the full script.

import requests
import re
from bs4 import BeautifulSoup 
from html.parser import HTMLParser

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

price = soup.findAll("span", {"class": "s-item__price"}).get_text()

print(price)

input('Press ENTER to exit')

I would like to do this without using an API.

You can write a for loop and call .get_text() on each element:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

for price in soup.findAll("span", {"class": "s-item__price"}):
    print(price.get_text(strip=True))

Prints:

$449.99
$449.99
$414.46
$399.00
$399.95
$349.99
$449.00
$585.00
... and so on.

EDIT: To print the title along with the price, you can do, for example:

for tag in soup.select('li.s-item:has(.s-item__title):has(.s-item__price)'):
    print('{: <10} {}'.format(tag.select_one('.s-item__price').get_text(strip=True),
                              tag.select_one('.s-item__title').get_text(strip=True, separator=' ')))

Prints:

$449.99    SPONSORED OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$449.99    OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$414.46    Oneplus 6t dual sim 256gb midnight black black 6.41" unlocked ram 8gb a6010
$399.00    SPONSORED OnePlus 6T A6013, Clean ESN, Unknown Carrier, Coffee
$399.95    SPONSORED OnePlus 6T 4G LTE 6.41" 128GB ROM 8GB RAM A6013 (T-Mobile)  - Mirror Black
$349.99    ONEPLUS 6T - BLACK - 128GB - (T-MOBILE) ~3841
$449.00    OnePlus 6t McLaren Edition Unlocked 256GB 10GB RAM Original Accessories Included
$434.83    OnePlus 6T 8 GB RAM 128 GB UK SIM-Free Smartphone (ML3658)
$265.74    Oneplus 6t
$241.58    New Listing OnePlus 6T 8GB 128GB UNLOCKED
$419.95    NEW IN BOX Oneplus 6T  128GB  Mirror Black (T-mobile/Metro PCS/Mint) 8gb RAM
$435.99    OnePlus 6T - 128GB 6GB RAM - Mirror Black (Unlocked) Global Version

... and so on.
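
To see what the :has() selector is doing without requesting the eBay page, here is a minimal offline sketch. The HTML snippet is made up for illustration (it is not eBay's real markup), and the sketch assumes a BeautifulSoup release that ships with soupsieve (4.7 or later), which is what provides :has() support in select().

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only, not real eBay output.
SAMPLE_HTML = """
<ul>
  <li class="s-item">
    <h3 class="s-item__title">Phone A</h3>
    <span class="s-item__price">$100.00</span>
  </li>
  <li class="s-item">
    <h3 class="s-item__title">Phone B (no price)</h3>
  </li>
  <li class="s-item">
    <h3 class="s-item__title"><span>New Listing</span>Phone C</h3>
    <span class="s-item__price">$250.50</span>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
# Keep only <li class="s-item"> elements that contain BOTH a title and a price.
for tag in soup.select("li.s-item:has(.s-item__title):has(.s-item__price)"):
    price = tag.select_one(".s-item__price").get_text(strip=True)
    # separator=" " puts a space between text from nested tags.
    title = tag.select_one(".s-item__title").get_text(strip=True, separator=" ")
    rows.append((price, title))

for price, title in rows:
    print("{: <10} {}".format(price, title))
```

The item without a price is skipped by the selector, so select_one() never returns None inside the loop.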

You can't call get_text() on the list itself, but you can call it on each individual element.

price_elems = soup.findAll("span", {"class": "s-item__price"})
prices = [elem.get_text() for elem in price_elems]

This gives you a list of the actual text between the tags, which you can then print. Hope that helps! :)
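
As a rough illustration of why the original call fails and the per-element call works, here is a small sketch that runs offline. The markup is a made-up stand-in for the scraped page, and find_all is used as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
SAMPLE_HTML = """
<span class="s-item__price">$449.99</span>
<span class="s-item__price"> $414.46 </span>
<span class="other">ignored</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
price_elems = soup.find_all("span", {"class": "s-item__price"})

# Calling get_text() on the whole result raises AttributeError,
# because the result is a list-like collection of tags, not a single tag.
try:
    price_elems.get_text()
    raised = False
except AttributeError:
    raised = True

# Calling it on each element works; strip=True trims surrounding whitespace.
prices = [elem.get_text(strip=True) for elem in price_elems]
print(raised, prices)
```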

