
How to get partial text from a long tag using BeautifulSoup

I have been studying a shopping website, and I want to extract the brand name and the product name from HTML like the following:

<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>

I tried: results = soup.findAll("h1", {"class" : "product-name elim-suites"})[0].text

and got: u'ChantecailleLimited Edition Protect the Lion Eye Palette'

As you can see, Chantecaille is the brand name and the rest is the product name, but the two are now stuck together with no separator. Any suggestions? Thank you!

You can use previous_sibling, which gets the previous node that has the same parent (that is, the node at the same level in the parse tree).

Also, when you are searching for a single element, use find instead of findAll.

# The span holds the product name; the text node just before it is the brand.
item_span = soup.find("h1", {"class" : "product-name elim-suites"}).find("span")

brand_name = item_span.previous_sibling
product_name = item_span.text

print(brand_name)
print(product_name)

Output: 输出:

Chantecaille
Limited Edition Protect the Lion Eye Palette
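
For readers on Python 3, the same previous_sibling idea can be sketched as a small self-contained script. The helper name split_brand_and_product and the None checks are illustrative additions, not part of the original answer, and the sketch assumes the brand is always the text node directly before the span:

```python
from bs4 import BeautifulSoup

html = ('<h1 class="product-name elim-suites">Chantecaille'
        '<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>')

soup = BeautifulSoup(html, "html.parser")


def split_brand_and_product(soup):
    """Hypothetical helper: return (brand, product), or None if the markup is missing."""
    # class_ matches any h1 whose class list contains "product-name".
    h1 = soup.find("h1", class_="product-name")
    if h1 is None or h1.span is None:
        return None
    span = h1.span
    # previous_sibling is None when nothing precedes the span inside the h1.
    brand_node = span.previous_sibling
    brand = str(brand_node).strip() if brand_node is not None else ""
    product = span.get_text(strip=True)
    return brand, product


result = split_brand_and_product(soup)
print(result)
```

Checking for None before touching the span avoids an AttributeError on pages where the heading is absent, which find makes easy because it returns None rather than an empty list.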

You could call get_text with a separator string to split the text, or take the h1's own text with h1.find(text=True, recursive=False) and read the span's text directly:

In [1]: h ="""<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette
   ...: </span></h1>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup = BeautifulSoup(h, "html.parser")

In [4]: h1 = soup.select_one("h1.product-name.elim-suites")

In [5]: print(h1.get_text("\n"))
Chantecaille
Limited Edition Protect the Lion Eye Palette


In [6]: prod, desc = h1.find(text=True, recursive=False), h1.span.text

In [7]: print(prod, desc)
(u'Chantecaille', u'Limited Edition Protect the Lion Eye Palette\n')
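
The trailing newline in the span text above can be avoided by asking Beautiful Soup to strip each string. The following is a minimal Python 3 sketch of that variation; the " | " separator and the variable names joined and parts are arbitrary choices for illustration:

```python
from bs4 import BeautifulSoup

html = """<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette
</span></h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.product-name.elim-suites")

# strip=True trims whitespace around each text fragment before joining.
joined = h1.get_text(" | ", strip=True)

# stripped_strings yields the same trimmed fragments one at a time.
parts = list(h1.stripped_strings)

print(joined)
print(parts)
```

Using stripped_strings is convenient when the fragments are needed as separate values rather than as one delimited string.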

Or, if text could also appear after the span, use find_all:

In [8]: h ="""<h1 class="product-name elim-suites">Chantecaille
<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>"""


In [9]: from bs4 import BeautifulSoup

In [10]: soup = BeautifulSoup(h, "html.parser")

In [11]: h1 = soup.select_one("h1.product-name.elim-suites")

In [12]: print(h1.get_text("\n"))
Chantecaille
Limited Edition Protect the Lion Eye Palette
other text

In [13]: prod, desc = " ".join(h1.find_all(text=True, recursive=False)), h1.span.text

In [14]: print(prod, desc)
(u'Chantecaille other text', u'Limited Edition Protect the Lion Eye Palette')
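
A Python 3 version of the find_all approach might look like the sketch below. It uses the string= keyword, which newer Beautiful Soup releases prefer over text=, and it strips each fragment so the newline after the brand does not leak into the result. The variable names direct, own_text and span_text are illustrative:

```python
from bs4 import BeautifulSoup

html = """<h1 class="product-name elim-suites">Chantecaille
<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.product-name.elim-suites")

# recursive=False limits the search to text nodes that are direct children
# of the h1, so the span's own text is excluded.
direct = [s.strip() for s in h1.find_all(string=True, recursive=False)]
direct = [s for s in direct if s]  # drop fragments that were only whitespace

own_text = " ".join(direct)
span_text = h1.span.get_text(strip=True)

print(own_text)
print(span_text)
```

Keeping the fragments in a list before joining also makes it possible to treat the text before and after the span separately if the page layout calls for it.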
