BeautifulSoup .get_text() is not specific enough for my HTML parsing

Question

Given the HTML below, I want to output just the text of the h1, without the "Details about " prefix, which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

Here is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

Here is my current code:

for line in soup.find_all('h1',attrs={'itemprop':'name'}):
    print(line.get_text())

Note: I do not want to simply truncate the string, because I would like this code to be reusable. Ideally, the code would strip out any text that is enclosed by the span.

Answer 1

You can use extract() to remove all span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Detach every nested span from the tree so its text is no longer returned
    for s in line('span'):
        s.extract()
    print(line.get_text())
    # => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
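
Note that extract() modifies the parsed tree in place, so the spans are gone from soup afterwards. If the original tree needs to stay intact, one option is to strip the spans from a copy instead. The following is only a minimal sketch, assuming the HTML from the question and a Beautiful Soup version that supports copy.copy() on tags (4.4 or later); title_without_spans is a hypothetical helper name, not part of the library.

```python
import copy
from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')


def title_without_spans(h1):
    # Hypothetical helper: operate on a detached copy so the original
    # tree keeps its span elements.
    clone = copy.copy(h1)
    for span in clone.find_all('span'):
        span.decompose()  # drop the span and its text from the copy
    return clone.get_text(strip=True)


titles = [title_without_spans(h1)
          for h1 in soup.find_all('h1', attrs={'itemprop': 'name'})]
print(titles[0])
```

After this runs, soup still contains the span, so later code can read the hidden prefix if it needs to.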

Answer 2

One solution is to check whether each child of the h1 contains HTML:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child; if a tag is found, it is markup, so skip it
        if BeautifulSoup(str(content), "html.parser").find() is not None:
            continue

        print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Skip child elements (such as the span); keep only bare text nodes
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
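
Since the question asks for something reusable, the isinstance check can be wrapped in a small function. This is just a sketch under the same assumptions as above: own_text is a hypothetical helper name, and it tests for NavigableString (excluding comments) rather than skipping Tag, which amounts to the same thing for this HTML.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def own_text(tag):
    # Hypothetical helper: join only the text nodes that are direct
    # children of `tag`, ignoring nested elements and HTML comments.
    parts = [child for child in tag.children
             if isinstance(child, NavigableString)
             and not isinstance(child, Comment)]
    return ''.join(parts).strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

h1 = soup.find('h1', attrs={'itemprop': 'name'})
print(own_text(h1))
```

Unlike extract(), this leaves the parsed tree untouched, and it ignores any nested element, not only span.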

Solution 1: score 5, accepted, 2015-07-16 22:23:17

Solution 2: score 0, 2015-07-16 21:18:48
