Python - Using bs4 on nested HTML tags

Question

I need to print the words USA and Canada from this HTML:

<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
</div>

How do I get these words with bs4? I Googled it, but I didn't find anything useful.

Answer 1

If that is all the HTML you have, you can call get_text() on each <a> tag. Try this:

from bs4 import BeautifulSoup
html="""<div class="txt-block">
    <h4 class="inline">Country:</h4>
        <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
              <span class="ghost">|</span>
        <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
    </div>"""
soup = BeautifulSoup(html, 'html.parser')
# Print the text of every <a> tag; a bare expression only shows output in a REPL.
print([atag.get_text() for atag in soup.find_all('a')])
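
On a full page, searching for every <a> tag would also pick up unrelated links. A minimal sketch, assuming the countries always sit inside the div with class txt-block as in the question's snippet, is to scope the search with a CSS selector via select(). The extra "Home" link below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the question's block plus one unrelated link.
html = """<a href="/home">Home</a>
<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Only <a> tags nested inside <div class="txt-block"> are matched.
countries = [a.get_text(strip=True) for a in soup.select("div.txt-block a")]
print(countries)
```

The "Home" link is skipped because it is not a descendant of the txt-block div.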

Answer 2

To get the text, the following code will work:

from bs4 import BeautifulSoup
html_string = """<div class="txt-block">
    <h4 class="inline">Country:</h4>
        <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
              <span class="ghost">|</span>
        <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
    </div>"""

# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(html_string, 'html.parser')
print([node.string for node in soup.find_all('a', attrs={"itemprop" : "url"})] )

The code above prints the following on Python 2 (Python 3 shows the same list without the u prefixes):

[u'USA', u'Canada']
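
One caveat with nested tags: .string and get_text() behave differently once a tag has more than one child. .string returns None in that case, while get_text() concatenates the text of all descendants. A small sketch, using a made-up link whose text is split by a nested <b> tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the link text is split across a nested <b> tag.
snippet = '<a itemprop="url"><b>United</b> States</a>'
link = BeautifulSoup(snippet, "html.parser").a

string_value = link.string    # None, because the tag has two child nodes
text_value = link.get_text()  # all descendant text joined together
print(string_value, text_value)
```

For the HTML in the question each <a> holds a single text node, so both approaches give the same result.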

See the BeautifulSoup documentation for details. The library is simple and straightforward to use.

Alternatively, you can do the same thing with lxml directly, which can be about an order of magnitude faster than BeautifulSoup.

from lxml import html
html_string = """<div class="txt-block">
    <h4 class="inline">Country:</h4>
        <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
              <span class="ghost">|</span>
        <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
    </div>"""

root = html.fromstring(html_string)
print(root.xpath('//a[@itemprop="url"]//text()'))

This also prints:

['USA', 'Canada']
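
If the real page has other itemprop="url" links, the XPath can be narrowed to the enclosing block as well. The following is a sketch under the assumption that the countries are direct children of the div with class txt-block; it uses text_content() so that any text nested deeper inside a link would still be collected.

```python
from lxml import html

html_string = """<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
</div>"""

root = html.fromstring(html_string)

# Select only <a> elements that are direct children of the txt-block div.
links = root.xpath('//div[@class="txt-block"]/a')

# text_content() returns all text inside the element, including nested tags.
countries = [link.text_content().strip() for link in links]
print(countries)
```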

Answer 3

The simple findAll method (the legacy alias of find_all) can be used to extract the country names individually. Here is the solution in Python 3:

from bs4 import BeautifulSoup
html ="""
<div class="txt-block">
    <h4 class="inline">Country:</h4>
    <a href="/search/title?country_of_origin=us&amp;ref_=tt_dt_dt" itemprop="url">USA</a>
    <span class="ghost">|</span>
    <a href="/search/title?country_of_origin=ca&amp;ref_=tt_dt_dt" itemprop="url">Canada</a>
</div>
"""
soup = BeautifulSoup(html,"html.parser")
for i in soup.findAll("a"):
    print(i.text)

Running the code above gives you the desired result:

USA
Canada
