How to print the content of tags in order in BeautifulSoup?

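I am trying to get the text from the HTML page posted below. I tried a loop, but it does not print the strings in order. I need to print the strings in the following order: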

            text1:text_1.1
            text2:text2.2,2.2
            ...

That is the output I need to print.

<ul> 
   <li>
    <b>text1:</b>
    <a><a href="search.php?origin=">text_1.1</a>
    </li>
   <li>
    <b>text2</b>
    <a href="search.php?origin=">text_2.1</a>
    <a href="search.php?origin">text_2.2</a>
   </li>
   <li>
    <b>text4</b>
    <a href="search.php?origin=">text_4.1</a>
    <a href="search.php?origin=">text_4.2</a>
    <a href="search.php?origin=">text_4.3</a>
    <a href="search.php?origin=">text_4.4</a>
   </li>
   <li>
    <b>text5</b>
     <a href="search.php?origin=">text5.1</a>
   </li>
    <li>
    <b>text6</b>
    <a href="search.php?origin=">text6.1</a>
    <a href="search.php?origin=">text6.2</a>
    <a href="search.php?origin=">text6.3</a>
   <li>
    <b>text7</b>
    <a href="search.php?origin=">text7.1</a>
    <font color="green">text7.2</font>          
    </li>
   <li>
    <b>text8</b>
    <a href="dwres.php?resource=">2 </a>
  </ul>

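Find all the <li> elements, so that you can group their contents by the <b> tag. You probably want to use a dictionary to map them, but to preserve document order you can use a collections.OrderedDict() object: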

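from collections import OrderedDict

# `soup` is a BeautifulSoup object parsed from the HTML in the question.
results = OrderedDict()

for li in soup.find_all('li'):
    bold = li.b
    if bold is None:
        # no <b> label in this item, nothing to group under
        continue
    # map the label text to the list of link texts in the same <li>
    results[bold.get_text(strip=True)] = [
        link.get_text(strip=True) for link in li.find_all('a')
    ]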

Demo:

>>> from bs4 import BeautifulSoup
>>> from collections import OrderedDict
>>> soup = BeautifulSoup('''\
...     <ul> 
...        <li>
...         <b>text1:</b>
...         <a><a href="search.php?origin=">text_1.1</a>
...         </li>
...        <li>
...         <b>text2</b>
...         <a href="search.php?origin=">text_2.1</a>
...         <a href="search.php?origin">text_2.2</a>
...        </li>
...        <li>
...         <b>text4</b>
...         <a href="search.php?origin=">text_4.1</a>
...         <a href="search.php?origin=">text_4.2</a>
...         <a href="search.php?origin=">text_4.3</a>
...         <a href="search.php?origin=">text_4.4</a>
...        </li>
...        <li>
...         <b>text5</b>
...          <a href="search.php?origin=">text5.1</a>
...        </li>
...         <li>
...         <b>text6</b>
...         <a href="search.php?origin=">text6.1</a>
...         <a href="search.php?origin=">text6.2</a>
...         <a href="search.php?origin=">text6.3</a>
...        <li>
...         <b>text7</b>
...         <a href="search.php?origin=">text7.1</a>
...         <font color="green">text7.2</font>          
...         </li>
...        <li>
...         <b>text8</b>
...         <a href="dwres.php?resource=">2 </a>
...       </ul>
... ''')
>>> results = OrderedDict()
>>> for li in soup.find_all('li'):
...     bold = li.b
...     if bold is None:
...         continue
...     results[bold.get_text(strip=True)] = [
...         link.get_text(strip=True) for link in li.find_all('a')
...     ]
... 
>>> results
OrderedDict([(u'text1:', [u'', u'text_1.1']), (u'text2', [u'text_2.1', u'text_2.2']), (u'text4', [u'text_4.1', u'text_4.2', u'text_4.3', u'text_4.4']), (u'text5', [u'text5.1']), (u'text6', [u'text6.1', u'text6.2', u'text6.3']), (u'text7', [u'text7.1']), (u'text8', [u'2'])])
>>> for key, elems in results.items():
...     print '{}: {}'.format(key, ', '.join(elems))
... 
text1:: , text_1.1
text2: text_2.1, text_2.2
text4: text_4.1, text_4.2, text_4.3, text_4.4
text5: text5.1
text6: text6.1, text6.2, text6.3
text7: text7.1
text8: 2

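You could integrate the print into the loop, but by building a dictionary you can now do further processing: write it to a file, send it elsewhere, and so on.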
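The demo session above was recorded under Python 2 (print statement, u'' prefixes). As a rough sketch of the "further processing" idea under Python 3, the snippet below formats the mapping and serializes it to JSON. The `results` values are copied from part of the demo output rather than parsed, and the file name in the final comment is purely hypothetical.

```python
import json
from collections import OrderedDict

# Values copied from the demo output above; in practice `results`
# would be filled by the parsing loop.
results = OrderedDict([
    ("text2", ["text_2.1", "text_2.2"]),
    ("text5", ["text5.1"]),
    ("text7", ["text7.1"]),
])

# One "key: a, b" line per entry, in insertion (document) order.
lines = ["{}: {}".format(key, ", ".join(elems)) for key, elems in results.items()]
print("\n".join(lines))

# The same mapping as JSON text; key order is preserved on output.
as_json = json.dumps(results, ensure_ascii=False)
print(as_json)

# Hypothetical file name, shown only to illustrate writing the result out:
# with open("results.json", "w", encoding="utf-8") as fh:
#     fh.write(as_json)
```

On Python 3.7 and later a plain dict also keeps insertion order, so OrderedDict is optional there.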

from bs4 import BeautifulSoup

# `html` is a string holding the markup from the question.
for t in BeautifulSoup(html).find("ul").find_all("a"):
    print(t.text)

Output:

text_1.1
text_2.1
text_2.2
text_4.1
text_4.2
text_4.3
text_4.4
text5.1
text6.1
text6.2
text6.3
text7.1

If you need the text of both the a and b tags:

ul = BeautifulSoup(html).find("ul")
b = [b.text for b in ul.find_all("b")]
a = [a.text for a in ul.find_all("a")]

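You will need to decide how to match the two lists up to get your output, since the number of a tags that follow each b tag varies.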
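To see why two flat lists are awkward to recombine, here is a small sketch with values taken from the first three items of the question's markup (hard-coded here, not parsed). The names `b_texts` and `a_texts` are illustrative. Pairing by position with zip drifts as soon as one label has more than one link:

```python
# Label texts and link texts as two flat lists, in document order.
b_texts = ["text1:", "text2", "text4"]
a_texts = ["text_1.1", "text_2.1", "text_2.2", "text_4.1"]

# Positional pairing ignores which <li> each link came from.
pairs = list(zip(b_texts, a_texts))
for label, link in pairs:
    print(label, link)

# The third pair shows the drift: "text4" ends up next to "text_2.2".
```

Grouping per li element avoids this, because each label stays with the links from its own list item.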

You can also get the li tags, access the a and b tags inside each one, and use join to get something close to what you want:

ul = BeautifulSoup(html).find("ul")
li = ul.find_all("li")

for ele in li:
    print("{}:{}".format(ele.b.text, "".join([a.text for a in ele.find_all("a")])))

Output:

text1::text_1.1
text2:text_2.1text_2.2
text4:text_4.1text_4.2text_4.3text_4.4
text5:text5.1
text6:text6.1text6.2text6.3
text7:text7.1
text8:2 
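The output above runs the link texts together because the separator is an empty string. A possible variant, sketched under a few assumptions, joins them with a comma, which is closer to the format asked for in the question. It uses a trimmed, well-formed excerpt of the markup, names Python's built-in "html.parser" explicitly, and adds one list item without a b tag (not in the original markup) only to exercise the guard.

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed excerpt of the question's markup. The last <li>
# is an illustrative addition, not part of the original page.
html = """
<ul>
  <li><b>text2</b>
    <a href="search.php?origin=">text_2.1</a>
    <a href="search.php?origin">text_2.2</a>
  </li>
  <li><b>text5</b>
    <a href="search.php?origin=">text5.1</a>
  </li>
  <li>no bold tag here</li>
</ul>
"""

# Naming the parser explicitly keeps the result the same across machines.
soup = BeautifulSoup(html, "html.parser")

lines = []
for li in soup.find("ul").find_all("li"):
    if li.b is None:  # skip items that have no <b> label
        continue
    label = li.b.get_text(strip=True)
    links = [a.get_text(strip=True) for a in li.find_all("a")]
    lines.append("{}:{}".format(label, ",".join(links)))

print("\n".join(lines))
```

Because `find_all("li")` returns elements in document order, the lines come out in the same order as the list items on the page.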

