
How can I iterate through tags using Python?

I want to iterate through some HTML and store the data in a dictionary. Each iteration begins with:

<h1 class="docDisplay" id="docTitle">


from BeautifulSoup import BeautifulSoup

html = '<html><body><h1 class="docDisplay" id="docTitle">Data1</h1><p>other data<\p><h1 class="docDisplay" id="docTitle">Data2</h1><p>other data2<\p></html>'

soup = BeautifulSoup(html)
newdoc = soup.find('h1', id="docTitle")
title = newdoc.findNext(text=True)
data = title.findAllNext('p',text=True)
data_dict = {}
data_dict[title] = {'data': data}
print data_dict

Right now, the output is:

{u'Data1': {'data': [u'other data<\\p>', u'Data2', u'other data2<\\p>']}}

I would like the output to be:

{u'Data1': {'data': [u'other data<\\p>']}, u'Data2': {'data': [u'other data2<\\p>']}}

I can't figure out how to start over once I reach a new h1 tag. Any ideas?

To match the paragraph text under each header, I would try something like this (you may have to modify it depending on the exact output format you want):

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

html = """
  <h1 class="docDisplay" id="docTitle">Data1</h1>
  <p>other data</p>
  <p>Another paragraph under the first heading.</p>
  <h1 class="docDisplay" id="docTitle">Data2</h1>
  <p>other data2</p>
  <div><p>This paragraph is NOT a sibling of the header</p></div>
"""

soup = BeautifulSoup(html)

data_dict = {}
stuff_under_current_heading = []

firstHeader = soup.find('h1', id="docTitle")
for tag in [firstHeader] + firstHeader.findNextSiblings():
    if tag.name == 'h1':
        # Start a fresh list for each new heading.
        stuff_under_current_heading = []
        # I chose to strip excess whitespace from the header name:
        data_dict[tag.string.strip()] = {'data': stuff_under_current_heading}
        # Modifying the list modifies the value in the dictionary.
    # Take every <p> tag encountered between here and the next heading
    # and associate it with the most recently-seen <h1> tag.
    elif tag.name == 'p':
        stuff_under_current_heading.extend(tag.findAll(text=True))
    # Include <p> tags that are not siblings of the <h1> tag but
    # are still part of the content under the header.
    else:
        stuff_under_current_heading.extend(tag.findAll('p', text=True))

print data_dict


{u'Data1': {'data': [u'other data', u'Another paragraph under the first heading.']},   
 u'Data2': {'data': [u'other data2', u'This paragraph is NOT a sibling of the header']}}
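
For readers on Python 3, where the `from BeautifulSoup import BeautifulSoup` import above is not available, the same grouping idea can be sketched with only the standard library's `html.parser`. This is not the method used in the answer, just an illustration that assumes only the text inside `<h1>` and `<p>` tags matters. The class name `HeadingGrouper` and its attributes are made up for this example.

```python
from html.parser import HTMLParser


class HeadingGrouper(HTMLParser):
    """Collect the text of every <p> under the most recently seen <h1>."""

    def __init__(self):
        super().__init__()
        self.data_dict = {}
        self._current = None   # list belonging to the heading being filled
        self._open_tag = None  # 'h1' or 'p' while the parser is inside one

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'p'):
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open_tag == 'h1':
            # A new heading starts a fresh list, as in the loop above.
            self._current = []
            self.data_dict[text] = {'data': self._current}
        elif self._open_tag == 'p' and self._current is not None:
            # Nested <p> tags (e.g. inside a <div>) are picked up too.
            self._current.append(text)


html = """
<h1 class="docDisplay" id="docTitle">Data1</h1>
<p>other data</p>
<p>Another paragraph under the first heading.</p>
<h1 class="docDisplay" id="docTitle">Data2</h1>
<p>other data2</p>
<div><p>This paragraph is NOT a sibling of the header</p></div>
"""

grouper = HeadingGrouper()
grouper.feed(html)
grouper.close()
print(grouper.data_dict)
```

Because the parser is event driven, it does not care whether a `<p>` is a sibling of the heading or nested deeper, so no separate branch is needed for the `<div>` case.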

@samplebias: @Lynch is right. If the OP doesn't close their tags properly, they simply can't expect the parser to read their mind.

Try fixing your HTML and it might work. =)
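
The fix being suggested is small: the question's markup closes its paragraphs with `<\p>` instead of `</p>`. A minimal sketch of the repair, assuming the backslash form is the only defect in the input (the variable names `broken` and `fixed` are illustrative):

```python
# The markup from the question, with its malformed "<\p>" closers.
broken = ('<html><body><h1 class="docDisplay" id="docTitle">Data1</h1>'
          '<p>other data<\\p><h1 class="docDisplay" id="docTitle">Data2</h1>'
          '<p>other data2<\\p></html>')

# Swap each backslash closer for a real closing tag before parsing.
fixed = broken.replace('<\\p>', '</p>')
print(fixed)
```

In practice it is better to correct the HTML at its source than to patch it like this, but the snippet shows why the parser was treating `<\p>` as ordinary text.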

