How can I iterate through tags using Python?
I would like to iterate through some HTML and store data into a dictionary. Each iteration starts with:
<h1 class="docDisplay" id="docTitle">
I have the following code:
html = '<html><body><h1 class="docDisplay" id="docTitle">Data1</h1><p>other data<\p><h1 class="docDisplay" id="docTitle">Data2</h1><p>other data2<\p></html>'
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
newdoc = soup.find('h1', id="docTitle")
title = newdoc.findNext(text=True)
data = title.findAllNext('p',text=True)
data_dict = {}
data_dict[title] = {'data': data}
print data_dict
Right now, the output is:
{u'Data1': {'data': [u'other data<\\p>', u'Data2', u'other data2<\\p>']}}
I would like the output to be:
{u'Data1': {'data': [u'other data<\\p>']}, u'Data2': {'data': [u'other data2<\\p>']}}
I can't figure out how to start again once I reach a new h1 tag. Any ideas?
To match the text of the paragraphs under each header, I would try something like this (you may have to modify this depending on the exact output format that you want):
from BeautifulSoup import BeautifulSoup
html = """
<html>
<head>
</head>
<body>
<h1 class="docDisplay" id="docTitle">Data1</h1>
<p>other data</p>
<p>Another paragraph under the first heading.</p>
<h1 class="docDisplay" id="docTitle">Data2</h1>
<p>other data2</p>
<div><p>This paragraph is NOT a sibling of the header</p></div>
</body>
</html>
"""
soup = BeautifulSoup(html)
data_dict = {}
stuff_under_current_heading = []
firstHeader = soup.find('h1', id="docTitle")
for tag in [firstHeader] + firstHeader.findNextSiblings():
    if tag.name == 'h1':
        stuff_under_current_heading = []
        # I chose to strip excess whitespace from the header name:
        data_dict[tag.string.strip()] = {'data': stuff_under_current_heading}
        # Modifying the list modifies the value in the dictionary.
    # Take every <p> tag encountered between here and the next heading
    # and associate it with the most recently-seen <h1> tag.
    elif tag.name == 'p':
        stuff_under_current_heading.append(tag.string)
    # Include <p> tags that are not siblings of the <h1> tag but
    # are still part of the content under the header.
    else:
        stuff_under_current_heading.extend(tag.findAll('p', text=True))
print data_dict
This outputs:
{u'Data1': {'data': [u'other data', u'Another paragraph under the first heading.']},
u'Data2': {'data': [u'other data2', u'This paragraph is NOT a sibling of the header']}}
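For readers on Python 3 who do not have the legacy BeautifulSoup 3 package, the same grouping idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the approach used in the answer above. HeadingGrouper is a hypothetical helper name, and the sketch assumes that every <h1> starts a new group (it does not check the id attribute) and that the HTML is well-formed.

```python
from html.parser import HTMLParser


class HeadingGrouper(HTMLParser):
    """Hypothetical helper: collect <p> text under the most recent <h1>."""

    def __init__(self):
        super().__init__()
        self.data_dict = {}
        self._current = None  # list belonging to the most recent heading
        self._in_tag = None   # 'h1' or 'p' while inside one, otherwise None
        self._buffer = []     # text fragments of the tag being read

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'p'):
            self._in_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside an <h1> or a <p>.
        if self._in_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._in_tag:
            return
        text = ''.join(self._buffer).strip()
        if tag == 'h1':
            # Start a fresh list; later appends show up in the dictionary
            # because the dictionary holds a reference to the same list.
            self._current = []
            self.data_dict[text] = {'data': self._current}
        elif self._current is not None:
            self._current.append(text)
        self._in_tag = None


html = """
<html>
<body>
<h1 class="docDisplay" id="docTitle">Data1</h1>
<p>other data</p>
<p>Another paragraph under the first heading.</p>
<h1 class="docDisplay" id="docTitle">Data2</h1>
<p>other data2</p>
<div><p>This paragraph is NOT a sibling of the header</p></div>
</body>
</html>
"""

parser = HeadingGrouper()
parser.feed(html)
parser.close()
print(parser.data_dict)
```

Because the parser reports tags in document order, a <p> nested inside a <div> is picked up without a separate branch for non-sibling tags.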
@samplebias: @Lynch is right. If the OP doesn't close his/her tags properly, they simply cannot expect the parser to be able to read their mind.
Try fixing your HTML, it will probably work then. =)
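To illustrate the point made in these comments, here is a small sketch showing that `<\p>` (backslash) is not recognized as a closing tag, while `</p>` is. It uses Python 3's standard html.parser rather than BeautifulSoup, so the details of how BeautifulSoup recovers from the bad markup may differ. EndTagRecorder and closing_tags are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class EndTagRecorder(HTMLParser):
    """Hypothetical helper: record every closing tag the parser recognizes."""

    def __init__(self):
        super().__init__()
        self.end_tags = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def closing_tags(markup):
    """Return the names of the closing tags found in markup."""
    recorder = EndTagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.end_tags


# Raw strings keep the backslash exactly as it appears in the question.
broken = closing_tags(r'<p>other data<\p>')
fixed = closing_tags(r'<p>other data</p>')

print(broken)  # no closing tag is seen; '<\p>' is treated as plain text
print(fixed)   # the closing </p> is recognized
```

This matches the question's output, where the literal text `<\\p>` shows up inside the extracted strings instead of ending the paragraph.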