Unable to parse HTML by BeautifulSoup / LXML
I have an HTML page and I want to extract some items from it. I am finding it hard to apply BeautifulSoup or lxml.
HTML page:
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
How can I get all three IDs and titles, each pair in its own dictionary, within a list?
[{'id':'t1', 'title': 'Title to scrape'}, {'id':'t2', 'title': 'Title2 to scrape'}, {'id':'t3', 'title': 'Title3 to scrape'}]
All the titles and IDs you need are located inside the <div> tags with the class="episode" attribute. So, your job is to iterate over all of those tags and get the 'data-id' attribute of each div tag and the text of its inner span tag.
Code:
from bs4 import BeautifulSoup

html = '''
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
'''
soup = BeautifulSoup(html, 'lxml')

# Build one {'id', 'title'} dict per <div class="episode">.
title_list = []
for ep in soup.find_all('div', class_='episode'):
    curr_dict = {'id': ep['data-id'], 'title': ep.span.text}
    title_list.append(curr_dict)

print(title_list)
Output:
[{'id': 't1', 'title': 'Title to scrape'},
{'id': 't2', 'title': 'Title2 to scrape'},
{'id': 't3', 'title': 'Title3 to scrape'}]
Or, the same can be done using a list comprehension:
title_list = [{'id': ep['data-id'], 'title': ep.span.text} for ep in soup.find_all('div', class_='episode')]
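If the real page is less regular than the sample, ep['data-id'] raises a KeyError when the attribute is missing, and ep.span.text raises an AttributeError when there is no inner span. The following is only a sketch of a more defensive variant, not part of the original answer. It assumes that incomplete entries should simply be skipped, it uses CSS selectors through select() and select_one(), and it swaps in the built-in 'html.parser' so that lxml does not have to be installed. The function name extract_episodes and the modified sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup, altered for illustration: the second card has no data-id
# and the third card has no inner <span>.
html = '''
<li class="context-card">
  <div class="episode" data-id="t1">
    <span class="av-play">Title to scrape</span>
  </div>
</li>
<li class="context-card">
  <div class="episode">
    <span class="av-play">No id here</span>
  </div>
</li>
<li class="context-card">
  <div class="episode" data-id="t3"></div>
</li>
'''


def extract_episodes(markup, parser='html.parser'):
    """Return a list of {'id', 'title'} dicts, skipping incomplete entries.

    Hypothetical helper: the name and the skip-on-missing policy are
    assumptions, not something the original answer specifies.
    """
    soup = BeautifulSoup(markup, parser)
    results = []
    # Only match episode divs that actually carry a data-id attribute.
    for ep in soup.select('div.episode[data-id]'):
        span = ep.select_one('span.av-play')
        if span is None:
            # No title element inside this div, so skip it.
            continue
        results.append({'id': ep['data-id'],
                        'title': span.get_text(strip=True)})
    return results


episodes = extract_episodes(html)
print(episodes)
```

Passing parser='lxml' to the helper should behave the same way on markup like this, provided lxml is installed.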