Unable to parse HTML with BeautifulSoup / lxml

Question

I have an HTML page and I want to extract some items from it. I am finding it hard to do this with BeautifulSoup or lxml.

HTML page:

<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>

How can I get all three IDs and titles as a list of dictionaries, one dictionary per item?

[{'id':'t1', 'title': 'Title to scrape'}, {'id':'t2', 'title': 'Title2 to scrape'}, {'id':'t3', 'title': 'Title3 to scrape'}]

Answer 1

All the titles and IDs you need are located inside the <div> tags with the class="episode" attribute. So, your job is to iterate over all of those tags and get the 'data-id' of each div tag and the text of its inner span tag.

Code:

from bs4 import BeautifulSoup

html = '''
<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>
'''
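# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')

title_list = []
# Each <div class="episode"> carries the ID and wraps the title <span>
for ep in soup.find_all('div', class_='episode'):
    curr_dict = {'id': ep['data-id'], 'title': ep.span.text}
    title_list.append(curr_dict)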

print(title_list)

Output:

[{'id': 't1', 'title': 'Title to scrape'},
 {'id': 't2', 'title': 'Title2 to scrape'},
 {'id': 't3', 'title': 'Title3 to scrape'}]

Or, the same can be done using a list comprehension:

title_list = [{'id': ep['data-id'], 'title': ep.span.text} for ep in soup.find_all('div', class_='episode')]
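
Since the question also mentions lxml, here is a minimal sketch of the same extraction using lxml directly with XPath. It is not part of the original answer. The names `markup`, `root` and `episodes` are chosen for illustration, and it assumes the markup matches the snippet in the question exactly (a `class` attribute equal to "episode" and the title in the first `<span>` child of each div).

```python
from lxml import html as lxml_html

# Same markup as in the question: three sibling <li> fragments, no <html> wrapper
markup = '''
<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>
'''

# fromstring() accepts a fragment with several top-level elements
# and returns a single element that contains them
root = lxml_html.fromstring(markup)

episodes = []
for div in root.xpath('.//div[@class="episode"]'):
    title = div.findtext('span')  # text of the first <span> child, or None
    episodes.append({
        'id': div.get('data-id'),  # None if the attribute is missing
        'title': title.strip() if title else None,
    })

print(episodes)
```

Unlike `ep['data-id']` and `ep.span.text` in the BeautifulSoup version, `get()` and `findtext()` return None instead of raising when an attribute or span is missing, which may be more forgiving on real pages.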

Solution 1
0 Accepted 2018-04-03 11:59:13