
Unable to parse HTML with BeautifulSoup / lxml

I have an HTML page and I want to extract some items from it, but I am finding it hard to apply BeautifulSoup or lxml.

HTML page:

<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>

How can I get all three IDs and titles as a list of dictionaries, one dictionary per item?

[{'id':'t1', 'title': 'Title to scrape'}, {'id':'t2', 'title': 'Title2 to scrape'}, {'id':'t3', 'title': 'Title3 to scrape'}]

All the titles and IDs you need are located inside the <div> tags with the class="episode" attribute. So, iterate over all of those tags and get the 'data-id' attribute of each div and the text of its inner <span> tag.

Code:

from bs4 import BeautifulSoup

html = '''
<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>
'''
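# The 'lxml' parser requires the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(html, 'lxml')

title_list = []
# Each <div class="episode"> carries the ID; its inner <span> holds the title.
for ep in soup.find_all('div', class_='episode'):
    curr_dict = {'id': ep['data-id'], 'title': ep.span.text}
    title_list.append(curr_dict)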

print(title_list)

Output:

[{'id': 't1', 'title': 'Title to scrape'},
 {'id': 't2', 'title': 'Title2 to scrape'},
 {'id': 't3', 'title': 'Title3 to scrape'}]

Or, the same can be done using a list comprehension:

title_list = [{'id': ep['data-id'], 'title': ep.span.text} for ep in soup.find_all('div', class_='episode')]
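
Since the question also mentions lxml, here is a rough sketch of the same extraction using lxml directly with XPath instead of BeautifulSoup. It assumes the lxml package is installed; the helper name extract_episodes and the variable page are made up for illustration and are not part of the original answer. The markup is shortened to two items.

```python
from lxml import html as lxml_html

# Shortened copy of the sample markup from the question.
page = '''
<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
'''


def extract_episodes(markup):
    """Hypothetical helper: return a list of {'id': ..., 'title': ...} dicts."""
    tree = lxml_html.fromstring(markup)
    episodes = []
    # Exact match on class="episode"; a div carrying several classes
    # would need something like contains(@class, "episode") instead.
    for ep in tree.xpath('//div[@class="episode"]'):
        title = ep.findtext('span')  # text of the first child <span>, or None if absent
        if title is None:
            continue  # skip episode divs that have no <span> child
        episodes.append({'id': ep.get('data-id'), 'title': title.strip()})
    return episodes


print(extract_episodes(page))
```

Unlike ep['data-id'] and ep.span.text in the BeautifulSoup version, this sketch skips a div with no span rather than raising an exception, and ep.get('data-id') gives None when the attribute is missing. Whether skipping or failing loudly is preferable depends on how regular the real page is.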
