How to extract info in the tag by looking for a tag inside of that tag?
Say I want to extract the "24 min. per ep." info or the N13 value under Rating. This is just part of the code, and some of the span tags have a class other than dark_text. When I search for the tag that holds, say, Rating, I can find it but I can't extract what the rating actually is, because N13 sits under the div tag, not the span. Since I'm searching for 'Rating' or 'Duration', though, I have to look for the 'span' tag. And Beautiful Soup doesn't allow you to do findAll('div').findAll('span', {'class':'...'}), so I can't get back to the div tag once it finds the span tag I'm looking for.
When I do a for loop it prints out all these additional Nones, among other stuff. Does anyone have any tips on how to parse this well?
The question is really just how to look for something in a <span> tag that is under a div tag and, once it is located, extract the entire div tag, or preferably only what is in the div tag but not in the span tag. This has turned out to be more complicated than I anticipated.
from bs4 import BeautifulSoup
x= '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
2
</div>
<div class="spaceit">
<span class="dark_text">Duration:</span>
24 min. per ep.
</div>
<div>
<span class="dark_text">Rating:</span>
N13
</div>'''
bs = BeautifulSoup(x, 'html.parser')
You can use the next_sibling attribute to get the text located immediately after the span tag. To get the span tag itself, use find('span', class_='dark_text', text='Duration:'). (In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string.)
Wrapped in a simple function, it looks like this:
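from bs4 import BeautifulSoup

def get_next_text(soup, text):
    # Find the label span, then return the text node that follows it.
    return soup.find('span', class_='dark_text', text=text).next_sibling

# html is the HTML string from the question (assigned to x there).
soup = BeautifulSoup(html, 'lxml')
duration = get_next_text(soup, 'Duration:')
print('Duration:', duration.strip())
rating = get_next_text(soup, 'Rating:')
print('Rating:', rating.strip())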
Output:
Duration: 24 min. per ep.
Rating: N13
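The function above raises AttributeError when a label is not on the page, because find() returns None in that case. That may also be where the stray None values in the question's loop come from. Here is a minimal sketch of a more defensive variant. The name get_next_text_or_none and the 'Studio:' label are made up for illustration, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, repeated so the snippet runs on its own.
html = '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
2
</div>
<div class="spaceit">
<span class="dark_text">Duration:</span>
24 min. per ep.
</div>
<div>
<span class="dark_text">Rating:</span>
N13
</div>'''

def get_next_text_or_none(soup, label):
    """Return the stripped text after the label span, or None if it is absent."""
    span = soup.find('span', class_='dark_text', string=label)
    if span is None or span.next_sibling is None:
        return None
    return str(span.next_sibling).strip()

soup = BeautifulSoup(html, 'html.parser')
print(get_next_text_or_none(soup, 'Duration:'))
print(get_next_text_or_none(soup, 'Studio:'))  # hypothetical label, not in the markup
```

Checking for None before touching next_sibling keeps a loop over many labels from crashing on the first missing one.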
If you want the whole div tag that contains the text you want, you can use .parent.
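from bs4 import BeautifulSoup

def get_parent(soup, text):
    # Find the label span, then step up to the div that encloses it.
    return soup.find('span', class_='dark_text', text=text).parent

# html is the HTML string from the question (assigned to x there).
soup = BeautifulSoup(html, 'lxml')
duration = get_parent(soup, 'Duration:')
print(duration)
rating = get_parent(soup, 'Rating:')
print(rating)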
Output:
<div class="spaceit">
<span class="dark_text">Duration:</span>
24 min. per ep.
</div>
<div>
<span class="dark_text">Rating:</span>
N13
</div>
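The question also asks for what is in the div but not in the span. One way to get that, sketched here as an assumption rather than the only approach, is to start from the parent div and keep only its direct text nodes with find_all(string=True, recursive=False). The helper name own_text and the info dictionary are made up for illustration.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, repeated so the snippet runs on its own.
html = '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
2
</div>
<div class="spaceit">
<span class="dark_text">Duration:</span>
24 min. per ep.
</div>
<div>
<span class="dark_text">Rating:</span>
N13
</div>'''

def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    parts = tag.find_all(string=True, recursive=False)
    return ' '.join(p.strip() for p in parts if p.strip())

soup = BeautifulSoup(html, 'html.parser')

# Map every dark_text label to the text of its enclosing div, minus the label.
info = {}
for span in soup.find_all('span', class_='dark_text'):
    label = span.get_text(strip=True).rstrip(':')
    info[label] = own_text(span.parent)

print(info)
```

Because recursive=False only looks at the div's immediate children, the label text inside the span is left out and only the value remains.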