
How to extract info in the tag by looking for a tag inside of that tag?

Say I want to extract the "24 min. per ep." value, or the "N13" value under Rating. This is only part of the page, and some of the span tags have a class other than dark_text. I can find the span that holds a label such as Rating, but I can't then extract the rating itself, because N13 sits in the enclosing div tag, not in the span; yet since I'm searching for 'Rating' or 'Duration', I have to search the span tags. Beautiful Soup doesn't let you chain findAll('div').findAll('span', {'class':'...'}), so once I have found the span I'm looking for, I can't get back to its div tag.

When I iterate over the results with a for loop, it prints a lot of extra None values, among other things. Does anyone have tips on how to parse this cleanly?

The question is really how to search for something in a <span> tag that sits inside a div tag and, once it is located, extract the entire div tag, or preferably only the text that is in the div but not in the span. This has turned out to be more complicated than I anticipated.

from bs4 import BeautifulSoup
x= '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
    2
  </div>
  <div class="spaceit">
  <span class="dark_text">Duration:</span>
    24 min. per ep.
    </div>
  <div>
  <span class="dark_text">Rating:</span>
    N13
    </div>'''


bs = BeautifulSoup(x, 'html.parser')

You can use the next_sibling attribute to get the text located immediately after the span tag. To find the span tag itself, use find('span', class_='dark_text', text='Duration:') (in Beautiful Soup 4.4.0 and later, the text argument can also be spelled string).

Wrapped in a simple function, it looks like this:

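from bs4 import BeautifulSoup

def get_next_text(soup, text):
    # Find the label span, then return the text node that directly follows it.
    return soup.find('span', class_='dark_text', text=text).next_sibling

# `html` is the markup string from the question (named `x` there).
# The 'lxml' parser needs the lxml package; 'html.parser' also works here.
soup = BeautifulSoup(html, 'lxml')
duration = get_next_text(soup, 'Duration:')
print('Duration:', duration.strip())  # strip the surrounding whitespace
rating = get_next_text(soup, 'Rating:')
print('Rating:', rating.strip())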

Output:

Duration: 24 min. per ep.
Rating: N13
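
The extra None values mentioned in the question typically appear because find returns None when nothing matches, and calling .next_sibling on that result then fails. The following is a minimal sketch of a guarded variant, not part of the original answer: the helper name get_next_text_or_none, the HTML constant (a trimmed copy of the question's markup), and the 'Genres:' label are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: same structure).
HTML = '''<div class="spaceit">
  <span class="dark_text">Duration:</span>
    24 min. per ep.
    </div>
  <div>
  <span class="dark_text">Rating:</span>
    N13
    </div>'''


def get_next_text_or_none(soup, label):
    """Return the stripped text after the label span, or None if absent."""
    span = soup.find('span', class_='dark_text', string=label)
    if span is None or span.next_sibling is None:
        return None  # label not on the page, or nothing follows the span
    return str(span.next_sibling).strip()


soup = BeautifulSoup(HTML, 'html.parser')
print(get_next_text_or_none(soup, 'Duration:'))
print(get_next_text_or_none(soup, 'Genres:'))  # hypothetical missing label
```

Returning None explicitly lets the caller decide whether to skip a missing field instead of hitting an AttributeError in the middle of a loop.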

If you want the whole div tag that contains the text, use .parent on the span.

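def get_parent(soup, text):
    # Find the label span, then step up to the div that encloses it.
    return soup.find('span', class_='dark_text', text=text).parent

# `html` is the markup string from the question (named `x` there).
soup = BeautifulSoup(html, 'lxml')
duration = get_parent(soup, 'Duration:')
print(duration)
rating = get_parent(soup, 'Rating:')
print(rating)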

Output:

<div class="spaceit">
<span class="dark_text">Duration:</span>
    24 min. per ep.
</div>
<div>
<span class="dark_text">Rating:</span>
    N13
</div>
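
The question also asks for the text that is in the div but not in the span. One possible approach, sketched here under assumptions and not taken from the original answer, is to locate the label string in any tag, climb to the enclosing div with find_parent, and join only the div's direct text children. The helper names own_text and value_for_label are hypothetical, and the Overall Rating link is simplified from the question's markup.

```python
from bs4 import BeautifulSoup

# Simplified copy of the question's markup (the onclick attribute is dropped).
HTML = '''<div>
<a href="#">Overall Rating</a>:
    2
  </div>
  <div class="spaceit">
  <span class="dark_text">Duration:</span>
    24 min. per ep.
    </div>
  <div>
  <span class="dark_text">Rating:</span>
    N13
    </div>'''


def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    return ''.join(tag.find_all(string=True, recursive=False)).strip()


def value_for_label(soup, label):
    """Return the div text that belongs to label, or None if the label is absent."""
    # Match the label text in any tag, ignoring whitespace and a trailing colon.
    node = soup.find(string=lambda s: s.strip().rstrip(':') == label)
    if node is None:
        return None
    div = node.find_parent('div')
    if div is None:
        return None
    # A colon may sit outside the label tag (as with "Overall Rating"), so drop it.
    return own_text(div).lstrip(':').strip()


soup = BeautifulSoup(HTML, 'html.parser')
for label in ('Overall Rating', 'Duration', 'Rating'):
    print(label, '->', value_for_label(soup, label))
```

Because the lookup matches on the label text, not on the tag name or class, it also covers labels that live in an a tag or in a span with a class other than dark_text.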
