Beautiful Soup - select the text of the next span element that has no class

Question

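I'm trying to scrape movie quotes from rottentomatoes.com with Beautiful Soup. The page source is odd in that each quote is directly preceded by a span with class "bold quote_actor", while the quote itself sits in a span with no class, for example at https://www.rottentomatoes.com/m/happy_gilmore/quotes/ [screenshot of the page source]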

I'd like to use Beautiful Soup's find_all to capture all the quotes without the actors' names. I've tried many things without success, for example:

moviequotes = soup(input)
for t in web_soup.findAll('span', {'class':'bold quote_actor'}):
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'class' in item.attrs and 'name' in item.attrs['class']:
                break
            print (item)

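I'd really appreciate any tips on how to navigate this markup and turn the resulting plain-text quotes into an object I can use with Pandas and the like.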

Answer 1

I'm using a CSS selector to find the spans that contain the quotes: div span + span. This matches any span element inside a div that immediately follows a sibling span element.

This also picks up spans that contain actor names, so I filter those out by checking whether they have a class or style attribute.

import bs4
import requests

url  = 'https://www.rottentomatoes.com/m/happy_gilmore/quotes/'
page = requests.get(url).text
soup = bs4.BeautifulSoup(page, 'lxml')

# CSS selector
selector = 'div span + span'

# find all the span elements which are descendants of a div element
# and immediately follow a sibling span element
quotes = soup.select(selector)

# now filter out the elements with actor names
data = []

for q in quotes:
    # only keep elements that don't have a class or style attribute
    if not (q.has_attr('class') or q.has_attr('style')):
        data.append(q)

for d in data:
    print(d.text)
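
To try the selector and the filter without hitting the network, here is a minimal sketch that runs on inline HTML. The markup, the placeholder quotes, and the helper name extract_quotes are all assumptions made for illustration and are modeled only on the structure described in the question; the real page may differ. It uses the built-in html.parser instead of lxml so nothing beyond bs4 is needed.

```python
import bs4

# Hypothetical markup based on the structure described in the question.
# The actor names and quotes are placeholders, not real page content.
SAMPLE_HTML = """
<div>
  <span class="bold quote_actor">Actor A:</span>
  <span>First sample quote.</span>
  <span class="bold quote_actor">Actor B:</span>
  <span>Second sample quote.</span>
</div>
"""


def extract_quotes(html):
    """Return the text of each span that follows a sibling span
    and has neither a class nor a style attribute."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    quotes = []
    for span in soup.select("div span + span"):
        # The selector also matches "Actor B:" because it follows a span,
        # so drop anything that carries a class or style attribute.
        if not (span.has_attr("class") or span.has_attr("style")):
            quotes.append(span.get_text(strip=True))
    return quotes


quotes = extract_quotes(SAMPLE_HTML)
print(quotes)
```

The result is a plain list of strings, which should be straightforward to hand to something like pandas.Series(quotes) if you want to work with it in Pandas.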

Answer 1
2, accepted, 2017-10-04 12:45:13
