Beautiful Soup - selecting text of next span element with no class
I am attempting to use Beautiful Soup to scrape movie quotes from rottentomatoes.com. The page source is interesting in that each quote is directly preceded by a span with class "bold quote_actor", but the quote itself is in a span with no class, e.g. ( https://www.rottentomatoes.com/m/happy_gilmore/quotes/ ): [screenshot of page source]
I would like to use Beautiful Soup's find_all to capture all quotes, without the actors' names. I have tried many things with no success, such as:
moviequotes = soup(input)
for t in web_soup.findAll('span', {'class':'bold quote_actor'}):
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'class' in item.attrs and 'name' in item.attrs['class']:
                break
            print (item)
I would greatly appreciate any tips on how to navigate this markup and store the resulting plain-text quotes in an object I can use with Pandas, etc.
I'm using a CSS selector to find the spans which contain quotes: `div span + span`. This matches any `span` element that is inside a `div` and is immediately preceded by a sibling `span` element.
This way I also get the `span`s that contain actor names, so I filter them out by checking whether they have a `class` or `style` attribute.
import bs4
import requests
url = 'https://www.rottentomatoes.com/m/happy_gilmore/quotes/'
page = requests.get(url).text
soup = bs4.BeautifulSoup(page, 'lxml')
# CSS selector
selector = 'div span + span'
# find all the span elements which are a descendant of a div element
# and are a direct sibling of another span element
quotes = soup.select(selector)
# now filter out the elements with actor names
data = []
for q in quotes:
    # only keep elements that don't have a class or style attribute
    if not (q.has_attr('class') or q.has_attr('style')):
        data.append(q)
for d in data:
    print(d.text)
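As an alternative to filtering after the fact, one could start from each actor-name span and step to the span that follows it, which is closer to what the question title describes. The sketch below is not part of the original answer. It runs on a small hypothetical HTML snippet modelled on the description in the question (the real page markup may differ), uses the built-in `html.parser` so it needs no network access or `lxml`, and `extract_quotes` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the question's description: an actor span
# with class "bold quote_actor" followed by a class-less quote span.
# The names and quotes are placeholders, not real page content.
SAMPLE_HTML = """
<div>
  <div>
    <span class="bold quote_actor">Actor A:</span>
    <span>First sample quote.</span>
  </div>
  <div>
    <span class="bold quote_actor">Actor B:</span>
    <span>Second sample quote.</span>
  </div>
</div>
"""


def extract_quotes(html):
    """Return the text of the span that follows each actor-name span."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    # class_ matches any tag that has "quote_actor" among its classes
    for actor in soup.find_all("span", class_="quote_actor"):
        # find_next_sibling skips whitespace text nodes and returns the
        # next sibling <span> tag, or None if there is none
        quote = actor.find_next_sibling("span")
        if quote is not None:
            quotes.append(quote.get_text(strip=True))
    return quotes


quotes = extract_quotes(SAMPLE_HTML)
for q in quotes:
    print(q)
```

Because the result is a plain list of strings, it should be usable with Pandas directly, for example via `pandas.Series(quotes)` or `pandas.DataFrame({'quote': quotes})`.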