A general way to scrape link titles from any site in Python?

Is there a "general" way to scrape link titles from any website in Python? For example, if I use the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://news.google.com"
html = urlopen(site)
soup = BeautifulSoup(html.read(), 'lxml')

# Select the <span class="titletext"> elements that hold the headlines
titles = soup.find_all('span', attrs={'class': 'titletext'})
for title in titles:
    print(title.contents)

I am able to extract nearly every headline title from news.google.com. However, the same code does not work on www.yahoo.com, because that site's HTML is structured differently.

Is there a more general way to do this so that it works for most sites?

No. Each site is different, and the more general you make a scraper, the more data it will pick up that is not specifically a headline title.

For instance, the following would get every headline title from Google, and would probably get them from Yahoo as well.

titles = soup.find_all('a') 
for title in titles:
    print(title.get_text())

However, it would also return every header link and other link on the page, which would muddy your results. (There are approximately 150 links on that Google page that aren't headlines.)
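To make the trade-off concrete, here is a minimal sketch that runs the generic "every `<a>` tag" approach on a small inline page. The HTML fragment, the helper names `link_titles` and `likely_headlines`, and the word-count heuristic are all assumptions made for illustration; they are not taken from Google's or Yahoo's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, not real markup from either site
SAMPLE_HTML = """
<nav><a href="/">Home</a> <a href="/login">Sign in</a></nav>
<main>
  <a href="/story/1">Council approves new transit budget after long debate</a>
  <a href="/story/2">Local team wins regional championship in overtime</a>
</main>
<footer><a href="/about">About</a></footer>
"""

def link_titles(html):
    """Return the stripped text of every <a> element that has any text."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (a.get_text(strip=True) for a in soup.find_all("a"))
    return [t for t in texts if t]

def likely_headlines(titles, min_words=4):
    """Crude heuristic (an assumption): headlines are longer than menu labels."""
    return [t for t in titles if len(t.split()) >= min_words]

all_titles = link_titles(SAMPLE_HTML)
print(all_titles)                    # navigation and footer links are included
print(likely_headlines(all_titles))  # only the longer link texts remain
```

A length filter like this is only a rough guess at what counts as a headline. It will drop short headlines and keep long navigation labels, which is why site-specific selectors remain more reliable.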

No, and that is why we need CSS selectors and XPath. But if you only have a small number of sites, there is a convenient way to handle it:

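site = "https://news.google.com"
# Pick the find_all() arguments that match the site's markup
if 'google' in site:
    filters = {'name': 'span', 'class': 'titletext'}
elif 'yahoo' in site:
    # 'blala' is a placeholder for the site's real tag name and class
    filters = {'name': 'blala', 'class': 'blala'}
else:
    raise ValueError('no filters defined for ' + site)
titles = soup.find_all(**filters)
for title in titles:
    print(title.contents)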
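The same idea can be written as a lookup table keyed by hostname, using CSS selectors through Beautiful Soup's `select()`. This is a sketch under assumptions: the hostnames, the selectors, the sample HTML, and the `extract_titles` helper are hypothetical, and you would need to inspect each real site to find selectors that work.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Hypothetical hostnames and selectors; inspect each site to find the real ones
SELECTORS = {
    "news.example.com": "span.titletext",
    "portal.example.org": "h3.headline a",
}

def extract_titles(url, html):
    """Return headline texts using the CSS selector configured for the URL's host."""
    host = urlparse(url).hostname
    selector = SELECTORS.get(host)
    if selector is None:
        raise KeyError(f"no selector configured for {host!r}")
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]

# Inline sample pages stand in for downloaded HTML
news_html = '<a href="/1"><span class="titletext">First story</span></a><a href="/x">Menu</a>'
portal_html = '<h3 class="headline"><a href="/2">Second story</a></h3><a href="/y">Login</a>'

news_titles = extract_titles("https://news.example.com/", news_html)
portal_titles = extract_titles("https://portal.example.org/front", portal_html)
print(news_titles)
print(portal_titles)
```

Keeping the selectors in one dictionary means that supporting another site is a one-line change, and an unknown site fails loudly instead of silently returning nothing.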
