
Parsing Google News using Beautiful Soup (Python)

I have the Python code below. It fetches a Google News search page and prints the hyperlink and title for each news item. My problem is that Google News groups similar stories into a bucket, and my script only prints the first story of each bucket. How can I print all the stories from all the buckets?

from bs4 import BeautifulSoup
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

#r = requests.get('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts', headers=headers)
r = requests.get('https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d', headers=headers)
r = requests.get('https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y', headers=headers)

soup = BeautifulSoup(r.text, "html.parser")

letters = soup.find_all("div", class_="_cnc")
#print soup.prettify() 
#print letters
print type(letters)
print len(letters)
print("\n")

for x in range(0, len(letters)):
    print x
    print letters[x].a["href"]


print("\n")

letters2 = soup.find_all("a", class_="l _HId")
for x in range(0, len(letters2)):
    print x
    print letters2[x].get_text()

print ("\n----------content")
#print letters[0]

By grouped news I mean the first few stories bundled together in the image below. The story "LeBron James compares one of his teammates to Dan…" is part of another group.

[screenshot: Google News results, with the first several stories grouped together]

I'm not sure what you mean by "buckets". If you mean that you want to parse multiple result pages, then I can tell you that you are overwriting r by issuing a second requests.get().
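
Condensed, that is what happens in the question's code (url_bledsoe and url_james here are just shorthand for the two query URLs):

r = requests.get(url_bledsoe, headers=headers)  #this response is discarded...
r = requests.get(url_james, headers=headers)    #...as soon as r is rebound here
soup = BeautifulSoup(r.text, "html.parser")     #only the second page is ever parsed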

Here is a loop that handles every URL in a urls array.

import bs4
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}


urls = ["https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d",
        "https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y"]

ahrefs = []
titles = []

for url in urls:
    req = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(req.text, "html.parser")

    #you don't even have to process the div container;
    #just go straight to <a> and use indexing to get "href"
    #headlines
    ahref  = [a["href"] for a in soup.find_all("a", class_="_HId")]
    #"buckets"
    ahref += [a["href"] for a in soup.find_all("a", class_="_sQb")]
    ahrefs.append(ahref)

    #get_text() returns the text inside the hyperlink,
    #i.e. the title you want
    title =  [a.get_text() for a in soup.find_all("a", class_="_HId")]
    title += [a.get_text() for a in soup.find_all("a", class_="_sQb")]
    titles.append(title)

#print(ahrefs)
#print(titles)

When I ran the "lebron james" query on Google, 18 results came up, including the grouped stories, and len(ahrefs[1]) == 18.
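
If you want each headline next to its link, here is a minimal sketch using the ahrefs and titles lists built above (both are filled in the same order, so they can be zipped):

for query_links, query_titles in zip(ahrefs, titles):
    for href, title in zip(query_links, query_titles):
        print(title, "->", href)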

In a completely new twist, I decided to tackle this more efficiently, so that you only need to append queries to search for new players. I'm not sure what your end result should be, but this returns a list of dictionaries.

import bs4
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}


#just add to this list for each new player
#player name : url
queries = {"bledsoe":"https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d",
           "james":"https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y"}


total = []

for player in queries: #keys

    #request the google query url of each player
    req  = requests.get(queries[player], headers=headers)
    soup = bs4.BeautifulSoup(req.text, "html.parser")

    #look for the main container
    for each in soup.find_all("div"):
        results = {player: {
            "link": None,
            "title": None,
            "source": None,
            "time": None}
        }

        try:
            #if a <div> doesn't have a class attribute,
            #attrs["class"] throws a KeyError; just ignore those

            if "_cnc" in each.attrs["class"]: #main stories
                results[player]["link"] = each.find("a")["href"]
                results[player]["title"] = each.find("a").get_text()
                sourceAndTime = each.contents[1].get_text().split("-")
                results[player]["source"], results[player]["time"] = sourceAndTime
                total.append(results)

            elif "card-section" in each.attrs["class"]: #buckets
                results[player]["link"] = each.find("a")["href"]
                results[player]["title"] = each.find("a").get_text()
                results[player]["source"] = each.contents[1].contents[0].get_text()
                results[player]["time"] = each.contents[1].get_text()
                total.append(results)

        except KeyError:
            pass
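
Once the loop finishes, total is a list of single-key dictionaries ({player: {link, title, source, time}}); a minimal sketch of how to walk it:

for entry in total:
    for player, info in entry.items():
        print(player, "|", info["title"], "|", info["source"], "|", info["time"])
        print(info["link"])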
