
Python - scraping news articles on a daily basis from sites that do not have any feed

I can use the Python Beautiful Soup module to extract news items from a site's feed URL. But suppose the site has no feed and I need to extract news articles from it on a daily basis, as if it had a feed.

The site https://www.jugantor.com/ has no feed. Even by googling, I did not find one. With the following code snippet, I tried to extract the links from the site. The result shows links such as 'http://epaper.jugantor.com', but the news items appearing on the site are not included in the extracted links.

My Code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re


def getLinks(url):
    # Fetch the page with a browser-like User-Agent so the request
    # is not rejected as coming from a bot.
    USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
    request = Request(url)
    request.add_header('User-Agent', USER_AGENT)
    response = urlopen(request)
    content = response.read().decode('utf-8')
    response.close()

    # Collect the href of every anchor whose URL starts with http://
    soup = BeautifulSoup(content, "html.parser")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://www.jugantor.com/"))

Obviously this does not serve the intended purpose. I need all the news article links from 'https://www.jugantor.com/' on a daily basis, as if I were acquiring them from a feed. I can use a cron job to run a script daily, but the challenge remains in identifying all the articles published on a particular day and then extracting them.

How can I do that? Is there any Python module, algorithm, etc. that can help?

NB: A somewhat similar question exists here, but it does not mention a feed as the parsing source. It seems the OP there is concerned with extracting articles from a page that lists them as a textual snapshot. Unlike that question, mine focuses on sites that do not have any feed, and the only answer there does not address this issue.

I'm not sure I understand correctly, but the first thing I noticed is {'href': re.compile("^http://")}.

You will miss all https and relative links. Relative links could probably be skipped here without any problem (I guess...), but clearly not the https ones. So, first thing:

{'href': re.compile("^https?://")}
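
A minimal sketch of the question's function with the corrected pattern; the urljoin() step additionally resolves relative hrefs into absolute URLs, in case you decide you want them after all (drop it to keep the skip-relatives behaviour):

import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def getLinks(url):
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = urlopen(request).read().decode('utf-8')
    soup = BeautifulSoup(content, "html.parser")
    links = set()
    for a in soup.findAll('a', href=True):
        href = urljoin(url, a['href'])       # make relative links absolute
        if re.match(r'^https?://', href):    # accept both http and https
            links.add(href)
    return sorted(links)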

Then, to avoid downloading and parsing the same URLs every day, you could extract the id of each article (in https://www.jugantor.com/lifestyle/19519/%E0%...A7%87 the id is 19519), save it in a database, and check whether the id already exists before scraping the page.
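
A sketch of that check with the standard-library sqlite3 module, assuming article URLs follow the /<section>/<numeric-id>/<slug> pattern visible above; the database file name, table name and helper name are my own choices:

import re
import sqlite3

# Assumed URL shape: https://www.jugantor.com/<section>/<id>/<slug>
ID_PATTERN = re.compile(r'^https?://www\.jugantor\.com/[^/]+/(\d+)/')

conn = sqlite3.connect('seen_articles.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (id INTEGER PRIMARY KEY)')

def is_new_article(url):
    """Return the article id if the URL has not been seen before, else None."""
    match = ID_PATTERN.match(url)
    if not match:
        return None                  # not an article link
    article_id = int(match.group(1))
    cursor = conn.execute('SELECT 1 FROM seen WHERE id = ?', (article_id,))
    if cursor.fetchone():
        return None                  # already scraped on an earlier run
    conn.execute('INSERT INTO seen (id) VALUES (?)', (article_id,))
    conn.commit()
    return article_id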

Last thing: I'm not sure this will be useful, but the URL https://www.jugantor.com/todays-paper/ makes me think you should be able to find only today's news there.
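
Putting the pieces together, a script run daily from cron could then be as simple as the following, where getLinks() and is_new_article() are the sketches above:

# Daily job (run from cron): fetch today's links, keep only unseen ids,
# and hand each new URL to whatever parses the article body.
for url in getLinks("https://www.jugantor.com/todays-paper/"):
    if is_new_article(url):
        print("new article:", url)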
