
Python - scraping news articles on a daily basis from sites that do not have any feed

I can use Python's Beautiful Soup module to extract news items from a site's feed URL. But suppose the site has no feed and I need to extract news articles from it on a daily basis, as if it had a feed.

The site https://www.jugantor.com/ has no feed. Even by googling, I did not find one. With the following code snippet, I tried to extract the links from the site. The result shows links such as 'http://epaper.jugantor.com', but the news items appearing on the site are not included in the extracted links.

My Code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re


def getLinks(url):

    # Fetch the page, sending a browser-like User-Agent header
    USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
    request = Request(url)
    request.add_header('User-Agent', USER_AGENT)
    response = urlopen(request)
    content = response.read().decode('utf-8')
    response.close()

    # Collect the href of every <a> tag whose URL starts with http://
    soup = BeautifulSoup(content, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://www.jugantor.com/"))

Obviously this does not serve the intended purpose. I need all the news article links from 'https://www.jugantor.com/' on a daily basis, as if I were acquiring them from a feed. I can use a cron job to run a script daily, but the challenge remains in identifying all the articles published on a particular day and then extracting them.

How can I do that? Any Python module or algorithm, etc.?

NB: A somewhat similar question exists here, but it does not mention a feed as the parsing source. It seems the OP there is concerned with extracting articles from a page that lists them as a textual snapshot. Unlike that question, mine focuses on sites that do not have any feed, and the only answer there does not address this issue.

I'm not sure I understand correctly, but the first thing I saw is {'href': re.compile("^http://")} .

You will miss all https and relative links. Relative links could probably be skipped here without any problems (I guess...), but clearly not the https ones. So, first thing:

{'href': re.compile("^https?://")}
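
Applied to the function above, the loop becomes something like this (a minimal sketch; the commented-out variant, which needs from urllib.parse import urljoin, is only relevant if you ever want to keep relative links too):

    for link in soup.findAll('a', attrs={'href': re.compile("^https?://")}):
        links.append(link.get('href'))

    # If relative links ever matter, resolve them against the page URL
    # instead of filtering them out:
    # for link in soup.findAll('a', href=True):
    #     links.append(urljoin(url, link['href']))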

Then, to avoid downloading and parsing the same URL each day, you could extract the id of the article (in https://www.jugantor.com/lifestyle/19519/%E0%...A7%87 the id is 19519), save it in a database, and verify first whether the id already exists before scraping the page.
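
For illustration, here is a sketch of that idea. The regex assumes the /<section>/<id>/<slug> URL pattern visible in the example above, and a plain text file stands in for the database:

import re

# Assumed URL pattern: https://www.jugantor.com/<section>/<id>/<slug>
ID_RE = re.compile(r"jugantor\.com/[^/]+/(\d+)")

def article_id(url):
    # Return the numeric article id, or None for non-article links
    m = ID_RE.search(url)
    return m.group(1) if m else None

def load_seen(path="seen_ids.txt"):
    # Ids of articles scraped on previous runs, one per line
    try:
        with open(path) as f:
            return set(line.strip() for line in f)
    except FileNotFoundError:
        return set()

def mark_seen(aid, path="seen_ids.txt"):
    # Record an id so the article is not scraped again
    with open(path, "a") as f:
        f.write(aid + "\n")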

Last thing: I'm not sure this will be useful, but this URL, https://www.jugantor.com/todays-paper/, makes me think you should be able to find only today's news there.
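
Putting the pieces together, a hypothetical daily job (reusing getLinks and the helper functions sketched above) could look like this; a crontab entry such as 0 6 * * * python scrape_jugantor.py would then run it every morning:

def daily_scrape():
    seen = load_seen()
    for url in getLinks("https://www.jugantor.com/todays-paper/"):
        aid = article_id(url)
        if aid and aid not in seen:
            print("new article:", url)  # or fetch and store the article here
            mark_seen(aid)

if __name__ == "__main__":
    daily_scrape()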
