I have written the following Python script, to crawl and scrape headings of Google News search results, within a specific date range. Though the script is working, it's showing the latest search results, and not the ones mentioned in the list.
Eg Rather than showing results from 1 Jul 2015 - 7 Jul 2015, the script is showing results from May 2016 (current month)
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
#get and read the URL
url = ("https://www.google.co.in/search?q=banking&num=100&safe=off&espv=2&biw=1920&bih=921&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F07%2F2015%2Ccd_max%3A07%2F07%2F2015&tbm=nws")
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open(url)
bsObj = BeautifulSoup(html.read(), "html5lib")
#extracts all the links from the given page
itmes = bsObj.findAll("h3")
for item in itmes:
itemA = item.a
theHeading = itemA.text
print(theHeading)
Can someone please guide me to the correct method of getting the desired results, sorted by dates?
Thanks in advance.
I did some tests and it seems the problem is coming from the User-Agent which is not detailed enough. Try replacing this line:
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
with:
opener.addheaders = [('User-agent', "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0"),
It worked for me. Of course this User-Agent is just an example.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.