Web crawling Google: getting different results

I have written the following Python script to crawl and scrape the headings of Google News search results within a specific date range. The script runs, but it shows the latest search results rather than the ones from the date range specified in the URL.

E.g., rather than showing results from 1 Jul 2015 to 7 Jul 2015, the script shows results from May 2016 (the current month).

import urllib.request
from bs4 import BeautifulSoup

#get and read the URL
url = ("https://www.google.co.in/search?q=banking&num=100&safe=off&espv=2&biw=1920&bih=921&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F07%2F2015%2Ccd_max%3A07%2F07%2F2015&tbm=nws")
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open(url)
bsObj = BeautifulSoup(html.read(), "html5lib")


#extracts the heading text of every result on the given page
items = bsObj.findAll("h3")
for item in items:
    itemA = item.a
    if itemA is not None:  # skip <h3> elements that do not wrap a link
        theHeading = itemA.text
        print(theHeading)

Can someone please guide me to the correct method of getting the desired results, sorted by dates?

Thanks in advance.

I ran some tests, and the problem seems to come from the User-Agent, which is not detailed enough. Try replacing this line:

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

with:

opener.addheaders = [('User-agent', "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0")]

It worked for me. Of course this User-Agent is just an example.
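
To make the fix easier to reuse, here is a minimal sketch that builds the same kind of search URL from Python dates and attaches the fuller User-Agent to the request. The helper name build_news_request is hypothetical, and the DD/MM/YYYY date format is an assumption taken from the URL in the question; the query parameters are a reduced subset of the ones used there.

```python
from datetime import date
from urllib.parse import urlencode
from urllib.request import Request

# Example User-Agent copied from the answer above; it is only an example.
USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:36.0) "
              "Gecko/20100101 Firefox/36.0")


def build_news_request(query, start, end):
    """Return a Request for a Google News search limited to [start, end].

    Hypothetical helper, not part of any library.
    """
    # Assumption: dates are written DD/MM/YYYY, mirroring the question's URL.
    fmt = "%d/%m/%Y"
    tbs = "cdr:1,cd_min:{},cd_max:{}".format(start.strftime(fmt),
                                             end.strftime(fmt))
    params = {"q": query, "num": 100, "tbs": tbs, "tbm": "nws"}
    # urlencode percent-encodes ':', ',' and '/' in the tbs value.
    url = "https://www.google.co.in/search?" + urlencode(params)
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_news_request("banking", date(2015, 7, 1), date(2015, 7, 7))
print(req.full_url)
```

The resulting req can then be passed to urllib.request.urlopen and the response parsed with BeautifulSoup as in the question. Whether Google honours the date filter may still depend on the User-Agent it receives.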


 