
soup.findAll() returns null for div class attribute in BeautifulSoup

I have been working on this problem for the last 10 hours and I am still unable to solve it. The code works for some people, but it is not working for me.

The main purpose is to extract the Google result URLs from all pages for https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0

And here is my code:

# -*- coding: utf-8
from bs4 import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0".format (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/43.0.1'})
    urlfile = urllib2.urlopen(request)
    html = urlfile.read()
    soup = BeautifulSoup(html)
    linkdictionary = {}

    for li in soup.findAll('div', attrs={'class' : 'g'}): # Execution never enters this loop because findAll returns nothing

        sLink = li.find('.r a')
        print sLink['href']

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')
    print links

I am getting {} as a result. The call soup.findAll('div', attrs={'class' : 'g'}) returns nothing, so I am unable to scrape any results.

I am using BS4 and Python 2.7. Please help me understand why the code is not working properly. Any help would be much appreciated.

Further, it would be great if someone could explain why the same code works for some people and not for others (this happened to me last time as well). Thanks.
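One likely reason the loop never runs: everything after the # in a URL is the fragment, which a browser keeps to itself and does not send to the server. In the address above the search terms sit in the fragment, so urllib2 most probably receives only the empty search page and there are no result divs to find. The sketch below shows the split using the standard library (Python 3 names; on Python 2.7 the same functions live in urlparse and urllib). build_search_url is a hypothetical helper, and the /search path with these parameters is an assumption about what Google will serve to a script, not something confirmed in this thread.

```python
from urllib.parse import urlsplit, urlencode

address = ("https://www.google.com.au/webhp?num=100&gl=au&hl=en"
           "#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0")

parts = urlsplit(address)
print(parts.query)     # the part that is sent to the server
print(parts.fragment)  # the part that stays in the browser


def build_search_url(query, start=0, num=100):
    """Hypothetical helper: move the search terms into the query string."""
    params = [("q", query), ("num", num), ("gl", "au"), ("hl", "en"), ("start", start)]
    return "https://www.google.com.au/search?" + urlencode(params)


print(build_search_url("site:focusonfurniture.com.au"))
```

Even with the terms in the query string, the returned markup may not use the same div classes, and automated requests may be refused, so treat this as a way to inspect what the server actually returns rather than a guaranteed fix.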

Here is an example of what you can do. You need Selenium and PhantomJS, which simulates a browser and therefore runs the JavaScript that loads the results.

import re

import selenium.webdriver

url = 'https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0'

# PhantomJS is a headless browser, so page_source holds the rendered page
driver = selenium.webdriver.PhantomJS()
driver.get(url)
html = driver.page_source

# Result URLs appear inside <cite> tags; this pattern accepts only letters and slashes in the path
regex = r"<cite>(https://www\.focusonfurniture\.com\.au/[/A-Z]+)</cite>"

for link in re.findall(regex, html, re.IGNORECASE):
    print(link)

driver.quit()

Result:

https://www.focusonfurniture.com.au/delivery/
https://www.focusonfurniture.com.au/terms/
https://www.focusonfurniture.com.au/disclaimer/
https://www.focusonfurniture.com.au/dining/
https://www.focusonfurniture.com.au/bedroom/
https://www.focusonfurniture.com.au/catalogue/
https://www.focusonfurniture.com.au/mattresses/
https://www.focusonfurniture.com.au/clearance/
https://www.focusonfurniture.com.au/careers/
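Note that the character class [/A-Z]+ in the pattern above accepts only letters and slashes, so any result whose path contains digits, hyphens or dots would be skipped. A minimal sketch of a looser pattern follows. The sample_html string is made up for illustration (it stands in for driver.page_source, and the sofas-2-seater path is invented), and extract_links is a hypothetical helper, not part of the original answer.

```python
import re

# Made-up sample standing in for driver.page_source; real Google markup may differ.
sample_html = (
    "<cite>https://www.focusonfurniture.com.au/delivery/</cite>"
    "<cite>https://www.focusonfurniture.com.au/sofas-2-seater/</cite>"
    "<cite>https://www.focusonfurniture.com.au/delivery/</cite>"
    "<cite>https://www.example.com/other/</cite>"
)

# [^<\s]* accepts any path characters up to the closing tag.
CITE_RE = re.compile(
    r"<cite>(https?://www\.focusonfurniture\.com\.au/[^<\s]*)</cite>",
    re.IGNORECASE,
)


def extract_links(html):
    """Return the matching URLs in page order, without duplicates."""
    seen = set()
    links = []
    for link in CITE_RE.findall(html):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


for link in extract_links(sample_html):
    print(link)
```

This only widens what the regular expression accepts. It still assumes that each result URL is shown in full inside a cite tag, which depends on how Google renders the page.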


 