
Unable to scrape many questions from a Quora webpage

I am learning BeautifulSoup and trying to scrape the links to the different questions on this Quora page.

As I scroll down the page, more questions keep loading and being displayed.

When I try to scrape the links to these questions using the code below, I only get 5 links in my case, even though there are a lot of questions on the page.

Is there any workaround to get the links to as many of the questions on the page as possible?

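from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

root = 'https://www.quora.com/topic/Graduate-Record-Examination-GRE-1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.'}
r = requests.get(root, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

# container that holds the list of questions
q = soup.find('div', {'class': 'paged_list_wrapper'})
no = 0  # number of links found
for i in q.find_all('div', {'class': 'story_title_container'}):
    t = i.a['href']  # href is expected to be site-relative
    no = no + 1
    # resolve the href against the site instead of appending it to the topic URL
    print(urljoin(root, t), '\n\n')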

This cannot be done with Requests and BeautifulSoup alone: the page loads more questions through JavaScript as you scroll, and Requests only fetches the initial HTML. You need a browser automation tool such as Selenium. Here is an answer using Selenium and chromedriver. Download the chromedriver that matches your Chrome version and install Selenium with pip install -U selenium

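import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Selenium 4 style: the chromedriver path is passed through a Service object
browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
browser.get("https://www.quora.com/topic/Graduate-Record-Examination-GRE-1")
time.sleep(1)  # give the initial page a moment to load

elem = browser.find_element(By.TAG_NAME, "body")
no_of_pagedowns = 5
while no_of_pagedowns:
    # each PAGE_DOWN scrolls the page so that more questions get loaded
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1

# collect the links of all questions loaded so far
post_elems = browser.find_elements(By.XPATH, "//a[@class='question_link']")
for post in post_elems:
    print(post.get_attribute("href"))

browser.quit()

If you are using Windows, point the path at the .exe instead: Service('/path/to/chromedriver.exe')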

Change the variable no_of_pagedowns = 5 to set how many times the page is scrolled down.

I got the following output:

[image: screenshot of the output]
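
Instead of a fixed number of page-downs, one option is to keep scrolling until the page height stops growing. The following is only a sketch under assumptions: scroll_until_stable, pause and max_rounds are hypothetical names, the page is assumed to grow in height whenever more questions load, and driver is any object with Selenium's execute_script method (for example the browser object from a Selenium script).

```python
import time


def scroll_until_stable(driver, pause=0.5, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls after which the page had grown.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the newly requested questions to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added; assume the end of the list
        last_height = new_height
        loaded += 1
    return loaded
```

Calling scroll_until_stable(browser) before collecting the links would take the place of the page-down loop. max_rounds caps the loop so that a page that never stops growing cannot keep it running forever, and pause may need to be larger on a slow connection.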

This grabs the title from the page and prints it after formatting. It is just one way to do it (I'm sure there are many others), and it only handles a single question.

import requests
from bs4 import BeautifulSoup

URL = "https://www.quora.com/Which-Deep-Learning-online-course-is-better-Coursera-specialization-VS-Udacity-Nanodegree-vs-FAST-ai"

response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')

# grab the text of the <title> tag
question = soup.select_one('title').text

# drop the trailing " - Quora" (8 characters)
print(question[:-8])
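
Cutting a fixed 8 characters assumes every title ends in " - Quora". A slightly more defensive sketch only removes the suffix when it is actually there. strip_site_suffix is a hypothetical helper, and the sample title is made up for illustration.

```python
def strip_site_suffix(title, suffix=" - Quora"):
    """Return `title` without `suffix` if it ends with it, else unchanged."""
    if suffix and title.endswith(suffix):
        return title[:-len(suffix)]
    return title


# made-up title, used only to show the behaviour
print(strip_site_suffix("What is the GRE? - Quora"))  # prints: What is the GRE?
```

With this helper, a title that has no suffix comes back untouched rather than losing its last 8 characters.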

