I am trying to scrape all project titles and creator names, and most of it works, but I get a "TimeoutException: Message:" when scraping infinite-scroll pages that have a "load more" button. Please let me know what is wrong and what I need to correct. Thanks.
Below is the code currently being used:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("https://www.kickstarter.com/discover/advanced?sort=newest&seed=2695789&page=1/")
button = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,'bttn keyboard-focusable bttn-medium bttn-primary theme--create fill-bttn-icon hover-fill-bttn-icon')))
button.click()
names=[]
creators=[]
soup = BeautifulSoup(driver.page_source)
for a in soup.findAll('div',{'class':'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    name=a.find('div', attrs={'class':'clamp-5 navy-500 mb3 hover-target'})
    creator=a.find('div', attrs={'class':'type-13 flex'})
    names.append(name.h3.text)
    creators.append(creator.text)
df = pd.DataFrame({'Name':names,'Creator':creators})
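A likely cause of the TimeoutException in the code above: the string passed with By.CSS_SELECTOR is a space-separated list of class names. CSS reads that as a chain of descendant tag names (bttn, keyboard-focusable, and so on), not as classes, so nothing matches and the wait runs out. A minimal sketch of the fix, assuming the button really carries those classes; classes_to_css is a hypothetical helper introduced here, not part of Selenium:

```python
def classes_to_css(class_attr: str) -> str:
    """Turn an HTML class attribute value into a compound CSS class selector."""
    # "a b c" -> ".a.b.c" (every class must be present on the same element)
    return "".join("." + name for name in class_attr.split())


selector = classes_to_css("bttn keyboard-focusable bttn-medium bttn-primary")
print(selector)

# Intended use with Selenium (not executed here):
# button = WebDriverWait(driver, 10).until(
#     EC.element_to_be_clickable((By.CSS_SELECTOR, selector)))
# button.click()
```

Using fewer, more stable class names in the selector is usually safer than copying the whole class attribute, since any one of them may change.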
You don't really need Beautiful Soup or Selenium for this. Use the requests library instead; it makes it easy to grab everything hassle-free.
import requests

# Ask the listing endpoint for JSON instead of HTML, one page per request.
for page in range(5):
    req = requests.get(
        'https://www.kickstarter.com/discover/advanced?google_chrome_workaround&woe_id=0&sort=newest&seed=2695910&page=' + str(page),
        headers={'Accept': 'application/json',
                 'Content-Type': 'application/json'})
    if req.status_code == 200:
        # Fall back to an empty list if the response has no "projects" key.
        projects = req.json().get("projects") or []
        for project in projects:
            print("Project Name - " + project['name'], end=' Created By - ')
            print(project['creator'].get('name'))
            print("----------------")
Output:
Each page request corresponds to one click of the "load more" button. Set the count in the for loop to however many times the button would load new content, and you will get all of it.
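Instead of guessing the page count, you could keep requesting pages until one comes back with no projects. A rough sketch, assuming the JSON shape used in the code above (a "projects" list whose items have "name" and a "creator" with "name"); collect_projects and its fetch_page parameter are hypothetical names introduced here, and fake_fetch stands in for the real HTTP call so the example runs offline:

```python
from typing import Callable, Dict, List, Optional


def collect_projects(fetch_page: Callable[[int], Optional[dict]],
                     max_pages: int = 50) -> List[Dict[str, str]]:
    """Call fetch_page(1), fetch_page(2), ... until a page has no projects."""
    rows: List[Dict[str, str]] = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        projects = (payload or {}).get("projects") or []
        if not projects:
            break  # nothing more to load
        for project in projects:
            rows.append({
                "Name": project.get("name", ""),
                "Creator": (project.get("creator") or {}).get("name", ""),
            })
    return rows


# Stand-in for the real request (assumption: made-up sample data).
def fake_fetch(page: int) -> Optional[dict]:
    data = {
        1: {"projects": [{"name": "A", "creator": {"name": "Ann"}}]},
        2: {"projects": [{"name": "B", "creator": {"name": "Bob"}}]},
    }
    return data.get(page)


rows = collect_projects(fake_fetch)
print(rows)
```

In real use, fetch_page would wrap the requests.get call and return req.json() on a 200 status (None otherwise). The resulting rows can be passed to pd.DataFrame(rows) to get the Name and Creator columns the question asked for.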