https://www.bestbuy.com/site/promo/health-fitness-deals
I want to loop through these 10 pages and scrape their names and hrefs Below is my code which only scrapes the 1st page continuously 10 times:
def name():
for i in range(1, 11):
tag = driver.find_elements_by_xpath('/html/body/div[4]/main/div[9]/div/div/div/div/div/div/div[2]/div[2]/div[3]/div/div[5]/ol/li[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div/h4')
for a in tag:
for name in a.find_elements_by_tag_name('a'):
links = name.get_attribute("href")
names = name.get_attribute('text')
watches_name.append(names)
watches_link.append(links)
# print(watches_name)
# print(watches_link)
name()
If you want to get elements from next pages then you have to click()
on link >
driver.find_element_by_css_selector('.sku-list-page-next').click()
Minimal working code with other changes.
I reduced xpath to something much simpler. And I keep name, link as pair because it is simpler to write in file CSV or in database or to filter and sort.
I had to use longer sleep
- sometimes my browser needs more time to update elements on page.
from selenium import webdriver
import time
url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()
items = []
for page in range(1, 11):
print('\n[DEBUG] wait 15 seconds to update page\n')
time.sleep(15)
print('\n--- page', page, '---\n')
all_links = driver.find_elements_by_css_selector('#main-results h4 a')
for a in all_links:
link = a.get_attribute("href")
name = a.get_attribute('text')
items.append( [name, link] )
print(name)
print('\n[DEBUG] click next\n')
driver.find_element_by_css_selector('.sku-list-page-next').click()
#print(items)
BTW:
This method could be done with while True
and some method to recognize if there is link >
- and exit loop when there is no >
. This way it could work with any number of pages.
Other method.
When you manually visit few pages then you should see that second page has url with ?cp=2
, third with ?cp=3
, etc. so you could use it to load pages
driver.get(url + '?cp=' + str(page+1) )
Minimal working code.
from selenium import webdriver
import time
url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()
items = []
for page in range(1, 11):
print('\n[DEBUG] wait 15 seconds to update page\n')
time.sleep(15)
print('\n--- page', page, '---\n')
all_links = driver.find_elements_by_css_selector('#main-results h4 a')
for a in all_links:
link = a.get_attribute("href")
name = a.get_attribute('text')
items.append( [name, link] )
print(name)
print('\n[DEBUG] load next url\n')
driver.get(url + '?cp=' + str(page+1) )
#print(items)
This method could also use while True
and variable page
to get any number of pages.
EDIT:
Versions with while True
from selenium import webdriver
import time
url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()
items = []
page = 1
while True:
print('\n[DEBUG] wait 15 seconds to update page\n')
time.sleep(15)
print('\n--- page', page, '---\n')
all_links = driver.find_elements_by_css_selector('#main-results h4 a')
for a in all_links:
link = a.get_attribute("href")
name = a.get_attribute('text')
items.append( [name, link] )
print(name)
page += 1
print('\n[DEBUG] load next url\n')
driver.get(url + '?cp=' + str(page) )
if driver.title == 'Best Buy: Page Not Found':
print('\n[DEBUG] exit loop\n')
break
#print(items)
and
from selenium import webdriver
import time
url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()
items = []
page = 1
while True:
print('\n[DEBUG] wait 15 seconds to update page\n')
time.sleep(15)
print('\n--- page', page, '---\n')
all_links = driver.find_elements_by_css_selector('#main-results h4 a')
for a in all_links:
link = a.get_attribute("href")
name = a.get_attribute('text')
items.append( [name, link] )
print(name)
page += 1
print('\n[DEBUG] click next\n')
item = driver.find_element_by_css_selector('.sku-list-page-next')
if item.get_attribute("href"):
item.click()
else:
print('\n[DEBUG] exit loop\n')
break
#print(items)
I guess if your code is working right, you will just need to click the pagination button. I found it can be located with the help of css selector ('#Caret_Right_Line_Sm')
. Try adding this line to your function:
def name():
for i in range(1, 11):
tag = driver.find_elements_by_xpath('/html/body/div[4]/main/div[9]/div/div/div/div/div/div/div[2]/div[2]/div[3]/div/div[5]/ol/li[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div/h4')
for a in tag:
for name in a.find_elements_by_tag_name('a'):
links = name.get_attribute("href")
names = name.get_attribute('text')
watches_name.append(names)
watches_link.append(links)
# print(watches_name)
# print(watches_link)
driver.find_elements_by_css_selector('#Caret_Right_Line_Sm')[1].click()
name()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.