
Selenium Python, parsing through website, opening a new tab, and scraping

I'm new to Python and Selenium. I'm trying to do something, and I'm sure I'm going about it in a very round-about way; any help is greatly appreciated.

The page I'm trying to parse has different cards that need to be clicked on; I need to go to each card and, from there, grab the name (h1) and the URL. I haven't gotten very far, and this is what I have so far.

I go through the first page, grab all the URLs, and add them to a list. Then I want to go through the list, open each URL in a new tab, and grab the h1 and URL from there. It doesn't seem like I'm even able to grab the h1; it opens a new tab, hangs, then opens the same tab again.

Thank you in advance!

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise//')  # main URL

title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = []  # list of URLs
# main = driver.find_elements_by_id('enterprise-list')
for item in title_links:
    urls.append(item.get_attribute('href'))
# print(urls)

for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))

Well, there are a few issues here:

  • You should be much more specific with your selector for grabbing URLs. The current one returns multiple copies of the same URL, which is why it keeps opening the same pages.
  • You should give the site enough time to load before trying to grab elements. That may be why it hangs, and it's always good to wait explicitly before grabbing elements anyway.
  • You have to shift focus back to the original tab to continue iterating over the list.
  • You don't need to inject JS to open a blank tab and then make a separate Python call to load the URL; you can pass the link straight to window.open, which is also cleaner.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')  # main URL

# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')

for url in urls:
    # get href to print
    print(url.get_attribute('href'))
    # Open the link in a new tab (window.open stringifies the anchor element to its href)
    driver.execute_script("window.open(arguments[0])", url)
    # Switch focus to new tab
    driver.switch_to.window(driver.window_handles[1])
    # Make sure what we want has time to load and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print its contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to close each tab as you go. Slower, but uses less RAM.
    #driver.close()
    # Focus back on first window
    driver.switch_to.window(driver.window_handles[0])
# Close all windows and end the session
driver.quit()
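
If you don't actually need the extra tabs, a minimal alternative sketch is to collect the href strings up front and visit each page in the same tab. This assumes the same ul.n4 li div.img a and div.info h1 structure used above still matches the site; adjust the selectors if the markup has changed.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')

# Collect plain href strings first, so navigating away can't invalidate the elements
links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')]

for link in links:
    driver.get(link)
    # Wait for the heading to be visible before reading it
    h1 = WebDriverWait(driver, 20).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    print(link, h1.text)

driver.quit()

Because only strings are kept between page loads, there is no stale-element risk and no window-handle bookkeeping; the trade-off is that you lose the original listing page and have to re-load it if you need it again.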
