
Selenium Python, parsing through website, opening a new tab, and scraping

I'm new to Python and Selenium. I'm trying to do something, and I'm sure I'm going about it in a very roundabout way, so any help is greatly appreciated.

The page I'm trying to parse has several cards that need to be clicked. I need to go to each card and grab its name (the h1) and its URL. I haven't gotten very far; this is what I have so far.

I go through the first page, grab all the URLs, and add them to a list. Then I want to go through the list, open each URL in a new tab, and grab the h1 and the URL from there. I don't seem to be able to grab the h1 at all: it opens a new tab, hangs, and then opens the same tab again.

Thank you in advance!

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise//')  # main URL
title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = []  # list of URLs
# main = driver.find_elements_by_id('enterprise-list')
for item in title_links:
    urls.append(item.get_attribute('href'))
# print(urls)
for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))

Well, there are a few issues here:

  • Be much more specific with the selector you use to grab the URLs. The current one returns multiple copies of the same URL, which is why the same pages are opened again.
  • Give the site enough time to load before trying to grab objects. That may be why it is timing out, and waiting first is always the safe choice.
  • You have to shift focus back to the original page to continue iterating over the list.
  • You don't need to inject JS to open a blank tab and then make a separate Python call to load the URL. The URL can be passed to window.open directly, which also keeps the JS cleaner.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')  # main URL

# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')

for url in urls:
    # get href to print
    print(url.get_attribute('href'))
    # Inject JS to open the link's URL in a new tab
    driver.execute_script("window.open(arguments[0])", url.get_attribute('href'))
    # Switch focus to the newly opened tab (typically the last handle)
    driver.switch_to.window(driver.window_handles[-1])
    # Make sure what we want has time to load and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print its contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to do one tab at a time. This is slower but uses less RAM.
    #driver.close()
    # Focus back on first window
    driver.switch_to.window(driver.window_handles[0])
# Close all windows and end the session
driver.quit()
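
The duplicate-URL problem from the first point can also be handled in plain Python, whatever selector is used. The following is only a minimal sketch: `unique_hrefs` is a hypothetical helper name (not part of Selenium), and the sample values are made up to stand in for the results of `get_attribute('href')`.

```python
def unique_hrefs(hrefs):
    """Drop empty values and duplicates while keeping first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so the order of the
    # cards on the page is preserved.
    return list(dict.fromkeys(h for h in hrefs if h))


# Made-up values standing in for get_attribute('href') results;
# None represents a link that has no href attribute.
sample = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/a',
    None,
]
print(unique_hrefs(sample))
```

With a de-duplicated list of plain strings in hand, each URL is visited once, whether it is opened in a new tab as above or loaded in the same tab with driver.get(url).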

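Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.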
