
Unable to get all links from multiple pages with an unchanging URL

I want to get all links from 10 pages, but I am unable to click the second-page link. The URL is https://10times.com/search?cx=partner-pub-8525015516580200%3Avtujn0s4zis&cof=FORid%3A10&ie=ISO-8859-1&q=%22Private+Equity%22&searchtype=All

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import bs4

from selenium import webdriver
import time

url = "https://10times.com/search?cx=partner-pub-8525015516580200%3Avtujn0s4zis&cof=FORid%3A10&ie=ISO-8859-1&q=%22Private+Equity%22&searchtype=All"
driver = webdriver.Chrome(r"C:\Users\Ritesh\PycharmProjects\BS\drivers\chromedriver.exe")
driver.get(url)

def getnames(driver):
    soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
    sink = soup.find("div", {"class": "gsc-results gsc-webResult"})
    links = sink.find_all('a')
    for link in links:
        try:
            print(link['href'])
        except:
            print("")

while True:
    getnames(driver)
    time.sleep(5)
    nextpage = driver.find_element_by_link_text("2")
    nextpage.click()
    time.sleep(2)

Please help me solve this issue.

You will need to use Selenium, since the page contains dynamic elements. The code below will get all the links from each page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
import time

url = "https://10times.com/search?cx=partner-pub-8525015516580200%3Avtujn0s4zis&cof=FORid%3A10&ie=ISO-8859-1&q=%22Private+Equity%22&searchtype=All"
driver = webdriver.Chrome(r"C:\Users\Ritesh\PycharmProjects\BS\drivers\chromedriver.exe")
driver.get(url)

# wait until the Google CSE pagination block has rendered
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, """//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[2]/div[11]/div""")))


# one element per page number in the pager
pages_links = driver.find_elements_by_xpath("""//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[2]/div[11]/div/div""")

all_urls = []

for page_index in range(len(pages_links)):

    # re-locate the pager on every iteration: clicking re-renders the
    # results, so elements found earlier go stale
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.XPATH, """//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[2]/div[11]/div""")))

    pages_links = driver.find_elements_by_xpath("""//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[2]/div[11]/div/div""")

    page_link = pages_links[page_index]
    print "getting links for page: ", page_link.text

    page_link.click()

    time.sleep(1)


    # wait until all links are loaded
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.XPATH, """//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]""")))

    # the first result sits in a different container than the rest,
    # hence the two separate XPaths
    first_link = driver.find_element_by_xpath("""//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[1]/div[1]/div[1]/div/a""")

    results_links = driver.find_elements_by_xpath("""//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[2]/div/div[1]/div[1]/div/a""")

    # data-cturl carries the click-through URL for each result
    urls = [first_link.get_attribute("data-cturl")] + [l.get_attribute("data-cturl") for l in results_links]

    all_urls = all_urls + urls


driver.quit()

You can use this code as it is, or try to combine it with the one you already have.
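For example, here is a minimal sketch of one way to combine the two (not tested against the live page): drive the pagination with the XPath clicks from the answer above, but parse each page with the getnames() BeautifulSoup function from the question. The pager XPath and the gsc-results gsc-webResult class are taken from the snippets above, and the fixed time.sleep is a rough placeholder for a proper wait:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import bs4
import time

url = "https://10times.com/search?cx=partner-pub-8525015516580200%3Avtujn0s4zis&cof=FORid%3A10&ie=ISO-8859-1&q=%22Private+Equity%22&searchtype=All"
driver = webdriver.Chrome(r"C:\Users\Ritesh\PycharmProjects\BS\drivers\chromedriver.exe")
driver.get(url)

PAGER = '//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div/div/div[2]/div[11]/div/div'

def getnames(driver):
    # BeautifulSoup parsing from the question, made None-safe
    soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
    sink = soup.find("div", {"class": "gsc-results gsc-webResult"})
    if sink is None:  # results container not rendered yet
        return []
    return [a['href'] for a in sink.find_all('a') if a.has_attr('href')]

WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, PAGER)))

all_links = []
for page_index in range(len(driver.find_elements_by_xpath(PAGER))):
    # re-find the pager each time: clicking re-renders the results,
    # so previously located elements go stale
    driver.find_elements_by_xpath(PAGER)[page_index].click()
    time.sleep(2)  # crude wait for the new results to render
    all_links.extend(getnames(driver))

driver.quit()
print(all_links)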

Note that it does not consider the ad links, since I don't think you will need them, right?
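Also, the data-cturl attribute usually points at Google's click-tracking redirect rather than the result page itself. Here is a small, hedged post-processing sketch, assuming the redirect keeps the real destination in a url (or q) query parameter, which may not hold for every result:

from urllib.parse import urlparse, parse_qs

def unwrap_cturl(cturl):
    # Assumption: Google click-tracking URLs keep the real destination
    # in a "url" or "q" query parameter; fall back to the raw value
    # if neither is present.
    params = parse_qs(urlparse(cturl).query)
    for key in ("url", "q"):
        if key in params:
            return params[key][0]
    return cturl

clean_urls = [unwrap_cturl(u) for u in all_urls if u]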

Let me know if this helps.
