
Scraping iteratively selected HTML data items inside each option in a dropdown list

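I am trying to scrape some items from an HTML page. I have to select each option from a dropdown list and then iterate over the items it shows, but I always get the items from the first option in the dropdown. I guess this is because my click function is not working properly. How can I iterate over all the options and select the items to build the data?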

import pandas as pd
from selenium import webdriver
import re
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
service = Service("/home/ubuntu/selenium_drivers/chromedriver")

base_url = 'https://www.crave.ca/en/tv-shows/16-and-pregnant'
page_one = True
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=service, options=options)
driver.get(base_url)
driver.find_element(By.XPATH,'//*[@id="dropdown-basic"]').click()
time.sleep(5)
total_seasons = driver.find_elements(By.CSS_SELECTOR,'button.dropdown-item')
driver.find_element(By.XPATH,'//*[@id="dropdown-basic"]').click()
print(len(total_seasons))
d=[]
for i in range(0,len(total_seasons)):
    alleps = driver.find_elements(By.XPATH,'//*[@id="episodes"]/div/ul/li')
    for j in range(1,len(alleps)+1):

        d.append({
            
            'Duration ': driver.find_element(By.XPATH,f'//*[@id="episodes"]/div/ul/li[{j}]/div[1]/div[2]/span/span[1]').text,
            'Episode_Number ': j,
            'Episode_Synopsis ': driver.find_element(By.XPATH,f'//*[@id="episodes"]/div/ul/li[{j}]/div[1]/div[2]/p').text,
            'Episode_Title ': re.sub(r'[^a-zA-Z ]+','',driver.find_element(By.XPATH,f'//*[@id="episodes"]/div/ul/li[{j}]/div[1]/div[2]/h3').text).strip(),
            
        })
data = pd.DataFrame.from_dict(d)

You are clicking this element twice:

driver.find_element(By.XPATH,'//*[@id="dropdown-basic"]').click()

So you open the dropdown menu and then close it again. You never select any other season.
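
The effect of the double click can be illustrated without a browser. The sketch below uses a hypothetical `FakeDropdown` class (not part of Selenium) that assumes the toggle flips the menu open and closed, that the page loads with the first option active, and that choosing an option closes the menu. Under those assumptions, toggling twice and then looping reads the same season every time.

```python
class FakeDropdown:
    """Hypothetical stand-in for a toggle button plus its option list."""

    def __init__(self, options):
        self.options = list(options)
        self.is_open = False
        # Assumption: the page loads with the first option active.
        self.selected = self.options[0]

    def click_toggle(self):
        # Each click on the toggle flips the menu between open and closed.
        self.is_open = not self.is_open

    def click_option(self, index):
        # A closed menu exposes nothing to click.
        if not self.is_open:
            raise RuntimeError("menu is closed; option is not clickable")
        self.selected = self.options[index]
        self.is_open = False  # choosing an option closes the menu


def visit_like_question(dropdown):
    """Mimic the question: toggle twice, then loop without picking anything."""
    dropdown.click_toggle()  # opens the menu
    dropdown.click_toggle()  # closes it again
    return [dropdown.selected for _ in dropdown.options]


def visit_each_option(dropdown):
    """Open the menu and pick option i on every pass of the loop."""
    seen = []
    for index in range(len(dropdown.options)):
        dropdown.click_toggle()
        dropdown.click_option(index)
        seen.append(dropdown.selected)
    return seen


if __name__ == "__main__":
    seasons = ["Season 1", "Season 2", "Season 3"]
    print(visit_like_question(FakeDropdown(seasons)))
    print(visit_each_option(FakeDropdown(seasons)))
```

The first function returns the initial season once per option, which matches the symptom described in the question; the second returns each season once.
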
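To make your code work better, first scrape the Season 1 data without selecting any other season, then iterate through the other seasons, selecting them one by one and scraping their data.
Your code could look something like this: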

import re
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

service = Service("/home/ubuntu/selenium_drivers/chromedriver")

base_url = 'https://www.crave.ca/en/tv-shows/16-and-pregnant'
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=service, options=options)
driver.get(base_url)

# Open the dropdown once to count the seasons, then close it again.
driver.find_element(By.XPATH,'//*[@id="dropdown-basic"]').click()
time.sleep(1)
total_seasons = driver.find_elements(By.CSS_SELECTOR,'button.dropdown-item')
driver.find_element(By.XPATH,'//*[@id="dropdown-basic"]').click()
print(len(total_seasons))

d=[]
for i in range(len(total_seasons)):
    # Scrape every episode of the season currently displayed.
    alleps = driver.find_elements(By.XPATH,'//*[@id="episodes"]/div/ul/li')
    for j in range(1,len(alleps)+1):

        d.append({

            'Duration ': driver.find_element(By.XPATH,f'//*[@id="episodes"]/div/ul/li[{j}]/div[1]/div[2]/span/span[1]').text,
            'Episode_Number ': j,
            'Episode_Synopsis ': driver.find_element(By.XPATH,f'//*[@id="episodes"]/div/ul/li[{j}]/div[1]/div[2]/p').text,
            # Keep only letters and spaces in the title.
            'Episode_Title ': re.sub(r'[^a-zA-Z ]+','',driver.find_element(By.XPATH,f'//*[@id="episodes"]/div/ul/li[{j}]/div[1]/div[2]/h3').text).strip(),

        })
    # Reopen the dropdown and click option i; the next pass of the loop scrapes it.
    # The options are looked up again because earlier references may be stale.
    driver.find_element(By.XPATH,'//*[@id="dropdown-basic"]').click()
    seasons = driver.find_elements(By.CSS_SELECTOR,'button.dropdown-item')
    seasons[i].click()
    time.sleep(1)

driver.quit()
data = pd.DataFrame.from_dict(d)
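
Because the loop above scrapes first and selects afterwards, which seasons end up in `d` depends on which season the page shows when it loads. A variation that does not depend on this is to select option `i` first and then scrape what it displays, tagging each row with the season. The following is only a sketch under assumptions: `clean_title`, `scrape_visible_episodes`, `scrape_all_seasons` and the `Season` key are hypothetical names introduced here, the text of each dropdown button is assumed to be the season label, and the locator strategies are passed as the plain strings that Selenium's `By.XPATH` and `By.CSS_SELECTOR` stand for, so the functions can also be exercised with a stub driver. The dictionary keys drop the trailing spaces used in the code above.

```python
import re
import time

# String values behind Selenium's By.XPATH and By.CSS_SELECTOR.
XPATH = "xpath"
CSS_SELECTOR = "css selector"

TOGGLE_XPATH = '//*[@id="dropdown-basic"]'
OPTION_CSS = "button.dropdown-item"
EPISODES_XPATH = '//*[@id="episodes"]/div/ul/li'


def clean_title(raw):
    """Keep only ASCII letters and spaces, like the regex in the answer."""
    return re.sub(r"[^a-zA-Z ]+", "", raw).strip()


def scrape_visible_episodes(driver, season_label):
    """Read every episode row of the season currently shown on the page."""
    rows = []
    items = driver.find_elements(XPATH, EPISODES_XPATH)
    for number, item in enumerate(items, start=1):
        # Find the detail block once, then search relative to it.
        info = item.find_element(XPATH, "./div[1]/div[2]")
        rows.append({
            "Season": season_label,
            "Episode_Number": number,
            "Episode_Title": clean_title(info.find_element(XPATH, "./h3").text),
            "Episode_Synopsis": info.find_element(XPATH, "./p").text,
            "Duration": info.find_element(XPATH, "./span/span[1]").text,
        })
    return rows


def scrape_all_seasons(driver, pause=1.0):
    """Select each dropdown option in turn, then scrape what it shows."""
    # Open the menu once to count the options, then close it again.
    driver.find_element(XPATH, TOGGLE_XPATH).click()
    time.sleep(pause)
    count = len(driver.find_elements(CSS_SELECTOR, OPTION_CSS))
    driver.find_element(XPATH, TOGGLE_XPATH).click()

    rows = []
    for index in range(count):
        driver.find_element(XPATH, TOGGLE_XPATH).click()
        time.sleep(pause)
        # Look the options up again on every pass: references kept from
        # an earlier pass can go stale after the page re-renders.
        option = driver.find_elements(CSS_SELECTOR, OPTION_CSS)[index]
        label = option.text  # assumption: the button text names the season
        option.click()
        time.sleep(pause)
        rows.extend(scrape_visible_episodes(driver, label))
    return rows
```

With the `driver` created above, `data = pd.DataFrame(scrape_all_seasons(driver))` would build the table. Note that the title regex also removes digits, so a title such as `Season 2 Finale!` becomes `Season  Finale` with a double space; whether that is acceptable depends on the titles being scraped.
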

