How to extract data from both th and td tags using Selenium in Python?

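I'm trying to scrape data from the DOL website for a project that uses Selenium and Python. I want to scrape the column data and combine it into a data frame. The problem is that the first two columns are coded under <th> tags, so my XPath commands don't work when I try to extract that data. I really need help. I've been racking my brain and searching everywhere, and I can't find anything that addresses this problem. Please help.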

   <tr>
   <th id="Alabama" align="left">Alabama</th>
   <th id="01/04/2020" align="right">01/04/2020</th>
   <td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
   <td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
   <td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
   <td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
   <td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
   </tr>

   from selenium import webdriver
   from webdriver_manager.chrome import ChromeDriverManager
   from selenium.webdriver.support.select import Select
   from selenium.webdriver.common.action_chains import ActionChains
   
   url = 'https://oui.doleta.gov/unemploy/claims.asp'
   driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")
   
   driver.implicitly_wait(10)
   driver.get(url)
   driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
   Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
   Select(driver.find_element_by_name('enddate')).select_by_value('2022')
   driver.find_element_by_css_selector('input[name="filetype"][value="html"]').click()
   select = Select(driver.find_element_by_id('states'))

   # Iterate through and select all states
   for opt in select.options:
       opt.click()
   input('Press ENTER to submit the form')
   driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()

   headers = []
   heads = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[2]/th')

   #Collect headers
   for h in heads:
       headers.append(h.text)

   rows = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr')
   
   # Get row count
   row_count = len(rows) 

   cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th/td')
   # Get column count
   col_count = len(cols)

I tried this code, as suggested:

   cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th' and '//* [@id="content"]/table/tbody/tr[3]/td')

However, it still pulls only 5 columns, even though the HTML above shows that there are 7. I need all of them. Please help?
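
One likely reason the attempt above still returns only 5 columns: `and` here is Python's boolean operator, not an XPath operator. Between two non-empty strings it evaluates to the second one, so only the `td` expression ever reaches Selenium. A minimal sketch of that behavior follows, together with XPath's union operator `|` as one way to combine both expressions in a single string. The variable names are illustrative, and the stray space in the second expression has been normalized.

```python
# The two XPath expressions from the attempt above.
th_xpath = '//*[@id="content"]/table/tbody/tr[3]/th'
td_xpath = '//*[@id="content"]/table/tbody/tr[3]/td'

# Python evaluates `a and b` before the method is called; for two
# non-empty strings the result is simply the second operand.
passed_to_selenium = th_xpath and td_xpath
print(passed_to_selenium == td_xpath)  # the th expression is dropped

# XPath's own union operator keeps both node sets in one expression.
union_xpath = th_xpath + ' | ' + td_xpath
print(union_xpath)
```

Passing a string like `union_xpath` to the same `find_elements_by_xpath` call should match the `th` and `td` cells together, in document order.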

With `*`, optionally filtered by `name()`, a single XPath can extract the data from all 7 columns. The XPath looks like this:

rows = driver.find_elements_by_xpath("//table/tbody/tr")

# For each `row` in `rows`: the leading dot makes the XPath relative to that row.
cols = row.find_elements_by_xpath("./*")  # every child element of the row

# or

cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")  # only children whose tag name is "th" or "td"
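
The relative `./*` step can be checked offline, without a browser. The sketch below swaps Selenium for the standard library's `xml.etree.ElementTree`, which understands a small XPath subset (including `./*`, but not the `name()` predicate), and runs it on a cut-down copy of the table. The `SAMPLE` markup and its trimmed columns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed copy of the report table (illustrative only).
SAMPLE = """
<table><tbody>
  <tr><th>State</th><th>Filed week ended</th><th>Initial Claims</th></tr>
  <tr><th>Alabama</th><th>01/04/2020</th><td>4,578</td></tr>
  <tr><th>Alabama</th><th>01/11/2020</th><td>3,629</td></tr>
</tbody></table>
""".strip()

table = ET.fromstring(SAMPLE)
rows_text = []
for row in table.findall("./tbody/tr"):
    # "./*" is relative to `row`: every direct child, whatever its tag.
    cells = row.findall("./*")
    # Stand-in for the name() filter, done in Python instead of XPath.
    cells = [c for c in cells if c.tag in ("th", "td")]
    rows_text.append([c.text for c in cells])

for texts in rows_text:
    print(texts)
```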

Try it like this:

# Get the rows
rows = driver.find_elements_by_xpath("//table/tbody/tr")

# Iterate over the rows
for row in rows:
    # Get all the columns for each row. 
    # cols = row.find_elements_by_xpath("./*")
    cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
    temp = []  # Temporary list for the cell texts of this row
    for col in cols:
        temp.append(col.text)
    print(temp)

Output:

['']
['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended', 'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate']
['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10']
['Alabama', '01/18/2020', '2,483', '01/11/2020', '17,402', '1,923,741', '0.90']
...
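
Since the goal in the question is a data frame, the row lists printed above still need a header and numeric types. Here is a hedged sketch of one way to do that in plain Python. The `scraped` list is copied from the output above, while `to_number` and `to_records` are hypothetical helpers that are not part of the original answer.

```python
# First rows of the output above, as collected by the loop.
scraped = [
    [''],
    ['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended',
     'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate'],
    ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96'],
    ['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10'],
]

def to_number(text):
    """Turn '4,578' into 4578 and '0.96' into 0.96; leave other text alone."""
    cleaned = text.replace(',', '')
    try:
        return int(cleaned)
    except ValueError:
        try:
            return float(cleaned)
        except ValueError:
            return text

def to_records(rows):
    """Use the first multi-cell row as the header and map the rest onto it."""
    full = [r for r in rows if len(r) > 1]  # drops the [''] title row
    header, data = full[0], full[1:]
    return [dict(zip(header, map(to_number, r))) for r in data]

records = to_records(scraped)
print(records[0])
```

If pandas is installed, a list of dicts like `records` can be passed straight to `pandas.DataFrame(records)`.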

To scrape the data from the <th> and <td> tags, you can use a list comprehension with the following locator strategies:

  • Code block:

     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import Select, WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     # `driver` is assumed to be the Chrome WebDriver created in the question.
     driver.get("https://oui.doleta.gov/unemploy/claims.asp")
     WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='state']"))).click()
     Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
     Select(driver.find_element_by_name('enddate')).select_by_value('2022')
     Select(driver.find_element_by_id('states')).select_by_visible_text('Alabama')
     driver.find_element_by_css_selector("input[value='Submit']").click()
     # To print all the texts from the first row
     print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3)"))).text)
     print("*****")
     # To create a list with all the texts from the first row using a list comprehension
     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3) [align='right']")))])
     driver.quit()
  • Console output:

     Alabama 01/04/2020 4,578 12/28/2019 18,523 1,923,741 0.96
     *****
     ['01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
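
Note that the list in the console output has six entries, not seven: the `[align='right']` selector skips the state cell, which is marked `align="left"` in the question's HTML. This can be reproduced offline. The sketch below uses the standard library's `xml.etree.ElementTree` in place of Selenium, with an XPath attribute predicate standing in for the CSS attribute selector, so it is an approximation of the answer's locator rather than the locator itself.

```python
import xml.etree.ElementTree as ET

# The row from the question, unchanged apart from indentation.
ROW = """<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>"""

row = ET.fromstring(ROW)

# Only the cells carrying align="right" (the state cell is align="left").
right_aligned = [c.text for c in row.findall("./*[@align='right']")]
# Every child cell of the row, regardless of tag or alignment.
all_cells = [c.text for c in row.findall("./*")]

print(len(right_aligned), right_aligned)
print(len(all_cells), all_cells)
```

Selecting every child cell of the row, rather than only the right-aligned ones, keeps the state name in the result.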

