How to extract data from both th and td tags using Selenium in Python?
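I am trying to scrape data from the DOL website for a project that uses Selenium and Python. I want to scrape the column data and combine it into a data frame. The problem is that the first two columns are coded under <th> tags, so my XPath command does not work when I try to extract this data. I have been racking my brain and searching everywhere, and I cannot find anything that solves this. Please help.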
<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.action_chains import ActionChains
url = 'https://oui.doleta.gov/unemploy/claims.asp'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")
driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2022')
driver.find_element_by_css_selector('input[name="filetype"][value="html"]').click()
select = Select(driver.find_element_by_id('states'))
# Iterate through and select all states
for opt in select.options:
    opt.click()
input('Press ENTER to submit the form')
driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()
headers = []
heads = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[2]/th')
#Collect headers
for h in heads:
    headers.append(h.text)
rows = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr')
# Get row count
row_count = len(rows)
cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th/td')
# Get column count
col_count = len(cols)
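I tried this code
cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th' and '//* [@id="content"]/table/tbody/tr[3]/td')
as suggested. However, it still only pulls 5 columns, and as you can see from the HTML above, there are 7 columns. I need all of them. Please help?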
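One likely reason the attempt above still returns only the 5 <td> cells: Python evaluates the `and` between the two string literals before Selenium receives anything, and `A and B` yields `B` whenever `A` is a non-empty string. The sketch below shows this with plain strings (no browser needed). The variable names are illustrative only, and the union form using XPath's `|` operator is offered as an assumption about what was intended, not something tested against the live page.

```python
# The two XPath strings from the attempt above.
th_xpath = '//*[@id="content"]/table/tbody/tr[3]/th'
td_xpath = '//*[@id="content"]/table/tbody/tr[3]/td'

# Python's `and` returns its second operand when the first is truthy,
# and every non-empty string is truthy.
passed_to_selenium = th_xpath and td_xpath
print(passed_to_selenium == td_xpath)  # True: only the td path survives

# XPath has its own union operator, "|", which must live inside ONE string.
union_xpath = th_xpath + " | " + td_xpath
print(union_xpath)
```

Passing `union_xpath` to the same find call should select both the <th> and the <td> cells of that row, which browsers normally return in document order.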
You can use * or name() in the XPath to extract the data from all 7 columns. The XPath looks like this:
rows = driver.find_elements_by_xpath("//table/tbody/tr")
# Here `row` is one element of `rows`. The leading dot makes the XPath relative to that element.
cols = row.find_elements_by_xpath("./*") # Gets every column element inside the row.
Or
cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']") # Gets only the column elements with tag name "th" or "td" inside the row.
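To see what the relative `./*` path selects without starting a browser, here is a minimal sketch that parses the row from the question with the standard library's `xml.etree.ElementTree` instead of Selenium. This is a swapped-in tool for illustration: ElementTree supports only a subset of XPath, so only the `./*` form is shown.

```python
import xml.etree.ElementTree as ET

# The <tr> from the question, copied as-is (it happens to be well-formed XML).
ROW_HTML = """
<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>
"""

row = ET.fromstring(ROW_HTML.strip())

# "./*" means: every direct child element of `row`, whatever its tag name.
children = row.findall("./*")
tags = [child.tag for child in children]
cells = [child.text for child in children]

print(tags)   # ['th', 'th', 'td', 'td', 'td', 'td', 'td']
print(cells)  # ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
```

Because the path does not name a tag, the two <th> cells come back together with the five <td> cells, in document order.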
Try the following:
# Get the rows
rows = driver.find_elements_by_xpath("//table/tbody/tr")
# Iterate over the rows
for row in rows:
    # Get all the columns for each row.
    # cols = row.find_elements_by_xpath("./*")
    cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
    temp = []  # Temporary list holding the text of each cell in this row
    for col in cols:
        temp.append(col.text)
    print(temp)
['']
['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended', 'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate']
['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10']
['Alabama', '01/18/2020', '2,483', '01/11/2020', '17,402', '1,923,741', '0.90']
...
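Since the goal in the question is a data frame, the printed lists can be turned into one record per data row. The sketch below is plain Python and works on a hard-coded copy of the first few output lines. The helper name `rows_to_records` is hypothetical, and it assumes the first row with more than one cell is the header row.

```python
# A copy of the first rows printed above.
scraped = [
    [''],
    ['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended',
     'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate'],
    ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96'],
    ['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10'],
]


def rows_to_records(rows):
    """Map each data row onto the header row (hypothetical helper).

    Assumes the first row with more than one cell is the header.
    """
    full_rows = [r for r in rows if len(r) > 1]  # drops the [''] title row
    header, data = full_rows[0], full_rows[1:]
    # Keep only rows that have exactly one value per header column.
    return [dict(zip(header, r)) for r in data if len(r) == len(header)]


records = rows_to_records(scraped)
print(records[0]['State'], records[0]['Initial Claims'])  # Alabama 4,578
```

If pandas is installed, `pandas.DataFrame(records)` should build the data frame from this list of dictionaries. The numbers are still strings with thousands separators at that point and would need converting.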
To scrape the data from the <th> and <td> tags, you can use a list comprehension together with the following locator strategies:
Code block:
# Imports this snippet relies on (`driver` is the Chrome instance created in the question)
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

driver.get("https://oui.doleta.gov/unemploy/claims.asp")
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='state']"))).click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2022')
Select(driver.find_element_by_id('states')).select_by_visible_text('Alabama')
driver.find_element_by_css_selector("input[value='Submit']").click()
# To print all the texts from the first row
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3)"))).text)
print("*****")
# To create a List with all the texts from the first row using List Comprehension
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3) [align='right']")))])
driver.quit()
Console output:
Alabama 01/04/2020 4,578 12/28/2019 18,523 1,923,741 0.96
*****
['01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
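Note that the list in this output has 6 items, not 7: 'Alabama' is missing. A plausible explanation, given the HTML in the question, is that the `[align='right']` filter skips the State cell, which is marked align="left". The sketch below checks that reasoning offline with the standard library's `xml.etree.ElementTree` (not Selenium) against the row from the question.

```python
import xml.etree.ElementTree as ET

# The <tr> from the question, joined into one well-formed string.
row = ET.fromstring(
    '<tr>'
    '<th id="Alabama" align="left">Alabama</th>'
    '<th id="01/04/2020" align="right">01/04/2020</th>'
    '<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>'
    '<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>'
    '<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>'
    '<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>'
    '<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>'
    '</tr>'
)

# Children filtered on align="right": the State cell is left-aligned, so it drops out.
right_aligned = [cell.text for cell in row.findall("./*[@align='right']")]
# Children with no attribute filter: all seven cells.
all_cells = [cell.text for cell in row.findall("./*")]

print(len(right_aligned), len(all_cells))  # 6 7
print(all_cells[0])                        # Alabama
```

In the Selenium snippet, a CSS selector that does not filter on align, for example one ending in `tr:nth-child(3) > *`, would presumably return all seven cells. That variant is an assumption and has not been run against the live page.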