Why is Selenium only fetching the text of the first ToolTip on the page?

As part of a larger webscraper built with Python, Selenium, and BeautifulSoup, I'm trying to get the text of all the tooltips on this page: https://www.legis.state.pa.us/CFDocs/Legis/BS/bs_action.cfm?SessId=20190&Sponsors=S|44|0|Katie%20J.%20Muth

My current code is successfully fetching all the links and mousing over each link--when I run it, I see each tooltip pop up in succession. However, it's only outputting the text of the very first tooltip. I have no idea why! I thought I might just need a longer wait time between mouseovers, but I went up as high as 20 seconds and it didn't solve the issue.

Here's the code:

 bill_links = soup.find_all('a', {'id': re.compile('Bill')})
 summaries = []
 bill_numbers = [link.text.strip() for link in bill_links]

 for link in bill_links:
   billid = link.get('id')
   action = ActionChains(driver)
   action.move_to_element(driver.find_element_by_id(billid)).perform()
   time.sleep(5)
   summary = driver.find_element_by_class_name("ToolTip-BillSummary-ShortTitle").text
   print(summary)
   summaries = summaries + [summary]
   action.reset_actions()

Again, the first print(summary) command is successfully returning the text of the first tooltip ("An Act amending the act of January 17, 1968...") -- but each subsequent print(summary) command just returns a blank.

I'm very new to programming, so apologies if there's an obvious answer.

tl;dr:

Selenium isn't needed. If you literally want the tooltip as shown (not the full text), you can use bs4 and replicate the JavaScript function the page uses. The parameters for the function call are found in the script tag adjacent to the a tag for each bill listing. I use a regex to extract them from that string and pass them to a user-defined function, which replicates the jQuery function.


You can see the associated call AddBillSummaryTooltip('#Bill_1',2019,0,'S','B','0012');


Tooltips:

import requests
from bs4 import BeautifulSoup as bs
import re

def add_bill_summary_tooltip(s, session_year, session_ind, bill_body, bill_type, bill_no):
    url = g_server_url + '/cfdocs/cfc/GenAsm.cfc?returnformat=plain'
    data = { 'method' : 'GetBillSummaryTooltip',
            'SessionYear' : session_year,
            'SessionInd' : session_ind,
            'BillBody' : bill_body,
            'BillType' : bill_type,
            'BillNo' : bill_no,
            'IsAjaxRequest' : '1'
            }

    r = s.get(url, params = data)
    soup = bs(r.content, 'lxml')
    tooltip = soup.select_one('.ToolTip-BillSummary-ShortTitle')
    if tooltip is not None:
        tooltip = tooltip.text.strip()
    return tooltip

g_server_url = "https://www.legis.state.pa.us"

#add_bill_summary_tooltip('#Bill_1',2019,0,'S','B','0012')
with requests.Session() as s:
    r = s.get('https://www.legis.state.pa.us/CFDocs/Legis/BS/bs_action.cfm?SessId=20190&Sponsors=S|44|0|Katie%20J.%20Muth')
    soup = bs(r.content, 'lxml')
    tooltips = {item.select_one('a').text:item.select_one('script').text[:-1] for item in soup.select('.DataTable td:has(a)')}
    p = re.compile(r"'(.*?)',(.*),(.*),'(.*)','(.*)','(.*)'")
    for bill in tooltips:
        arg1,arg2,arg3,arg4,arg5,arg6 = p.findall(tooltips[bill])[0]
        tooltips[bill] = add_bill_summary_tooltip(s, arg2, arg3,arg4,arg5,arg6)

print(tooltips)

Full text:

If you want the full text, you can grab the links to the full-text pages from the first page, then visit each page in a loop and grab the full text:

import requests
from bs4 import BeautifulSoup as bs

def add_bill_summary_full(s, url): 
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    summary = soup.select_one('.BillInfo-Section-Data div')
    if summary is not None:
        summary = summary.text
    return summary

g_server_url = "https://www.legis.state.pa.us"

with requests.Session() as s:
    r = s.get('https://www.legis.state.pa.us/CFDocs/Legis/BS/bs_action.cfm?SessId=20190&Sponsors=S|44|0|Katie%20J.%20Muth')
    soup = bs(r.content, 'lxml')
    full_text = {item.text:g_server_url + item['href'] for item in soup.select('.DataTable a')}
    for k,v in full_text.items():
        full_text[k] = add_bill_summary_full(s, v)

print(full_text)
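The dictionary comprehension above builds absolute URLs by string concatenation, which works only if every href is root-relative. As a hedged alternative, the sketch below resolves links against the page URL with the standard-library urljoin. The helper name absolute_url and the sample hrefs are made up for illustration and are not taken from the page.

```python
from urllib.parse import urljoin

# URL of the listing page, as used in the answer above.
PAGE_URL = ("https://www.legis.state.pa.us/CFDocs/Legis/BS/bs_action.cfm"
            "?SessId=20190&Sponsors=S|44|0|Katie%20J.%20Muth")

def absolute_url(href, page_url=PAGE_URL):
    """Resolve an href found on the page into an absolute URL (hypothetical helper)."""
    return urljoin(page_url, href)

# Hypothetical hrefs, only to show how each form is resolved.
print(absolute_url("/cfdocs/example/bill.cfm?bn=12"))   # root-relative
print(absolute_url("https://example.org/other.html"))   # already absolute
```

In the loop above, absolute_url(item['href']) could stand in for g_server_url + item['href'].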

This is the JavaScript function from the page source, which uses jQuery:

  function AddBillSummaryTooltip(element,SessionYear,SessionInd,BillBody,BillType,BillNo) {
      jQuery(element).qtip({
          content: {
              text: function(event, api) {
                  jQuery.ajax({
                      url: g_ServerURL + '/cfdocs/cfc/GenAsm.cfc?returnformat=plain',
                      data: {
                          method: 'GetBillSummaryTooltip',
                          SessionYear: SessionYear,
                          SessionInd: SessionInd,
                          BillBody: BillBody,
                          BillType: BillType,
                          BillNo: BillNo,
                          IsAjaxRequest: 1
                      }
                  })
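To make the request behind the tooltip concrete, here is a minimal sketch that builds the URL such a call would fetch. It assumes the request is a plain GET, so the data object is appended to the query string (jQuery.ajax's default behaviour). The function name build_tooltip_url is hypothetical; the parameter names are copied from the JavaScript above.

```python
from urllib.parse import urlencode

G_SERVER_URL = "https://www.legis.state.pa.us"

def build_tooltip_url(session_year, session_ind, bill_body, bill_type, bill_no):
    """Build the tooltip request URL (hypothetical helper, assumes a GET request)."""
    params = {
        "method": "GetBillSummaryTooltip",
        "SessionYear": session_year,
        "SessionInd": session_ind,
        "BillBody": bill_body,
        "BillType": bill_type,
        "BillNo": bill_no,
        "IsAjaxRequest": 1,
    }
    # The endpoint already carries "?returnformat=plain", so the rest is joined with "&".
    return G_SERVER_URL + "/cfdocs/cfc/GenAsm.cfc?returnformat=plain&" + urlencode(params)

# Arguments taken from the sample call AddBillSummaryTooltip('#Bill_1',2019,0,'S','B','0012')
print(build_tooltip_url(2019, 0, "S", "B", "0012"))
```

Passing the same dictionary as params to requests, as the Python code above does, should produce an equivalent request.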


Regex:

Try it here.

Explanation:
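As a rough illustration of what the pattern captures, the sketch below applies the regex from the code above to the sample call quoted earlier. The wrapper parse_tooltip_call is a hypothetical name introduced only for this example.

```python
import re

# Pattern copied from the answer's code: one lazy quoted group, two unquoted
# groups, then three quoted groups.
CALL_PATTERN = re.compile(r"'(.*?)',(.*),(.*),'(.*)','(.*)','(.*)'")

def parse_tooltip_call(script_text):
    """Return the six call arguments as strings, or None if nothing matches."""
    matches = CALL_PATTERN.findall(script_text)
    if not matches:
        return None
    return matches[0]

sample = "AddBillSummaryTooltip('#Bill_1',2019,0,'S','B','0012')"
print(parse_tooltip_call(sample))
# ('#Bill_1', '2019', '0', 'S', 'B', '0012')
```

The first group is the element selector, which the Python function does not need; the remaining five map to the session year, session indicator, bill body, bill type, and bill number.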

If you are using Selenium, you won't have to use BeautifulSoup. To extract the text of all the tooltips on the page https://www.legis.state.pa.us/CFDocs/Legis/BS/bs_action.cfm?SessId=20190&Sponsors=S|44|0|Katie%20J.%20Muth, you can use the following solution:

  • Code Block:

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC
     from selenium.webdriver.common.action_chains import ActionChains

     chrome_options = webdriver.ChromeOptions()
     chrome_options.add_argument("start-maximized")
     chrome_options.add_argument('disable-infobars')
     driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\\Utility\\BrowserDrivers\\chromedriver.exe')
     driver.get("https://www.legis.state.pa.us/CFDocs/Legis/BS/bs_action.cfm?SessId=20190&Sponsors=S|44|0|Katie%20J.%20Muth")
     for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='DataTable']/tbody//tr/td/a"))):
         senete_bill_shorten_number = elem.get_attribute("innerHTML").split()[1]
         ActionChains(driver).move_to_element(elem).perform()
         print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='ToolTip-BillSummary']/div[@class='ToolTip-BillSummary-Title' and contains(., '" + senete_bill_shorten_number + "')]//following::div[2]"))).get_attribute("innerHTML"))

  • Console Output:

      An Act amending the act of January 17, 1968 (PL11, No.5), known as The Minimum Wage Act of 1968, further providing for definitions and for minimum wages; providing for gratuities; further providing for enforcement and rules and regulations, for pe ...
      An Act providing for mandatory Statewide employer-paid sick leave for employees and for civil penalties and remedies.
      An Act amending Title 42 (Judiciary and Judicial Procedure) of the Pennsylvania Consolidated Statutes, in judicial boards and commissions, providing for adoption of guidelines for administrative probation violations; and, in sentencing, further provi ...
      An Act amending the act of May 22, 1951 (PL317, No.69), known as The Professional Nursing Law, further providing for title, for definitions, for State Board of Nursing, for dietitian-nutritionist license required, for unauthorized practices and ac ...
      An Act amending the act of March 4, 1971 (PL6, No.2), known as the Tax Reform Code of 1971, providing for Pennsylvania Housing Tax Credit.
      An Act amending the act of December 3, 1959 (PL1688, No.621), known as the Housing Finance Agency Law, in Pennsylvania Housing Affordability and Rehabilitation Enhancement Program, further providing for fund.
      An Act amending the act of March 10, 1949 (PL30, No.14), known as the Public School Code of 1949, in charter schools, further providing for funding for charter schools.
      An Act amending the act of June 13, 1967 (PL31, No.21), known as the Human Services Code, in departmental powers and duties as to supervision, providing for lead testing in children's institutions; and, in departmental powers and duties as to lice ...
      An Act providing for the protection of water supplies.
      An Act amending Title 35 (Health and Safety) of the Pennsylvania Consolidated Statutes, providing for emergency addiction treatment; and imposing powers and duties on the Department of Drug and Alcohol Programs.
      An Act amending Title 18 (Crimes and Offenses) of the Pennsylvania Consolidated Statutes, providing for transfer and sale of animals.
      An Act amending Title 42 (Judiciary and Judicial Procedure) of the Pennsylvania Consolidated Statutes, in particular rights and immunities, providing for civil immunity of person rescuing minor from motor vehicle.
      An Act providing for health care insurance coverage protections, for duties of the Insurance Department and the Insurance Commissioner, for regulations, for enforcement and for penalties.
      An Act amending the act of May 17, 1921 (PL682, No.284), known as The Insurance Company Law of 1921, in casualty insurance, providing coverage for essential health benefits.
      An Act amending the act of October 27, 1955 (PL744, No.222), known as the Pennsylvania Human Relations Act, further providing for definitions and for unlawful discriminatory practices.
      An Act amending Titles 18 (Crimes and Offenses) and 42 (Judiciary and Judicial Procedure) of the Pennsylvania Consolidated Statutes, in human trafficking, further providing for the offense of trafficking in individuals and for the offense of patroniz ...
      An Act amending Title 75 (Vehicles) of the Pennsylvania Consolidated Statutes, in registration of vehicles, further providing for veteran plates and placard.
      An Act providing for health insurance coverage requirements for stage four, advanced metastatic cancer.
      An Act authorizing the Commonwealth of Pennsylvania to join the Psychology Interjurisdictional Compact; providing for the form of the compact; imposing additional powers and duties on the Governor, the Secretary of the Commonwealth and the Compact.
      An Act amending Titles 42 (Judiciary and Judicial Procedure) and 75 (Vehicles) of the Pennsylvania Consolidated Statutes, in sentencing, further providing for payment of court costs, restitution and fines, for fine and for failure to pay fine; in lic ...
      An Act amending the act of January 17, 1968 (PL11, No.5), known as The Minimum Wage Act of 1968, further providing for definitions and for rate of minimum wages; and providing for reporting by the Department of Labor and Industry.
      An Act amending Title 23 (Domestic Relations) of the Pennsylvania Consolidated Statutes, in marriage license, further providing for restrictions on issuance of license.
      An Act amending the act of March 4, 1971 (PL6, No.2), known as the Tax Reform Code of 1971, in sales and use tax, further providing for exclusions from tax.

The problem might be due to this line of your code:

summary = driver.find_element_by_class_name("ToolTip-BillSummary-ShortTitle").text

Your lookup is restricted only by the element's class name. Several elements on the page may match that class, and find_element_by_class_name returns only the first match, so you never specify which tooltip's text to read.

To fix this, use an XPath expression instead (you need an index variable to locate the element):

summary = driver.find_element_by_xpath('//*[@id="qtip-' + str(index) + '-content"]/div/div[3]').text
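A minimal sketch of how the indexed XPath could be generated inside the loop. It assumes the tooltip containers get ids of the form qtip-<index>-content, numbered sequentially from 0 in the order the tooltips first appear; that numbering has not been checked against the page. The helper name qtip_content_xpath is hypothetical.

```python
def qtip_content_xpath(index):
    """Build the XPath for the tooltip with the given index (assumed id scheme)."""
    return '//*[@id="qtip-' + str(index) + '-content"]/div/div[3]'

# Print the expressions for the first three tooltips.
for index in range(3):
    print(qtip_content_xpath(index))
```

In the question's loop, wrapping bill_links in enumerate would supply the index, and the resulting string would be passed to driver.find_element_by_xpath.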
