
Python web scraping: HTML comes back incomplete

When I run my code, the HTML comes back with data missing. What could be causing this?
Everything was working fine until I changed the code to use Selenium expected conditions.

The code below is not complete because the full version was not accepted here, but it should be enough to show what is happening.

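from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# "options" is assumed to be created in the part of the script that was left out
navegador = webdriver.Firefox(options = options)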

wait = WebDriverWait(navegador, 30)

link = '******'
navegador.get(url = link)

wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtLogin"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtSenha"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_btnEnviar"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_TreeView2t8"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[title='07 de dezembro']"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"]/option[2]'))).click()
teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')

soup = BeautifulSoup(teste, "html.parser")

I get the following back.

<table align="center" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid" width="100%">
<tbody><tr>
<td>
<table>
<tbody><tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_Label1" style="font-size:12px;">Terminal - Empresa - Exportador:</span>
</td>
<td>
<select class="TextBox" id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa" name="ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa" onchange="javascript:setTimeout('__doPostBack(\'ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa\',\'\')', 0)" style="width: 475px;">
<option selected="selected" value="0">Selecione um Terminal.</option>
<option value="68623">TEAG - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP</option>
<option value="68594">TEG  - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP</option>
</select>
</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_lbl_titulo_principal" style="font-size:12px;">Disponibilização de vagas do dia: 07/12/2022</span></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top">
</td>
</tr>
<tr>

This is what I should get back.

        </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td valign="top">
            <div id="ctl00_ctl00_Content_Content_pn_turno_1" style="width:100%;">
    
            <table width="100%" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid">
                <tbody><tr>
                    <td class="Titulo">
                        <span id="ctl00_ctl00_Content_Content_lbl_turno_1">Turno 01 - intervalo: 7/12/2022 0:00:00 as 7/12/2022 1:00:00</span></td>
                </tr>
                <tr>
                    <td style="height:200px;width: 100%;" valign="top">
                        <table border="0" class="Grid" cellpadding="4" cellspacing="2" style="font-size:14;width: 100%;z-index: -1;">
                                                                   
                                    </table>                                                                    
                                    <table border="0" class="Grid" cellpadding="3" cellspacing="2" style="font-size:14;width: 100%">
                                
                                    <tbody><tr class="GridRow">                                
                                        <td width="12%" align="center">
                                            <span id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_lblEmpresaTerminal_1" title="TEAG - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP" style="font-size:7px;">CARGILL - TEAG</span>
                                            <input type="image" name="ctl00$ctl00$Content$Content$rpt_turno_1$ctl01$imb_vaga_1" id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_imb_vaga_1" title="Vaga agendada." src="../App_Themes/SisLog/Images/caminhao.png" onclick="javascript:window.open('Cadastro.aspx?id_agenda=7054462&amp;id_turno=7/12/2022 0:00:00;7/12/2022 1:00:00&amp;data=07/12/2022&amp;id_turno_exportador=198574&amp;id_turno_agenda=61348&amp;id_transportadora=23213&amp;id_turno_transp=68623&amp;id_Cliente=7708&amp;codigo_terminal=7708&amp;codigo_empresa=1&amp;codigo_exportador=24978&amp;codigo_transportador=23213&amp;codigo_turno=1&amp;turno_transp_vg=68623','_blank','height=850,width=1000,top=(screen.width)?(screen.width-1000)/2 : 0,left=(screen.height)?(screen.height-700)/2 : 0,toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=yes,resizable=no');" style="height:20px;border-width:0px;">                                                
                                        </td>

Since you did not share a link to the page you are working on, we can only guess at the cause.
My guess is that you are reading the HTML from an element that has not finished rendering.
To fix this, try changing presence_of_element_located to visibility_of_element_located in the line teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML'), so that it becomes:

teste = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')

If that is not enough, try adding a short delay before reading the HTML, as follows:

import time

wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')

If the element is not visible, so that visibility_of_element_located cannot be used on it, use presence_of_element_located with a delay instead:

wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')
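
A fixed time.sleep(2) can be too short on a slow day and wasteful on a fast one. As a rough alternative, the sketch below polls until the HTML actually contains the content you expect. The helper names wait_for_complete_html and has_shift_panel are hypothetical, and the marker "Content_pn_turno_" is only an assumption taken from the expected output in the question; adjust it to whatever reliably appears once the page has finished loading.

```python
import time


def wait_for_complete_html(fetch_html, is_complete, timeout=30.0, poll=0.5):
    """Call fetch_html() repeatedly until is_complete(html) is true.

    Returns the first HTML string that passes the check.
    Raises TimeoutError if no complete HTML is seen before the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = fetch_html()
        if html and is_complete(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("HTML was still incomplete after %.1f s" % timeout)
        time.sleep(poll)


def has_shift_panel(html):
    # Assumed marker: the id prefix of the shift panel shown in the
    # expected output of the question.
    return "Content_pn_turno_" in html


# Usage with the driver from the question (not executed here, since it
# needs Selenium and access to the site):
# teste = wait_for_complete_html(
#     lambda: navegador.find_element(By.ID, "divScroll").get_attribute("innerHTML"),
#     has_shift_panel,
# )
```

The element is looked up again on every attempt, so the HTML is read from the current divScroll rather than from a reference obtained before the dropdown's postback refreshed the page.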
