Can't get text without tag using Selenium Python

First of all, to better explain myself, here is the code I'm having trouble with.

<div class="archivos"> ... </div>
<br>
<br>
<br>
<br>
THIS IS THE TEXT THAT I WANT TO CHECK
<div class="archivos"> ... </div>
...

I'm using Selenium in Python.

This is a piece of the HTML that I'm working with. Inside each div with class="archivos" there is a link that I may want to click, but first I need to analyze the text above it to decide whether to click the link.

The problem is that the text is not wrapped in any tag, and I can't find a way to read it so I can search it for the information I want. The text changes every time, so I need to locate whatever text precedes each div with class="archivos".

So far I've tried many approaches, mainly XPath, to get at whatever comes just before the div. I haven't come up with anything that works yet, as I'm not very experienced with Selenium or XPath.

I found https://chercher.tech/python/relative-xpath-selenium-python, which helped me try some XPaths, along with several answers here on SO, but to no avail.

I've read somewhere that Selenium can run JavaScript code from Python to get the text, but I don't know JavaScript and don't know how to do it. Maybe somebody understands what I'm talking about.

This is the webpage if it helps: http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901

Thanks in advance for the help, and I'll provide any further information if it's needed.
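
On the JavaScript idea mentioned in the question: Selenium's `execute_script` can hand a located element to a small script that walks backwards through its siblings until it reaches a text node. The following is only a minimal sketch. It assumes the text is a direct sibling of the div and is separated from it by nothing but `<br>` tags, as in the snippet above; the real page may be structured differently. The names `PREVIOUS_TEXT_JS`, `should_click` and `find_matching_links` are made up for illustration.

```python
# JavaScript run in the browser. arguments[0] is the div passed from Python.
# It walks backwards over siblings, skipping <br> elements and blank text,
# and gives up at any other element so text from an earlier entry is not read.
PREVIOUS_TEXT_JS = """
var node = arguments[0].previousSibling;
while (node) {
    if (node.nodeType === Node.TEXT_NODE && node.textContent.trim() !== '') {
        return node.textContent.trim();
    }
    if (node.nodeType === Node.ELEMENT_NODE && node.tagName !== 'BR') {
        return '';
    }
    node = node.previousSibling;
}
return '';
"""


def should_click(text, keyword):
    """Hypothetical rule: the link is wanted if the keyword occurs in the text."""
    return keyword.lower() in text.lower()


def find_matching_links(driver, keyword):
    """Return (text, href) pairs for each div.archivos whose preceding text matches."""
    # Imported here so the pure helper above works without Selenium installed.
    from selenium.webdriver.common.by import By

    matches = []
    for div in driver.find_elements(By.CSS_SELECTOR, "div.archivos"):
        text = driver.execute_script(PREVIOUS_TEXT_JS, div)
        if not should_click(text, keyword):
            continue
        links = div.find_elements(By.TAG_NAME, "a")
        if links:
            matches.append((text, links[0].get_attribute("href")))
    return matches
```

With a driver that has already loaded the page, calling `find_matching_links(driver, "ORDEN")` and then `driver.get(href)` on a returned link has the same effect as clicking it. Collecting the hrefs first avoids stale element references once the browser navigates away.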

Here is an example of how to extract the preceding text with BeautifulSoup. I loaded the page with the requests module, but you can also feed the HTML source from Selenium to BeautifulSoup:

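import requests
from bs4 import BeautifulSoup


url = 'http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERLST&DOCS=1-200&BASE=BOLE&SEC=FIRMA&SEPARADOR=&PUBL=20200901'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each entry's link lives in an element with class "archivos"
for t in soup.select('.archivos'):
    # Nearest text node before the element (string= is the current name of the older text= argument)
    previous_text = t.find_previous(string=True).strip()
    # The href is relative to the site root, so the host is prepended below
    link = t.a['href']
    print(previous_text)
    print('http://www.boa.aragon.es' + link)
    print('-' * 80)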

Prints:

ORDEN HAP/804/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo de los Departamentos de Industria, Competitividad y Desarrollo Empresarial y de Economía, Planificación y Empleo.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=1&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/805/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Departamento de Agricultura, Ganadería y Medio Ambiente.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=2&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN HAP/806/2020, de 17 de agosto, por la que se modifica la Relación de Puestos de Trabajo del Organismo Autónomo Instituto Aragonés de Servicios Sociales.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=3&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
ORDEN ECD/807/2020, de 24 de agosto, por la que se aprueba el expediente relativo al procedimiento selectivo de acceso al Cuerpo de Catedráticos de Música y Artes Escénicas.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=4&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------
RESOLUCIÓN de 28 de julio de 2020, de la Dirección General de Justicia, por la que se convocan a concurso de traslado plazas vacantes entre funcionarios de los Cuerpos y Escalas de Gestión Procesal y Administrativa, Tramitación Procesal y
Administrativa y Auxilio Judicial de la Administración de Justicia.
http://www.boa.aragon.es/cgi-bin/EBOA/BRSCGI?CMD=VERDOC&BASE=BOLE&PIECE=BOLE&DOCS=1-22&DOCR=5&SEC=FIRMA&RNG=200&SEPARADOR=&&PUBL=20200901
--------------------------------------------------------------------------------

...and so on.
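
As noted above, the HTML can come from Selenium instead of requests. A minimal sketch of that variant follows, assuming a WebDriver instance that has already loaded the page; `extract_entries` and `entries_from_driver` are hypothetical helper names, not part of either library. It uses `urljoin` rather than string concatenation to build the absolute link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_entries(html, base_url):
    """Return (preceding_text, absolute_link) pairs for every .archivos element."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for block in soup.select(".archivos"):
        text_node = block.find_previous(string=True)  # nearest text before the element
        anchor = block.find("a", href=True)           # first link inside the element
        if text_node is None or anchor is None:
            continue  # nothing to check, or nothing to click
        entries.append((text_node.strip(), urljoin(base_url, anchor["href"])))
    return entries


def entries_from_driver(driver):
    """Parse whatever page the given Selenium driver currently shows."""
    return extract_entries(driver.page_source, driver.current_url)
```

The returned text can then be checked in plain Python, and a wanted link opened with `driver.get(link)`, which stands in for clicking it.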
