
Web-scraping hidden hrefs using Python

I'm using Python to get all the possible hrefs from the following webpage:

http://www.congresovisible.org/proyectos-de-ley/

for example, these two:

href="ppor-medio-de-la-cual-se-dictan-medidas-para-defender-el-acceso-de-los-usuarios-del-sistema-de-salud-a-medicamentos-de-calidad-eficacia-y-seguridad-acceso-de-los-usuarios-del-sistema-de-salud-a-medicamentos/8683">

href="ppor-medio-del-cual-el-congreso-de-la-republica-facultado-por-el-numeral-17-del-articulo-150-de-la-constitucion-politica-de-colombia-y-en-aras-de-facilitar-la-paz-decreta-otorgar-amnistia-e-indulto-a-los-miembros-del-grupo-armado-organizado-al-margen-de-la-ley-farc-ep/8682">

and, at the end, have a list of all possible hrefs on that page.

However, clicking ver todos ("see all") reveals more hrefs. Yet if you check the page source, even after adding /#page=4 (or any other page number) to the URL, the hrefs stay the same; in fact, the page source doesn't change at all. How can I get all those hidden hrefs?

Note: I assume you are using Python 3+.

What happens is: when you click "See All", the page requests an API, takes the data, and dumps it into the view. This is all done via AJAX.

The hard and complicated way would be to use Selenium, but there is actually no need. With a little debugging in the browser, you can see where the page loads the data from.

The data comes from http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page=1 (this is page one). q is the search query and page selects which page, with 5 elements per page. You can request it via urllib or requests and parse the response with the json package into a dict, as sketched below.
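For example, here is a minimal sketch using the requests library. The top-level JSON layout is an assumption on my part; print the parsed object first to see what the real structure looks like.

import requests

# The User-Agent matters: without it the server returns 403 (see below).
url = "http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page=1"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
data = resp.json()  # parses the JSON body into a Python object

# Inspect the structure before relying on any specific keys.
print(data)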


A Simple Demonstration

I wanted to try it myself, and it seems the server we get the data from needs a User-Agent header to process the request; otherwise it simply returns 403 (Forbidden). I tried this on Python 3.5.1.

from urllib.request import urlopen, Request
import json

# Create headers as a dict to pass a User-Agent. I am using my own User-Agent here;
# you can use the same one or just google another.
# Without a User-Agent the server does not accept the request and returns 403.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48"
}

# Create a Request object; note that we pass the headers here.
req = Request("http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page=1", headers=headers)

# Get the response.
res = urlopen(req)

# res.read() returns bytes; we need to decode them to str
# before passing the result to json.loads.
data_b = res.read()
data_str = data_b.decode("utf-8")

# Now, this is the magic: parse the JSON string into a dict.
data = json.loads(data_str)

print(data)

# Now you can manipulate your data. :)
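To collect every href, you can loop over page until the API stops returning items. This is a rough sketch under stated assumptions: the "elements" and "url" keys below are hypothetical, so inspect a real response first to find the actual field names.

from urllib.request import urlopen, Request
import json

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48"
}
base = "http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page={}"

all_hrefs = []
page = 1
while True:
    req = Request(base.format(page), headers=headers)
    data = json.loads(urlopen(req).read().decode("utf-8"))
    items = data.get("elements", [])  # hypothetical key; check the real payload
    if not items:  # assume an empty page marks the end
        break
    all_hrefs.extend(item.get("url", "") for item in items)  # "url" is also a guess
    page += 1

print(len(all_hrefs), "hrefs collected")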

For Python 2.7

  • You can use urllib2. urllib2 is not split into submodules the way urllib is in Python 3, so all you have to do is from urllib2 import Request, urlopen; see the sketch below.
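A minimal Python 2.7 version of the same request (in Python 2, read() already returns a str, so no decode step is needed before json.loads):

from urllib2 import Request, urlopen
import json

# Same trick as above: the server rejects requests without a User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

req = Request("http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page=1",
              headers=headers)
data = json.loads(urlopen(req).read())
print(data)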
