
web-scraping hidden href using python

I'm using python to get all the possible hrefs from the following webpage:

http://www.congresovisible.org/proyectos-de-ley/

for example, these two:

href="ppor-medio-de-la-cual-se-dictan-medidas-para-defender-el-acceso-de-los-usuarios-del-sistema-de-salud-a-medicamentos-de-calidad-eficacia-y-seguridad-acceso-de-los-usuarios-del-sistema-de-salud-a-medicamentos/8683">

href="ppor-medio-del-cual-el-congreso-de-la-republica-facultado-por-el-numeral-17-del-articulo-150-de-la-constitucion-politica-de-colombia-y-en-aras-de-facilitar-la-paz-decreta-otorgar-amnistia-e-indulto-a-los-miembros-del-grupo-armado-organizado-al-margen-de-la-ley-farc-ep/8682">

and, at the end, have a list of all possible hrefs on that page.

However, by clicking ver todos ("see all"), there are more hrefs. But if you check the page source, even if you add /#page=4 or whatever page number to the url, the total hrefs remain the same (the page source doesn't actually change). How could I get all those hidden hrefs?

Prenote: I assume you use Python 3+.

What happens is: when you click "See All", the page requests an API, takes the data, and dumps it into the view. This is all an AJAX process.

The hard and complicated way is to use Selenium, but there is actually no need. With a little debugging in the browser, you can see where it loads the data from.

This is page one. q is probably the search query, and page is exactly which page; there are 5 elements per page. You can request it via urllib or requests and parse it with the json package into a dict.
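Once you can parse one page into a dict, collecting every href is just a matter of looping over page until the API returns no more items. The sketch below shows that loop; the JSON field names (elements, url) are assumptions for illustration only — inspect the real response to find the actual keys — and the network call is stubbed with canned data so the sketch runs offline.

```python
import json

# The real endpoint, for reference (a real fetch_page would request this URL
# with a User-Agent header, as shown in the demonstration below):
BASE = "http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page={}"

def fetch_page(page):
    # Stubbed response so the sketch runs offline.
    # In real use: json.loads(urlopen(Request(BASE.format(page), headers=headers)).read().decode("utf-8"))
    sample = {
        1: '{"elements": [{"url": "proyecto-a/1"}, {"url": "proyecto-b/2"}]}',
        2: '{"elements": [{"url": "proyecto-c/3"}]}',
    }
    return json.loads(sample.get(page, '{"elements": []}'))

def all_hrefs():
    hrefs, page = [], 1
    while True:
        data = fetch_page(page)
        items = data["elements"]  # hypothetical key name; check the real JSON
        if not items:             # empty page means we have seen everything
            break
        hrefs.extend(item["url"] for item in items)  # "url" is also assumed
        page += 1
    return hrefs

print(all_hrefs())
```

Stopping on the first empty page is the simplest termination condition; if the API reports a total count instead, you could compute the number of pages up front.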


A Simple Demonstration

I wanted to try it myself, and it seems the server we get the data from needs a User-Agent header; otherwise, it simply throws 403 (Forbidden). I am trying this on Python 3.5.1.

from urllib.request import urlopen, Request
import json

# Create headers as a dict to pass a User-Agent. I am using my own User-Agent here;
# you can use the same one or just google another.
# Without a User-Agent, the server rejects the request and returns 403.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48"
}

# Create a Request object. Note that we pass the headers here.
req = Request("http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page=1", headers=headers)

# Get a response.
res = urlopen(req)

# The response body is bytes; we need to decode it to str
# before passing it to json.loads.
data_b = res.read()
data_str = data_b.decode("utf-8")

# Now, this is the magic.
data = json.loads(data_str)

print(data)

# Now you can manipulate your data. :)

For Python 2.7

  • You can use urllib2. urllib2 is not separated into packages like urllib is in Python 3, so all you have to do is from urllib2 import Request, urlopen.
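If you want a single script that runs on both Python 2.7 and Python 3, a common pattern is to try the Python 3 import first and fall back to urllib2:

```python
try:
    # Python 3: Request and urlopen live in urllib.request
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2.7: the same names live in urllib2
    from urllib2 import Request, urlopen
```

After this block, the rest of the demonstration above works unchanged on either version (except that on Python 2.7, res.read() already returns a str, so the decode step is unnecessary).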
