Attempt to scrape search results from a site - Python

I needed to scrape the telephone numbers and email addresses from the following page using Python:

import requests
from bs4 import BeautifulSoup

url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
print(soup)

The problem is that what requests.get returns is not the HTML I need. I suppose the site uses JavaScript to render those results, but I'm not familiar with that since I'm just starting out with Python. I worked around it by copying the source of each result page into a separate text file and then extracting the emails with a regex, but I'm curious whether there is a simple way to access the data directly.
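One way to see why the request returns only the page shell is to split the URL. Everything after the `#` is a fragment, which is not sent to the server in an HTTP request; it is a client-side route that the page's JavaScript interprets after loading. A minimal sketch using only the standard library:

```python
from urllib.parse import urlsplit

url = "https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires"
parts = urlsplit(url)

# Only the scheme, host, path and query are part of the HTTP request.
print(parts.path)      # the path the server actually sees
print(parts.query)     # empty: the "?provincias=..." part sits inside the fragment
print(parts.fragment)  # client-side route, handled by the page's JavaScript
```

So the server receives a request for the root path only, and the search results have to come from somewhere else.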

The data you see on the page is loaded from an external URL via JavaScript. To get the data, you can use the requests and json modules, for example:

import json
import requests

api_url = "https://rmabackend.cultura.gob.ar/api/museos"

params = {
    "estado": "Publicado",
    "grupo": "Museo",
    "o": "p",
    "ordenar": "nombre_oficial_institucion",
    "page": 1,
    "page_size": "12",
    "provincias": "Buenos Aires",
}

while True:
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]:
        print(d["attributes"]["nombre-oficial-institucion"])

    # stop on the last page (>= also ends the loop if the API reports zero pages)
    if params["page"] >= data["meta"]["pagination"]["pages"]:
        break

    params["page"] += 1

Prints:

2 Museos, Bellas Artes y MAC
Archivo Histórico y Museo "Astillero Río Santiago" (ARS)
Archivo Histórico y Museo del Servicio Penitenciario Bonaerense
Archivo y Museo Historico Municipal Roberto T. Barili "Villa Mitre"
Asociación Casa Bruzzone
Biblioteca Popular y Museo "José Manuel Estrada"
Casa Museo "Haroldo Conti"
Casa Museo "Xul Solar" -  Tigre
Complejo Histórico y Museográfico "Dr. Alfredo Antonio Sabaté"


...and so on.
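The question asked for emails and phone numbers, and the attribute names that hold them are not shown in the output above. One hedged approach is to avoid guessing the key names: walk each record recursively and collect any string that looks like an email address. In the sketch below, `find_emails`, `EMAIL_RE` and the `sample` record (including its `contacto` and `email` keys) are hypothetical and only illustrate the idea; the uncommented `json.dumps` line in the loop above will show the real field names.

```python
import re

# Deliberately simple pattern; it will not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(obj):
    """Recursively collect email-like strings from nested dicts and lists."""
    found = []
    if isinstance(obj, dict):
        for value in obj.values():
            found.extend(find_emails(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_emails(item))
    elif isinstance(obj, str):
        found.extend(EMAIL_RE.findall(obj))
    return found


# Hypothetical record shaped like one entry of data["data"];
# the "contacto", "email" and "otros" keys are invented for illustration.
sample = {
    "attributes": {
        "nombre-oficial-institucion": "Museo de Ejemplo",
        "contacto": {
            "email": "info@example.org",
            "otros": ["Escribir a prensa@example.org"],
        },
    }
}

print(find_emails(sample))
```

Inside the loop above this could be called as `find_emails(d)` for each record. Phone numbers are left out on purpose, since their format varies too much for one generic pattern; reading them from the relevant attribute once its name is known is likely more reliable.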

The page uses AJAX to load its content. Using something like Selenium to drive a real browser lets all the JavaScript run, after which you can extract the rendered source:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()
url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires'

# navigate to the page
driver.get(url)
# wait until a link with text 'ficha' has loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'ficha')))
source = driver.page_source
soup = BeautifulSoup(source, features='lxml')
driver.quit()
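Once the rendered `source` is available, the regex approach mentioned in the question can be applied to it directly instead of to hand-copied text files. The following is a minimal sketch, not a robust parser: `emails_from_html`, `EMAIL_RE` and the `snippet` markup are hypothetical, and the pattern is intentionally simple.

```python
import html
import re

# Deliberately simple pattern; it will not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_from_html(page_source):
    """Return unique email-like strings, in order of first appearance."""
    # Decode HTML entities first, e.g. "&#64;" used in place of "@".
    text = html.unescape(page_source)
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical markup standing in for driver.page_source.
snippet = (
    '<a href="mailto:museo@example.org">museo@example.org</a> '
    "<span>otro&#64;example.com</span>"
)

print(emails_from_html(snippet))
```

With the Selenium code above this would be called as `emails_from_html(source)`. Note that only the first page of results is rendered at that point, so further pages would still need to be loaded separately.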
