
Web scraping with Python on a dynamic JavaScript website

I need to scrape all the articles, article titles, and paragraphs from this website: https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19

The problem is that whenever I try to select a div, h3, or p, nothing is returned (image attached).

import requests
from bs4 import BeautifulSoup


def parse_url(url):
    response = requests.get(url)
    content = response.content
    parsed_response = BeautifulSoup(content, "lxml")
    return parsed_response


url = "https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19"

soup = parse_url(url)


article = soup.find("div", {"class":"article-document"})

article

It seems the website is rendered with JavaScript, but I don't know how to get the content.
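A quick way to confirm the content is injected by JavaScript is to parse the raw HTML that requests receives and check whether the target container exists at all. A minimal sketch, using a stand-in HTML string that mimics the empty shell such pages typically serve:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML a plain requests.get() returns from a JS-rendered
# page: only an empty application container, no article markup.
static_html = "<html><body><div id='app'></div></body></html>"

soup = BeautifulSoup(static_html, "html.parser")
article = soup.find("div", {"class": "article-document"})
print(article)  # None -> the article is not in the static HTML
```

If the lookup returns None here but the element is visible in the browser's inspector, the data arrives later via JavaScript, so you need either the underlying API calls or a real browser.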

The website makes 3 API calls in order to get the data.
The code below makes the same calls and retrieves the data.

(In the browser, press F12 -> Network -> XHR to see the API calls.)

import requests

# Call 1: traceability metadata for the document
payload1 = {'language': 'ca', 'documentId': 680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard', data=payload1)
if r1.status_code == 200:
    print(r1.json())

print('------------------')

# Call 2: validity information for the document
payload2 = {'documentId': 680124, 'orderBy': 'DESC', 'language': 'ca', 'traceability': '02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument', data=payload2)
if r2.status_code == 200:
    print(r2.json())

print('------------------')

# Call 3: the document itself (title, articles, paragraphs)
payload3 = {'documentId': 680124, 'traceabilityStandard': '02', 'language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC', data=payload3)
if r3.status_code == 200:
    print(r3.json())
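Once the JSON arrives, you still have to walk it to pull out the title and paragraphs. The real documentPJC payload has to be inspected in the browser; the keys below (`title`, `articles`, `heading`, `paragraphs`) are placeholders for illustration, not the actual schema:

```python
import json

# Hypothetical response shape -- inspect the real documentPJC response in
# the Network tab and substitute the actual keys.
sample = json.loads("""
{
  "title": "Llei 19/2014",
  "articles": [
    {"heading": "Article 1", "paragraphs": ["First paragraph.", "Second paragraph."]}
  ]
}
""")

def flatten(doc):
    """Collect the title and every article heading/paragraph as plain lines."""
    lines = [doc["title"]]
    for art in doc.get("articles", []):
        lines.append(art["heading"])
        lines.extend(art.get("paragraphs", []))
    return lines

print(flatten(sample))
# ['Llei 19/2014', 'Article 1', 'First paragraph.', 'Second paragraph.']
```

Using `.get(..., [])` for the list-valued keys keeps the walker from crashing on documents that lack a section.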

You can use Selenium to automate browser interaction: it drives a real browser, so you can wait until the JavaScript components have loaded completely.

It also gives you the option of running headless Chrome (as in the example below).

The following script scrapes the title and all the paragraphs from the URL and saves them in a txt file.

import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = Chrome(options=chrome_options)

url = "https://portaljuridic.gencat.cat/eli/es-ct/l/2014/12/29/19"
driver.get(url)
time.sleep(5)  # wait for the JavaScript-rendered content to appear

title = driver.find_element(By.CSS_SELECTOR, ".titol-document").text
print(title)

# Skip the first child div (the heading) and keep the body paragraphs
paragraphs = driver.find_element(By.CSS_SELECTOR, "div#art-1").find_elements(By.CSS_SELECTOR, "div")[1:]

with open("article.txt", "w") as file:
    for paragraph in paragraphs:
        file.write(paragraph.text)

driver.quit()

You can adjust the time.sleep delay according to your network speed.
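A fixed sleep either wastes time or fails on slow connections. Selenium ships explicit waits for exactly this (WebDriverWait with expected_conditions); the underlying idea is just polling until a condition holds, which can be sketched as a generic helper (the toy predicate and timeout values here are illustrative):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Toy predicate standing in for "the element has rendered":
# becomes true on the third poll.
state = {"calls": 0}

def page_ready():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until(page_ready, timeout=5.0, interval=0.01))  # True
```

With Selenium itself you would pass a predicate that looks up the element, so the script continues the moment the content appears instead of always waiting 5 seconds.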

You can read more about Selenium in its documentation.

Also, as mentioned in the comments to the prior answer, this will automatically extract the content with all special characters already parsed.

If you don't want to use a browser-automation tool like Selenium or its counterparts (e.g. Puppeteer, Playwright), you can use solutions that offer JS rendering, such as some of the web scraping APIs recommended in this article.
