Accessing href link using BeautifulSoup
I'm trying to scrape the href of the first link titled "BACC B ET A COMPTABILITE CONSEIL". However, I can't seem to extract the href when I'm using BeautifulSoup. Could you please recommend a solution?
Here's the link to the URL: https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160
My code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')
a = soup.find('div', {'class': 'nom-entreprise'})
print(a)
Result:
None
The link is constructed dynamically with JavaScript. All you need is a number (the SIREN), which can be obtained with an Ajax query:
import json
import requests
# url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
api_url = "https://api.pappers.fr/v2/recherche"
payload = {
    "q": "B & A COMPTABILITE CONSEIL",  # <-- your search query
    "code_naf": "",
    "code_postal": "94160",  # <-- this is "ville" from the URL
    "api_token": "97a405f1664a83329a7d89ebf51dc227b90633c4ba4a2575",
    "precision": "standard",
    "bases": "entreprises,dirigeants,beneficiaires,documents,publications",
    "page": "1",
    "par_page": "20",
}
data = requests.get(api_url, params=payload).json()
# uncomment this to print all data (all details):
# print(json.dumps(data, indent=4))
print("https://www.pappers.fr/entreprise/" + data["resultats"][0]["siren"])
Prints:
https://www.pappers.fr/entreprise/378002208
Opening the link automatically redirects to:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
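Since the API can return several matches for a fuzzy query, it may be safer to pick the result whose name matches the company you're after instead of blindly taking the first one. A minimal sketch, run against a stubbed response (the field names `resultats`, `nom_entreprise`, and `siren` are taken from the code above; verify them against the real API response):

```python
# Stubbed response mimicking the shape of `data` from the request above
# (field names assumed from the answer's code; the real payload has more keys).
data = {
    "resultats": [
        {"nom_entreprise": "B & A CONSEIL", "siren": "123456789"},
        {"nom_entreprise": "BACC B ET A COMPTABILITE CONSEIL", "siren": "378002208"},
    ]
}

target = "BACC B ET A COMPTABILITE CONSEIL"
# Take the first result whose name matches exactly; fall back to result 0.
match = next(
    (r for r in data["resultats"] if r["nom_entreprise"] == target),
    data["resultats"][0],
)
print("https://www.pappers.fr/entreprise/" + match["siren"])
```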
The website is loaded dynamically, so requests alone can't see the rendered content. We can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium
Download the correct ChromeDriver from here.
To find the links you can use a CSS selector: a.gros-gros-nom
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
driver = webdriver.Chrome()
driver.get(url)
# Wait for the link to be visible on the page and save element to a variable `link`
link = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "a.gros-gros-nom"))
)
print(link.get_attribute("href"))
driver.quit()
Output:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
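Once Selenium has rendered the page, you don't have to stay in Selenium to extract data: you can hand `driver.page_source` to any HTML parser. A stdlib-only sketch using `html.parser` (the HTML fragment below is a stub based on the `a.gros-gros-nom` selector above, not the site's actual markup):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # `class` may hold several space-separated classes.
        if tag == "a" and self.css_class in attrs.get("class", "").split():
            self.links.append(attrs.get("href"))


# Stubbed fragment of rendered HTML; in practice feed driver.page_source here.
html = (
    '<div class="resultat">'
    '<a class="gros-gros-nom" '
    'href="/entreprise/bacc-b-et-a-comptabilite-conseil-378002208">'
    'BACC B ET A COMPTABILITE CONSEIL</a></div>'
)
parser = LinkExtractor("gros-gros-nom")
parser.feed(html)
print(parser.links[0])
```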