Unable to scrape different company names from a static webpage using the requests module
I've created a script to collect the different company names from this website using the requests module, but when I run it, it ends up with nothing. I looked for the company names in the page source and found them there, so they appear to be static.
import requests
from bs4 import BeautifulSoup
link = 'https://clutch.co/agencies/digital-marketing'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("h3.company_info > a"):
        print(item.text)
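Before switching tools, it helps to confirm that the selector itself is sound. A minimal sketch against a hand-written HTML snippet (hypothetical markup mimicking the page structure, not the live page source) shows the parsing logic works; the empty result therefore comes from Cloudflare blocking the request (checking `res.status_code` will typically show 403 rather than 200), not from the selector:

```python
from bs4 import BeautifulSoup

# Hand-written snippet mimicking the structure the selector targets.
sample = """
<h3 class="company_info"><a href="/profile/webfx">WebFX</a></h3>
<h3 class="company_info"><a href="/profile/socialseo">SocialSEO</a></h3>
"""

soup = BeautifulSoup(sample, "html.parser")
# Same CSS selector as in the question: anchors that are direct children
# of <h3 class="company_info"> elements.
names = [a.get_text(strip=True) for a in soup.select("h3.company_info > a")]
print(names)  # ['WebFX', 'SocialSEO']
```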
Since the website is protected by Cloudflare, there is a Python module called cloudscraper that attempts to bypass Cloudflare's anti-bot page.
Using that module, you can get the data you want.
For example:
import cloudscraper
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
scraper = cloudscraper.create_scraper()
source_html = scraper.get("https://clutch.co/agencies/digital-marketing").text
soup = BeautifulSoup(source_html, "lxml")
company_data = [
    [item.getText(strip=True), f"https://clutch.co{item['href']}"]
    for item in soup.select("h3.company_info > a")
]
df = pd.DataFrame(company_data, columns=["Company", "URL"])
print(tabulate(df, headers="keys", tablefmt="github", showindex=False))
This should print:
| Company | URL |
|----------------------------------|------------------------------------------------------------|
| WebFX | https://clutch.co/profile/webfx |
| Ignite Visibility | https://clutch.co/profile/ignite-visibility |
| SocialSEO | https://clutch.co/profile/socialseo |
| Lilo Social | https://clutch.co/profile/lilo-social |
| Favoured | https://clutch.co/profile/favoured |
| Power Digital | https://clutch.co/profile/power-digital |
| Belkins | https://clutch.co/profile/belkins |
| SmartSites | https://clutch.co/profile/smartsites |
| Straight North | https://clutch.co/profile/straight-north |
| Victorious | https://clutch.co/profile/victorious |
| Uplers | https://clutch.co/profile/uplers |
| Daniel Brian Advertising | https://clutch.co/profile/daniel-brian-advertising |
| Thrive Internet Marketing Agency | https://clutch.co/profile/thrive-internet-marketing-agency |
| Big Leap | https://clutch.co/profile/big-leap |
| Mad Fish Digital | https://clutch.co/profile/mad-fish-digital |
| Razor Rank | https://clutch.co/profile/razor-rank |
| Brolik | https://clutch.co/profile/brolik |
| Search Berg | https://clutch.co/profile/search-berg |
| Socialfix Media | https://clutch.co/profile/socialfix-media |
| Kanbar Digital, LLC | https://clutch.co/profile/kanbar-digital |
| NextLeft | https://clutch.co/profile/nextleft |
| Fruition | https://clutch.co/profile/fruition |
| Impactable | https://clutch.co/profile/impactable |
| Lets Tok | https://clutch.co/profile/lets-tok |
| Pyxl | https://clutch.co/profile/pyxl |
| Sagefrog Marketing Group | https://clutch.co/profile/sagefrog-marketing-group |
| Foreignerds INC. | https://clutch.co/profile/foreignerds |
| Social Driver | https://clutch.co/profile/social-driver |
| 3 Media Web | https://clutch.co/profile/3-media-web |
| Brand Vision | https://clutch.co/profile/brand-vision-1 |
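If you want to keep the scraped rows instead of only printing them, the same DataFrame can be written out with pandas' `to_csv`. A small sketch, using hypothetical sample rows in the same `[Company, URL]` shape as `company_data` above:

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped company_data.
rows = [
    ["WebFX", "https://clutch.co/profile/webfx"],
    ["SocialSEO", "https://clutch.co/profile/socialseo"],
]

df = pd.DataFrame(rows, columns=["Company", "URL"])
# index=False drops the row-number column so the file has only the two columns.
df.to_csv("companies.csv", index=False)
```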
Try this:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
link = 'https://clutch.co/agencies/digital-marketing'
driver = webdriver.Chrome()
# Go to the website
driver.get(link)
# Wait for the page to load
time.sleep(5)
# Get the page source
html = driver.page_source
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# Select the anchor inside each "company_info" heading and print its text
for item in soup.select("h3.company_info > a"):
print(item.text)
# Close the browser
driver.quit()