Unable to scrape different company names from a static webpage using the requests module

I've created a script to collect the different company names from this website using the requests module, but when I execute the script, it prints nothing. I looked for the company names in the page source and found that they are available there, so they seem to be static.

import requests
from bs4 import BeautifulSoup

link = 'https://clutch.co/agencies/digital-marketing'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("h3.company_info > a"):
        print(item.text)

The website is protected by Cloudflare, so a plain requests call most likely receives the anti-bot page instead of the listing. There's a Python module called cloudscraper that attempts to bypass Cloudflare's anti-bot page.

Using the module, you should be able to get the data you need.

For example:

import cloudscraper
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate

scraper = cloudscraper.create_scraper()
source_html = scraper.get("https://clutch.co/agencies/digital-marketing").text

soup = BeautifulSoup(source_html, "lxml")
company_data = [
    [item.getText(strip=True), f"https://clutch.co{item['href']}"]
    for item in soup.select("h3.company_info > a")
]

df = pd.DataFrame(company_data, columns=["Company", "URL"])
print(tabulate(df, headers="keys", tablefmt="github", showindex=False))

This should print:

| Company                          | URL                                                        |
|----------------------------------|------------------------------------------------------------|
| WebFX                            | https://clutch.co/profile/webfx                            |
| Ignite Visibility                | https://clutch.co/profile/ignite-visibility                |
| SocialSEO                        | https://clutch.co/profile/socialseo                        |
| Lilo Social                      | https://clutch.co/profile/lilo-social                      |
| Favoured                         | https://clutch.co/profile/favoured                         |
| Power Digital                    | https://clutch.co/profile/power-digital                    |
| Belkins                          | https://clutch.co/profile/belkins                          |
| SmartSites                       | https://clutch.co/profile/smartsites                       |
| Straight North                   | https://clutch.co/profile/straight-north                   |
| Victorious                       | https://clutch.co/profile/victorious                       |
| Uplers                           | https://clutch.co/profile/uplers                           |
| Daniel Brian Advertising         | https://clutch.co/profile/daniel-brian-advertising         |
| Thrive Internet Marketing Agency | https://clutch.co/profile/thrive-internet-marketing-agency |
| Big Leap                         | https://clutch.co/profile/big-leap                         |
| Mad Fish Digital                 | https://clutch.co/profile/mad-fish-digital                 |
| Razor Rank                       | https://clutch.co/profile/razor-rank                       |
| Brolik                           | https://clutch.co/profile/brolik                           |
| Search Berg                      | https://clutch.co/profile/search-berg                      |
| Socialfix Media                  | https://clutch.co/profile/socialfix-media                  |
| Kanbar Digital, LLC              | https://clutch.co/profile/kanbar-digital                   |
| NextLeft                         | https://clutch.co/profile/nextleft                         |
| Fruition                         | https://clutch.co/profile/fruition                         |
| Impactable                       | https://clutch.co/profile/impactable                       |
| Lets Tok                         | https://clutch.co/profile/lets-tok                         |
| Pyxl                             | https://clutch.co/profile/pyxl                             |
| Sagefrog Marketing Group         | https://clutch.co/profile/sagefrog-marketing-group         |
| Foreignerds INC.                 | https://clutch.co/profile/foreignerds                      |
| Social Driver                    | https://clutch.co/profile/social-driver                    |
| 3 Media Web                      | https://clutch.co/profile/3-media-web                      |
| Brand Vision                     | https://clutch.co/profile/brand-vision-1                   |
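
Before switching libraries, it can help to confirm that the empty result really comes from a Cloudflare block page and not from a wrong selector. The sketch below is a rough heuristic, not an official check: the function name `looks_like_cloudflare_challenge`, the status codes, and the body markers are all assumptions chosen for illustration, and the real challenge page may look different.

```python
from typing import Mapping

# Heuristic values only (assumptions): the real block page changes over time.
CHALLENGE_STATUS_CODES = {403, 429, 503}
CHALLENGE_BODY_MARKERS = ("just a moment", "cf-chl", "attention required")


def looks_like_cloudflare_challenge(
    status_code: int, headers: Mapping[str, str], body: str
) -> bool:
    """Guess whether a response is a Cloudflare block page, not real content."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in lowered.get("server", "") or "cf-ray" in lowered
    )
    if not served_by_cloudflare:
        return False
    body_lower = body.lower()
    has_marker = any(marker in body_lower for marker in CHALLENGE_BODY_MARKERS)
    return status_code in CHALLENGE_STATUS_CODES or has_marker


# Made-up responses for demonstration; no network access is needed.
blocked = looks_like_cloudflare_challenge(
    403, {"Server": "cloudflare"}, "<title>Just a moment...</title>"
)
normal = looks_like_cloudflare_challenge(
    200,
    {"Server": "cloudflare"},
    "<h3 class='company_info'><a href='/profile/webfx'>WebFX</a></h3>",
)
print(blocked, normal)
```

With the script from the question, the call would be `looks_like_cloudflare_challenge(res.status_code, res.headers, res.text)`; a `True` result suggests the selector is fine and the request itself was blocked.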

Try this:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

link = 'https://clutch.co/agencies/digital-marketing'


driver = webdriver.Chrome()

# Go to the website
driver.get(link)

# Wait for the page to load
time.sleep(5)

# Get the page source
html = driver.page_source

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# Select the link inside each h3.company_info heading and print its text
for item in soup.select("h3.company_info > a"):
    print(item.text)

# Close the browser
driver.quit()
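
The fixed `time.sleep(5)` above either wastes time or turns out too short on a slow connection. A common alternative is to poll until the elements appear; Selenium provides `WebDriverWait` for this. The sketch below illustrates the same idea with the standard library only, so it runs without a browser. The helper name `wait_for` and the fake page function are hypothetical and used only for demonstration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_for(condition: Callable[[], T], timeout: float = 10.0, interval: float = 0.25) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose company links only show up on the third poll.
polls = {"count": 0}


def fake_page_ready():
    polls["count"] += 1
    return ["WebFX"] if polls["count"] >= 3 else []


names = wait_for(fake_page_ready, timeout=2.0, interval=0.01)
print(names)
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "h3.company_info > a")`, with `By` imported from `selenium.webdriver.common.by`.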

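Note: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.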

 