
How to scrape multiple urls in one Python script with Beautiful soup

I have the following Python script that works well for what I need and gives me the output I want. However, I have another url (https://www.website2/page- ) that I'd like to add to the script. Currently I manually swap the urls and run them as separate scripts, but I'd like to do it all in one go. Is this possible?

PS: the required script for each site is identical other than the url. TIA!

import itertools
import random
import time
import typing
import signal

import requests
from bs4 import BeautifulSoup

from model import Model, Data

RUNNING = True


def sigint_handler(*args: typing.Any) -> None:
    global RUNNING
    print("Signal received, exiting gracefully ...")
    RUNNING = False


def scrape(url: str, model: Model, session: requests.Session, headers: typing.Dict[str, str]) -> None:
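    # Walk the site page by page (base url + page number) until stopped.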
    for page in itertools.count(1):
        if not RUNNING:
            break
        req = session.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="container"):
            title = li.find('h2').text
            price = li.find('p', class_="price-text").text
            print(f"Title: {title}, Price: {price}")
            model.insert_or_update(Data(address=title, price=price))

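        # Random 1-5 s pause between pages so requests aren't fired back-to-back.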
        time.sleep(random.randint(1, 5))


def run() -> None:
    url = "https://www.website1/page-"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    }
    model, session = Model(), requests.Session()
    scrape(url, model, session, headers)


if __name__ == '__main__':
    signal.signal(signal.SIGINT, sigint_handler)
    run()

If you want it to parse another site as well, you can keep both urls in one comma-separated string and split it before scraping:

# Added a comma: 2 urls can be parsed in one run, so you don't have to rewrite your code again
import itertools
import random
import time
import typing
import signal

import requests
from bs4 import BeautifulSoup

from model import Model, Data

RUNNING = True


def sigint_handler(*args: typing.Any) -> None:
    global RUNNING
    print("Signal received, exiting gracefully ...")
    RUNNING = False


def scrape(url: str, model: Model, session: requests.Session, headers: typing.Dict[str, str]) -> None:
    for page in itertools.count(1):
        if not RUNNING:
            break
        req = session.get(f"{url}{page}", headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')

        for li in soup.find_all('li', class_="container"):
            title = li.find('h2').text
            price = li.find('p', class_="price-text").text
            print(f"Title: {title}, Price: {price}")
            model.insert_or_update(Data(address=title, price=price))

        time.sleep(random.randint(1, 5))


def run() -> None:
    # Two sites in one comma-separated string. scrape() expects a single
    # base url, so split on the comma and strip the whitespace first.
    urls = "https://www.website1/page- , https://www.website2/page- "
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    }
    model, session = Model(), requests.Session()
    for url in (u.strip() for u in urls.split(",")):
        scrape(url, model, session, headers)


if __name__ == '__main__':
    signal.signal(signal.SIGINT, sigint_handler)
    run()
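
Splitting on the comma keeps scrape() unchanged: both sites are crawled by the same paging loop, one after the other, in a single run.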

I think you just need an array and a for loop:

def run() -> None:
    urls = ["https://www.website1/page-", "https://www.website2/page-"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    }
    model, session = Model(), requests.Session()
    for url in urls:
        scrape(url, model, session, headers)
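
If the sites shouldn't have to wait for each other, the same loop can also be run concurrently. Here is a minimal sketch using the standard-library concurrent.futures, reusing scrape, Model, and the headers dict from above; it assumes Model.insert_or_update is safe to call from multiple threads, and gives each worker its own requests.Session since sessions are not guaranteed to be thread-safe:

from concurrent.futures import ThreadPoolExecutor

import requests

from model import Model


def run() -> None:
    urls = ["https://www.website1/page-", "https://www.website2/page-"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
    }
    model = Model()  # assumed shareable across threads
    # One worker per url; each worker gets its own Session.
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        futures = [pool.submit(scrape, url, model, requests.Session(), headers)
                   for url in urls]
        for future in futures:
            future.result()  # re-raises any exception from the worker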
