
How to Crawl Multiple pages/cities from a website (BeautifulSoup, Requests, Python3)

I'm wondering how to crawl multiple different pages/cities from one website using BeautifulSoup/Requests without having to repeat my code over and over.

Here is my code right now:

import urllib.request
import json
from bs4 import BeautifulSoup

Region = "Marrakech"
Spider = 20

def trade_spider(max_pages):
    page = -1

    partner_ID = 2
    location_ID = 25

    already_printed = set()

    while page <= max_pages:
        page += 1
        # Query the site's JSON search endpoint for this region and page
        response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))
        jsondata = json.loads(response.read().decode("utf-8"))
        # 'activities' holds an HTML fragment; strip stray quoting before parsing
        g_data = jsondata['activities'].strip("'<>()[]\"` ").replace('\'', '\"')
        soup = BeautifulSoup(g_data, "html.parser")

        # Each activity is rendered as an <article class="activity-card">
        hallo = soup.find_all("article", {"class": "activity-card"})

        for item in hallo:
            # Record each activity title we have not seen yet
            headers = item.find_all("h3", {"class": "activity-card"})
            for header in headers:
                header_final = header.text.strip()
                if header_final not in already_printed:
                    already_printed.add(header_final)

            # Record each deep link we have not seen yet
            deeplinks = item.find_all("a", {"class": "activity"})
            for t in set(t.get("href") for t in deeplinks):
                deeplink_final = t
                if deeplink_final not in already_printed:
                    already_printed.add(deeplink_final)

            end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
            print(end_final)

trade_spider(int(Spider))

Ideally, my goal is to crawl multiple cities/regions from one particular website.

Right now I can do this manually, by repeating my code for each city and then concatenating the resulting dataframes, but that seems very unpythonic. Does anyone have a faster way, or any advice?

I tried to add a second city to my Region variable, but it does not work:

Region = "Marrakech","London"
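The comma here builds a tuple, so str(Region) interpolates the whole tuple's repr into the query string instead of a single city name. A quick check illustrates this:

Region = "Marrakech", "London"   # the comma creates a tuple
print(str(Region))               # -> ('Marrakech', 'London') -- not a usable query value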

Can anyone help me with that? Any feedback is appreciated.

Region = ["Marrakech","London"]

Put your while loop inside a for loop over the regions, and reset page to -1 at the start of each region:

for reg in Region:
    page = -1

Then use reg instead of Region when building the request URL.

import urllib.request
import json
from bs4 import BeautifulSoup

Region = ["Marrakech", "London"]
Spider = 20

def trade_spider(max_pages):

    partner_ID = 2
    location_ID = 25
    already_printed = set()
    for reg in Region:
        page = -1  
        while page <= max_pages:
            page += 1
            # Use reg (the current region), not the whole Region list
            response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) + "&page=" + str(page))
            jsondata = json.loads(response.read().decode("utf-8"))
            # 'activities' holds an HTML fragment; strip stray quoting before parsing
            g_data = jsondata['activities'].strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(g_data, "html.parser")

            hallo = soup.find_all("article", {"class": "activity-card"})

            for item in hallo:
                headers = item.find_all("h3", {"class": "activity-card"})
                for header in headers:
                    header_final = header.text.strip()
                    if header_final not in already_printed:
                        already_printed.add(header_final)

                deeplinks = item.find_all("a", {"class": "activity"})
                for t in set(t.get("href") for t in deeplinks):
                    deeplink_final = t
                    if deeplink_final not in already_printed:
                        already_printed.add(deeplink_final)

                end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                print(end_final)

trade_spider(int(Spider))
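If you would rather not rely on module-level globals, a more reusable variant is to pass the region list into the function. The sketch below is an illustration only: it assumes the same jsox.com endpoint and the same markup classes, switches the dedup key to the (header, link) pair, and uses the requests library from the question's title to build the query string:

import requests
from bs4 import BeautifulSoup

def trade_spider(regions, max_pages):
    seen = set()
    for reg in regions:
        for page in range(max_pages + 1):
            # requests URL-encodes the query parameters for us
            resp = requests.get("http://www.jsox.com/s/search.json",
                                params={"q": reg, "page": page})
            markup = resp.json()['activities'].strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(markup, "html.parser")
            for item in soup.find_all("article", {"class": "activity-card"}):
                header = item.find("h3", {"class": "activity-card"})
                link = item.find("a", {"class": "activity"})
                if header is None or link is None:
                    continue
                pair = (header.text.strip(), link.get("href"))
                if pair not in seen:
                    seen.add(pair)
                    print("Header: " + pair[0] + " | Deeplink: " + str(pair[1]))

trade_spider(["Marrakech", "London"], 20)

Printing header and link as one pair also sidesteps a subtle problem in the original loop, where end_final always combines the last header found with the last deeplink found for each card.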
