
How to crawl multiple pages/cities from a website (BeautifulSoup, Requests, Python 3)

I'm wondering how to crawl multiple different pages/cities from one website using BeautifulSoup/Requests without having to repeat my code over and over.

Here is my code right now:

Region = "Marrakech"
Spider = 20

def trade_spider(max_pages):
    page = -1

    partner_ID = 2
    location_ID = 25

    already_printed = set()

    while page <= max_pages:
        page += 1
        response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) +"&page=" + str(page))
        jsondata = json.loads(response.read().decode("utf-8"))
        format = (jsondata['activities'])
        g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
        soup = BeautifulSoup(g_data, "html.parser")

        hallo = soup.find_all("article", {"class": "activity-card"})

        for item in hallo:
            headers = item.find_all("h3", {"class": "activity-card"})
            for header in headers:
                header_final = header.text.strip()
                if header_final not in already_printed:
                    already_printed.add(header_final)

            deeplinks = item.find_all("a", {"class": "activity"})
            for t in set(t.get("href") for t in deeplinks):
                deeplink_final = t
                if deeplink_final not in already_printed:
                    already_printed.add(deeplink_final)

            end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
            print(end_final)

trade_spider(int(Spider))

Ideally, my goal is to crawl multiple cities/regions from one particular website.

I could do this manually by repeating my code for each city, crawling each one individually, and then concatenating the resulting dataframes, but that seems very unpythonic. Does anyone have a faster way or any advice?

I tried adding a second city to my Region variable, but it does not work:

Region = "Marrakech","London"

Can anyone help me with that? Any feedback is appreciated.

Region = ["Marrakech","London"]

Put your while loop inside the for loop, then reset pages to -1.

for reg in Region:
    page = -1

and replace Region with reg when building the request URL.

Region = ["Marrakech","London"]    
Spider = 20

def trade_spider(max_pages):

    partner_ID = 2
    location_ID = 25
    already_printed = set()
    for reg in Region:
        page = -1  
        while page <= max_pages:
            page += 1
            response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) +"&page=" + str(page))
            jsondata = json.loads(response.read().decode("utf-8"))
            format = (jsondata['activities'])
            g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(g_data, "html.parser")

            hallo = soup.find_all("article", {"class": "activity-card"})

            for item in hallo:
                headers = item.find_all("h3", {"class": "activity-card"})
                for header in headers:
                    header_final = header.text.strip()
                    if header_final not in already_printed:
                        already_printed.add(header_final)

                deeplinks = item.find_all("a", {"class": "activity"})
                for t in set(t.get("href") for t in deeplinks):
                    deeplink_final = t
                    if deeplink_final not in already_printed:
                        already_printed.add(deeplink_final)

                end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                print(end_final)

trade_spider(int(Spider))
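
Since your title mentions Requests, here is an alternative sketch of the same spider using requests instead of urllib, with the region list and page count passed in as function arguments. It keeps the jsox.com endpoint and CSS classes from your code, but pairs each header with its deeplink instead of tracking them in a single set; this assumes the h3 and a tags appear in matching order inside each card, and I haven't run it against the live site:

import requests
from bs4 import BeautifulSoup

def trade_spider(regions, max_pages):
    already_printed = set()
    for reg in regions:
        for page in range(max_pages + 1):
            # requests encodes the query string, so spaces in city names are safe
            resp = requests.get("http://www.jsox.com/s/search.json",
                                params={"q": reg, "page": page})
            jsondata = resp.json()
            # same cleanup as in the urllib version above
            g_data = jsondata['activities'].strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(g_data, "html.parser")

            for item in soup.find_all("article", {"class": "activity-card"}):
                headers = [h.text.strip() for h in item.find_all("h3", {"class": "activity-card"})]
                links = [a.get("href") for a in item.find_all("a", {"class": "activity"})]
                for header, link in zip(headers, links):
                    if (header, link) not in already_printed:
                        already_printed.add((header, link))
                        print("Header: " + header + " | " + "Deeplink: " + link)

trade_spider(["Marrakech", "London"], 20)

Passing the regions and the page count as parameters keeps the function reusable, and the list of cities is defined in one place only.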
