
how to scrape multiple pages in python with bs4

I have been scraping the website "https://www.zaubacorp.com/company-list" but am not able to scrape the email ID from the links in the table. I need to scrape the Name, Email, and Directors from each link in the table. Can anyone please help me resolve this? I am a newbie to web scraping with Python, Beautiful Soup, and requests.

Thank you, Dieksha

# Scraping the website
# Import a library to query a website
import requests

# Specify the URL
companies_list = "https://www.zaubacorp.com/company-list"
link = requests.get(companies_list).text

# Import BeautifulSoup and parse the page
from bs4 import BeautifulSoup
soup = BeautifulSoup(link, 'lxml')

# Print the href of every link in the table
all_links = soup.table.find_all('a')
for link in all_links:
    print(link.get("href"))

Well let's break down the website and see what we can do.

First off, I can see that this website is paginated. Pagination can range from something as simple as a GET parameter in the URL that selects the page, to an AJAX call that refills the table with new data when you click Next. Clicking through to the next few pages shows we are in luck: this site selects the page with a simple parameter in the URL.

Our URL for requesting the webpage to scrape is going to be

https://www.zaubacorp.com/company-list/p-<page_num>-company.html

We are going to write a bit of code that fills `page_num` with values ranging from 1 to the last page you want to scrape. In this case, we do not need to do anything special to determine the last page of the table, since we can skip to the end and find that it is page 13,333. This means we would be making 13,333 page requests to this website to collect all of its data.

As for gathering the data from the website we will need to find the table that holds the information and then iteratively select the elements to pull out the information.

In this case we can actually "cheat" a little, since there appears to be only a single tbody on the page. We want to iterate over all of its rows and pull the text out of each cell. I'm going to go ahead and write the sample.

import requests
import bs4

def get_url(page_num):
    page_num = str(page_num)
    return "https://www.zaubacorp.com/company-list/p-" + page_num + "-company.html"

def scrape_row(tr):
    return [td.text for td in tr.find_all("td")]

def scrape_table(table):
    table_data = []
    for tr in table.find_all("tr"):
        table_data.append(scrape_row(tr))
    return table_data

def scrape_page(page_num):
    req = requests.get(get_url(page_num))
    soup = bs4.BeautifulSoup(req.content, "lxml")
    data = scrape_table(soup)
    for line in data:
        print(line)

for i in range(1, 3):
    scrape_page(i)

This code will scrape the first two pages of the website and by just changing the for loop range you can get all 13,333 pages. From here you should be able to just modify the printout logic to save to a CSV.
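For the CSV step, here is a minimal sketch using Python's `csv` module. The helper name `save_rows`, the output filename, and the header columns are my assumptions; the `sample` rows are placeholder data standing in for what `scrape_table` returns:

```python
import csv

def save_rows(rows, path):
    # Write each scraped table row as one line of a CSV file.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header is assumed; adjust to match the site's actual columns.
        writer.writerow(["CIN", "Company Name", "Status", "Details"])
        writer.writerows(rows)

# Placeholder rows standing in for scrape_table()'s output.
sample = [
    ["U12345MH2020PTC000001", "EXAMPLE PRIVATE LIMITED", "Active", "details"],
]
save_rows(sample, "companies.csv")
```

You would call `save_rows(data, "companies.csv")` inside `scrape_page` in place of the print loop (opening the file in append mode, or collecting all pages first).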

Output of running the code
