
Web Scraping - Extract list of text from multiple pages

I want to extract a list of names from multiple pages of a website. The website has over 200 pages and I want to save all the names to a text file. I have written some code, but it's giving me an IndexError.

CODE:

    import requests
    from bs4 import BeautifulSoup as bs
    
    URL = 'https://hamariweb.com/names/muslim/boy/page-'
    
    #for page in range(1, 203):
        
    page = 1
    req = requests.get(URL + str(page))
    soup = bs(req.text, 'html.parser')
    row = soup.find('div', attrs={'class', 'row'})
    books = row.find_all('a')
    
    for book in books:
        data = book.find_all('b')[0].get_text()
        print(data)

OUTPUT:

    Aabbaz
    Aabid
    Aabideen
    Aabinus
    Aadam
    Aadeel
    Aadil
    Aadroop
    Aafandi
    Aafaq
    Aaki
    Aakif
    Aalah
    Aalam
    Aalamgeer
    Aalif
    Traceback (most recent call last):
      File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
        data = book.find_all('b')[0].get_text()
    IndexError: list index out of range
    >>>

I suggest changing your parser to html5lib (pip install html5lib); I just think it's better. Second, it's better NOT to do a .find() on your soup object directly, since tags and classes can be duplicated elsewhere in the page, so you might end up pulling data from an HTML tag where your data isn't even there. It's better to inspect the elements you want first and work out which block of code they sit in; it makes scraping easier and avoids errors like this one.
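
To illustrate that point, here is a minimal sketch with made-up HTML (not taken from the actual site): .find() only ever returns the first match, so if another element shares the tag and class you are targeting, you can silently scrape the wrong block.

    from bs4 import BeautifulSoup as bs

    # Made-up HTML: two divs share the class "row"; only the second holds a name
    html = """
    <div class="row">sidebar stuff, no names here</div>
    <div class="row"><a href="#"><b>Aabbaz</b></a></div>
    """
    soup = bs(html, 'html.parser')

    first = soup.find('div', class_='row')
    print(first.get_text())              # -> sidebar stuff, no names here (the wrong block)

    rows = soup.find_all('div', class_='row')
    print(rows[1].find('b').get_text())  # -> Aabbaz (the block we actually wanted)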

What I did there is inspect the elements first and FIND the BLOCK of code that holds the data. I found that all the names you are trying to get are inside a div whose class is mb-40 content-box. Luckily that class is UNIQUE, with no other elements sharing the same tag and class, so we can just .find() it directly.

(screenshot of the inspected element showing the mb-40 content-box div)

Then the value of trs is simply the tr tags inside that block.

(Note also that those <tr> tags sit inside a <table> tag, but the good thing is they are the only <tr> tags that exist, so there is no risk of another <table> tag with the same class value getting in the way.)

Those <tr> tags contain the names you want to get. You may ask why there is a [1:]: it starts at index 1 so the table header from the website is NOT included.

Then just loop through those tr tags and get the text. As for your error, an IndexError simply means you are trying to access an item of a .find_all() result list that is out of bounds. That happens when no such data is found, and it can also happen when you do a .find() DIRECTLY on your soup variable, because there can be tags with the same class values BUT with different content inside them. So you expect to scrape a particular part of the website, but you are actually scraping a different part, which is why you get no data and wonder what went wrong.

    import requests
    from bs4 import BeautifulSoup as bs

    URL = 'https://hamariweb.com/names/muslim/boy/page-'

    #for page in range(1, 203):

    page = 1
    req = requests.get(URL + str(page))
    soup = bs(req.content, 'html5lib')

    # The unique container that holds the names table
    div_container = soup.find('div', class_='mb-40 content-box')

    # All bottom-divider rows except the first one (the table header)
    trs = div_container.find_all("tr", class_="bottom-divider")[1:]

    for tr in trs:
        text = tr.find("td").find("a").text
        print(text)
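
If you want all 200+ pages rather than just the first one, here is a minimal sketch that builds on the snippet above and writes every name to a file. The page count of 203 is taken from your commented-out loop, and the output filename names.txt is just an assumption; change both as needed.

    import requests
    from bs4 import BeautifulSoup as bs

    URL = 'https://hamariweb.com/names/muslim/boy/page-'

    with open('names.txt', 'w', encoding='utf-8') as f:  # assumed output filename
        for page in range(1, 203):  # 203 taken from the commented-out loop above
            req = requests.get(URL + str(page))
            soup = bs(req.content, 'html5lib')
            div_container = soup.find('div', class_='mb-40 content-box')
            if div_container is None:
                continue  # be defensive in case a page has no names table
            for tr in div_container.find_all('tr', class_='bottom-divider')[1:]:
                f.write(tr.find('td').find('a').text.strip() + '\n')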

The IndexError you're having means that, in this case, the link you found doesn't contain a <b> tag with the information you are looking for.

You can simply wrap that piece of code in a try-except clause.

    for book in books:
        try:
            data = book.find_all('b')[0].get_text()
            print(data)
            # Add data to the all_titles list
            all_titles.append(data)
        except IndexError:
            pass  # There was no element available

This will catch the error and move on without breaking the code.

Below I have also added some extra lines to save your titles to a text file. Take a look at the inline comments.

    import requests
    from bs4 import BeautifulSoup as bs

    URL = 'https://hamariweb.com/names/muslim/boy/page-'
    # This is where your titles will be saved. Change as needed.
    PATH = '/tmp/title_file.txt'

    page = 1
    req = requests.get(URL + str(page))
    soup = bs(req.text, 'html.parser')
    row = soup.find('div', attrs={'class': 'row'})
    books = row.find_all('a')

    # Here your titles will be stored before writing to file
    all_titles = []

    for book in books:
        try:
            # Add strip() to clean up the input
            data = book.find_all('b')[0].get_text().strip()
            print(data)
            # Add data to the all_titles list
            all_titles.append(data)
        except IndexError:
            pass  # There was no element available

    # Open path to write
    with open(PATH, 'w') as f:
        # Write all titles on a new line
        f.write('\n'.join(all_titles))

The reason you're getting the error is that some of the links don't contain a <b> tag at all.
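
As an alternative to the try/except shown above, here is a minimal sketch that checks for the <b> tag before touching it (the books variable is the same one from the question's code):

    for book in books:
        b_tag = book.find('b')  # returns None when the link has no <b> tag
        if b_tag is None:
            continue  # skip links without a name instead of raising IndexError
        print(b_tag.get_text().strip())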

Try this code to request each page and save the data to a file:

    import requests
    from bs4 import BeautifulSoup as bs

    MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
    URL = "https://hamariweb.com/names/muslim/boy/page-{}"

    with open("output.txt", "a", encoding="utf-8") as f:

        for page in range(203):
            # Page 0 is the landing page; the rest follow the page-N pattern
            if page == 0:
                req = requests.get(MAIN_URL)
            else:
                req = requests.get(URL.format(page))

            soup = bs(req.text, "html.parser")
            print(f"page # {page}, Getting: {req.url}")

            # First cell of every bottom-divider row except the header row
            book_name = (
                tag.get_text(strip=True)
                for tag in soup.select(
                    "tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
                )
            )
            f.write("\n".join(book_name) + "\n")
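
A note on that selector, in case it looks opaque: tr.bottom-divider:nth-of-type(n+2) keeps every matching row except the first one, which is the table header, and td:nth-of-type(1) then takes only the first cell of each row, where the name link sits. Also note the file is opened in append mode ("a"), so re-running the script adds to any existing output.txt; open it with "w" instead if you want a fresh file on each run.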
