

Inconsistent results while web scraping using Beautiful Soup

I am having an inconsistent issue that is driving me crazy. I am trying to scrape data about rental units. Let's say we have a webpage with 42 ads: the code works just fine for only 19 ads, then it returns:

Traceback (most recent call last):
  File "main.py", line 53, in <module>
    title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'
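
For context, find() returns None when no element matches, and chaining .div on None raises exactly this error. A minimal reproduction, using a made-up HTML snippet that lacks the expected title container:

from bs4 import BeautifulSoup

# Made-up page without the title container the scraper expects
html = "<html><body><div class='listing'></div></body></html>"
page = BeautifulSoup(html, "html.parser")

node = page.find('div', {'class': 'realEstateTitle-1440881021'})
print(node)  # None, because nothing matched
title = node.div.h1.text  # AttributeError: 'NoneType' object has no attribute 'div'

So the real question is why the ad page sometimes does not contain that div.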

If you start the code from a different ad number, say 5, it still processes the first 19 ads and then raises the same error!

Here is a minimal code example that shows the issue I am having. Please note that this code prints the HTML for a functioning ad and also for the one with the error. What is printed is very different.

Run the code, then change the value of i to see the results.

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback


page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})

# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True

# Loop through ads
i = 1  # change to start from a different ad (don't put zero)

for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i += 1

    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)

    single_ad = uReq(ad_link)

    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()

    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})

    # Print one functioning ad html
    if print_functioning_ad:
        print_functioning_ad = False
        print(page_soup2)

    print('real state title type', type(real_state_title))

    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except Exception:
        print(traceback.format_exc())
        print(page_soup2)
        break

    print('____________________________________________________________')

Edit 1:

In my simple example I want to loop through each ad in the provided link, open it, and get the title. In my actual code I am getting not only the title but also every other piece of info about the ad, so I need to load the data from the link associated with every ad. My code actually does that, but for an unknown reason this happens only after 19 ads, regardless of which one I start with. This is driving me nuts!

To get all pages from the URL, you can use the following example:

import requests
from bs4 import BeautifulSoup


page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

page = 1
while True:
    print("Page {}...".format(page))
    print("-" * 80)
    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")

    for i, a in enumerate(soup.select("a.title"), 1):
        print(i, a.get_text(strip=True))

    next_url = soup.select_one('a[title="Next"]')
    if not next_url:
        break

    print()

    page += 1
    page_url = "https://www.kijiji.ca" + next_url["href"]

Prints:

Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502

...

Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ
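
If you also want to open every ad and read its details (as described in Edit 1 above), the same pagination selectors can be combined with a per-ad request. A minimal sketch, matching the h1 tag instead of the hashed class name from the question (generated class names like that can change between site deployments):

import requests
from bs4 import BeautifulSoup

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# Fetch the listing page and follow each ad link
listing = BeautifulSoup(requests.get(page_url).content, "html.parser")
for a in listing.select("a.title"):
    ad_url = "https://www.kijiji.ca" + a["href"]
    ad_page = BeautifulSoup(requests.get(ad_url).content, "html.parser")
    h1 = ad_page.find("h1")  # guard: may be missing on a blocked or expired page
    if h1:
        print(h1.get_text(strip=True))
    else:
        print(f"No title found for {ad_url}")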

I think I figured out the problem here. It seems like you can't make a lot of requests in a short period of time, so I added a try/except block that sleeps for 80 seconds and retries the same ad when this error occurs, and that fixed my problem!

You may want to change the sleep period to a different value, depending on the website you are scraping.

Here is the modified code:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback
import time


page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# opens the connection and downloads html page from url
uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})

# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True

# Loop through ads
i = 1  # change to start from a different ad (don't put zero)

for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i = i + 1

    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)

    # Retry the same ad after sleeping, instead of skipping it, when the
    # title container is missing (which seems to happen when requests
    # come too quickly)
    while True:
        single_ad = uReq(ad_link)

        # parses html into a soup data structure to traverse html
        page_soup2 = soup(single_ad.read(), "html.parser")
        single_ad.close()

        # Title
        real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})

        try:
            title = real_state_title.div.h1.text.strip()
            print(title)
            break
        except AttributeError:
            print(traceback.format_exc())
            t = 80
            print(f'----------------------------Sleep for {t} seconds!')
            time.sleep(t)

    print('____________________________________________________________')
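
As a variation on the fixed 80-second sleep, the wait can grow with each failed attempt, so short throttling windows recover quickly while longer ones still get a chance. A sketch (the function name, retry count, and delays are illustrative, not tuned for Kijiji):

import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_ad_title(ad_link, max_retries=5, base_delay=20):
    # Fetch one ad page, backing off when the title container is missing
    for attempt in range(1, max_retries + 1):
        with urlopen(ad_link) as resp:
            page = BeautifulSoup(resp.read(), "html.parser")
        container = page.find('div', {'class': 'realEstateTitle-1440881021'})
        if container is not None:
            return container.div.h1.text.strip()
        # Likely throttled: wait longer on each attempt (20s, 40s, 60s, ...)
        time.sleep(base_delay * attempt)
    return None  # give up after max_retries attempts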
