
Getting AttributeError: 'NoneType' object has no attribute 'text' (web-scraping)

This is my case study in web scraping. In the final piece of code I ran into the problem "'NoneType' object has no attribute 'text'", so I tried to fix it with the getattr function, but that didn't work.

'''

import requests
from bs4 import BeautifulSoup

url = 'https://www.birdsnest.com.au/womens/dresses'

source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')

'''

'''

productlist = soup.find_all('div', id='items')

'''

'''

productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(url + link['href'])
print(len(productlinks))

'''

'''

productlinks = []
for x in range(1, 28):
    source = requests.get(f'https://www.birdsnest.com.au/womens/dresses?_lh=1&page={x}')
    soup = BeautifulSoup(source.content, 'lxml')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])
print(productlinks)

'''

'''

for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')

    name = soup.find('h1',class_='item-heading__name').text.strip()
    price = soup.find('p',class_='item-heading__price').text.strip()
    feature = soup.find('div',class_='tab-accordion__content active').text.strip()

    sum = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(sum)

'''

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-d4d46558690d> in <module>()
      3     soup = BeautifulSoup(source.content, 'lxml')
      4 
----> 5     name = soup.find('h1',class_='item-heading__name').text.strip()
      6     price = soup.find('p',class_='item-heading__price').text.strip()
      7     feature = soup.find('div',class_='tab-accordion__content active').text.strip()

AttributeError: 'NoneType' object has no attribute 'text'


So I tried to fix it with this approach, but it didn't work.

for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')

    name = getattr(soup.find('h1', class_='item-heading__name'), 'text', None)
    price = getattr(soup.find('p', class_='item-heading__price'), 'text', None)
    feature = getattr(soup.find('div', class_='tab-accordion__content active'), 'text', None)

    sum = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(sum)

Here is the output. It only shows None:

{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}

First, always turn off JavaScript for the page you are scraping. You'll then see that the tag classes have changed, and those are the ones you want to target.
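
If you want to confirm this from Python rather than the browser, one quick check is to list every tag that carries an itemprop attribute (a diagnostic sketch under the same requests + BeautifulSoup setup; the exact markup may have changed since this was written):

import requests
from bs4 import BeautifulSoup

# Browser-like user-agent, same as in the full script below, in case the
# site rejects the default requests user-agent.
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}

source = requests.get('https://www.birdsnest.com.au/womens/dresses', headers=headers)
soup = BeautifulSoup(source.content, 'html.parser')

# Print every tag that has an itemprop attribute - this is what reveals that
# the product data sits in itemprop="brand"/"name"/"price" rather than in the
# item-heading__* classes the question targets.
for tag in soup.find_all(attrs={"itemprop": True}):
    print(tag.name, tag.get("itemprop"), tag.get_text(strip=True)[:40])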

Also, when looping through the pages, don't forget that Python's range() excludes its stop value. That is, range(1, 28) will stop at page 27.
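
For example:

print(list(range(1, 5)))       # [1, 2, 3, 4] - the stop value 5 is excluded
print(list(range(1, 28))[-1])  # 27, so covering all 28 pages needs range(1, 29)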

Here's how I would go about it:

import json

import requests
from bs4 import BeautifulSoup


# Cookies and a browser user-agent copied from a real session, so the site
# serves the full static page instead of rejecting the request.
cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    # Collect the stripped text of every <tag> whose itemprop equals attr_value.
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):  # 28 pages; range() excludes the stop value, hence 29
    print(f"Scraping data from page {page}...")

    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')

    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")

    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

print(f"{all_pages}\nFound: {len(all_pages)} dresses.")

with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)
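
One thing to be aware of in the zip-based pairing above: zip() stops at its shortest input, so if a page ever exposes, say, two names but only one price, the unmatched item is silently dropped instead of raising an error. A tiny illustration:

names = ["Prissy Dress", "Dandelion Dress"]
prices = ["$189.95"]             # one price missing
print(list(zip(names, prices)))  # [('Prissy Dress', '$189.95')] - the second dress is dropped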

This gets you a JSON file with all the dresses:

    {
        "brand": "boho bird",
        "name": "Prissy Dress",
        "price": "$189.95"
    },
    {
        "brand": "boho bird",
        "name": "Dandelion Dress",
        "price": "$139.95"
    },
    {
        "brand": "Lula Soul",
        "name": "Dandelion Dress",
        "price": "$179.95"
    },
    {
        "brand": "Honeysuckle Beach",
        "name": "Cotton V-Neck A-Line Splice Dress",
        "price": "$149.95"
    },
    {
        "brand": "Honeysuckle Beach",
        "name": "Lenny Pinafore",
        "price": "$139.95"
    },
and so on for the remaining pages ...
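
If you want to work with the scraped data later, reading the file back is symmetric (a minimal usage sketch, assuming the all_the_dresses2.json file written above exists):

import json

with open("all_the_dresses2.json") as jf:
    dresses = json.load(jf)

print(f"Loaded {len(dresses)} dresses; first: {dresses[0]['name']} at {dresses[0]['price']}")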
