Getting AttributeError: 'NoneType' object has no attribute 'text' (web-scraping)
This is my case study about web scraping. I ran into a problem in the final code, 'NoneType' object has no attribute 'text', so I tried to fix it with the getattr function, but it didn't work.
```
import requests
from bs4 import BeautifulSoup

url = 'https://www.birdsnest.com.au/womens/dresses'
source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')

productlist = soup.find_all('div', id='items')

productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(url + link['href'])
print(len(productlinks))

productlinks = []
for x in range(1, 28):
    source = requests.get(f'https://www.birdsnest.com.au/womens/dresses?_lh=1&page={x}')
    soup = BeautifulSoup(source.content, 'lxml')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])
print(productlinks)

for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')
    name = soup.find('h1', class_='item-heading__name').text.strip()
    price = soup.find('p', class_='item-heading__price').text.strip()
    feature = soup.find('div', class_='tab-accordion__content active').text.strip()
    sum = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(sum)
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-d4d46558690d> in <module>()
      3     soup = BeautifulSoup(source.content, 'lxml')
      4 
----> 5     name = soup.find('h1',class_='item-heading__name').text.strip()
      6     price = soup.find('p',class_='item-heading__price').text.strip()
      7     feature = soup.find('div',class_='tab-accordion__content active').text.strip()

AttributeError: 'NoneType' object has no attribute 'text'
---------------------------------------------------------------------------
```
So I tried to fix it with this method, but it didn't work:
```
for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')
    name = getattr(soup.find('h1', class_='item-heading__name'), 'text', None)
    price = getattr(soup.find('p', class_='item-heading__price'), 'text', None)
    feature = getattr(soup.find('div', class_='tab-accordion__content active'), 'text', None)
    sum = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(sum)
```
This is the output. It shows only None values:
```
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
```
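Note that getattr is behaving exactly as documented here: find() returns None when no tag matches, and getattr(None, 'text', None) hands back the default instead of raising, so the error is masked rather than fixed. A minimal illustration with a toy document that, like the real response, lacks the targeted tag:

```
from bs4 import BeautifulSoup

# A toy document that, like the real response, has no matching markup.
soup = BeautifulSoup("<html><body><p>no product markup here</p></body></html>", "html.parser")

tag = soup.find('h1', class_='item-heading__name')
print(tag)                          # None -- nothing matched

# getattr(None, 'text', None) returns the default instead of raising,
# which is why every field in the output dict is None.
print(getattr(tag, 'text', None))   # None
```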
First of all, always turn JS off for the page you're scraping. Then you'll realize that the tag classes change, and those are the ones you want to target. Also, when looping through the pages, don't forget that Python's range() stop value is not inclusive. Meaning, range(1, 28) will stop on page 27.
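Here is a minimal diagnostic sketch showing how to check what a plain (JS-free) request actually returns for a listing page. It assumes the site responds to a bare request; if it doesn't, add the headers and cookies from the script below. The class-based selector from the question is expected to find nothing here, while itemprop-tagged elements should be present:

```
import requests
from bs4 import BeautifulSoup

# requests never executes JavaScript, so this is exactly the HTML the scraper sees.
url = "https://www.birdsnest.com.au/womens/dresses?page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The class targeted in the question is likely absent from the server-rendered HTML...
print(soup.find('h1', class_='item-heading__name'))    # None

# ...but itemprop-tagged elements are there, so those are the ones to target.
print(len(soup.find_all("h2", {"itemprop": "name"})))  # count of product names on the page

# And the range() point: the stop value is excluded, so 29 covers pages 1 through 28.
print(list(range(1, 29))[-1])                          # 28
```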
Here's how I would go about it:
```
import json

import requests
from bs4 import BeautifulSoup

cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):
    print(f"Scraping data from page {page}...")
    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')

    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")

    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

print(f"{all_pages}\nFound: {len(all_pages)} dresses.")

with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)
```
This gets you a JSON file with all the dresses:
```
{
    "brand": "boho bird",
    "name": "Prissy Dress",
    "price": "$189.95"
},
{
    "brand": "boho bird",
    "name": "Dandelion Dress",
    "price": "$139.95"
},
{
    "brand": "Lula Soul",
    "name": "Dandelion Dress",
    "price": "$179.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Cotton V-Neck A-Line Splice Dress",
    "price": "$149.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Lenny Pinafore",
    "price": "$139.95"
},
```
and so on for the next 28 pages ...
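If it helps, here is a minimal sketch (assuming the script above has already written all_the_dresses2.json) to load the file back and sanity-check the result:

```
import json

# Read back the file written by the scraper above and sanity-check it.
with open("all_the_dresses2.json") as jf:
    dresses = json.load(jf)

print(f"Loaded {len(dresses)} dresses")
print(dresses[0])  # e.g. {'brand': 'boho bird', 'name': 'Prissy Dress', 'price': '$189.95'}
```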