Getting AttributeError: 'NoneType' object has no attribute 'text' (web-scraping)
This is my case study about web scraping. I ran into a problem in the final code, 'NoneType' object has no attribute 'text', so I tried to fix it with the getattr function, but it didn't work.
```
import requests
from bs4 import BeautifulSoup

url = 'https://www.birdsnest.com.au/womens/dresses'
source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')

productlist = soup.find_all('div', id='items')

productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(url + link['href'])
print(len(productlinks))

productlinks = []
for x in range(1, 28):
    source = requests.get(f'https://www.birdsnest.com.au/womens/dresses?_lh=1&page={x}')
    soup = BeautifulSoup(source.content, 'lxml')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])
print(productlinks)

for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')
    name = soup.find('h1', class_='item-heading__name').text.strip()
    price = soup.find('p', class_='item-heading__price').text.strip()
    feature = soup.find('div', class_='tab-accordion__content active').text.strip()
    sum = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(sum)
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-d4d46558690d> in <module>()
      3     soup = BeautifulSoup(source.content, 'lxml')
      4 
----> 5     name = soup.find('h1',class_='item-heading__name').text.strip()
      6     price = soup.find('p',class_='item-heading__price').text.strip()
      7     feature = soup.find('div',class_='tab-accordion__content active').text.strip()

AttributeError: 'NoneType' object has no attribute 'text'
---------------------------------------------------------------------------
```
So I tried to fix it with this method, but it didn't work:
```
for link in productlinks:
    source = requests.get(link)
    soup = BeautifulSoup(source.content, 'lxml')
    name = getattr(soup.find('h1', class_='item-heading__name'), 'text', None)
    price = getattr(soup.find('p', class_='item-heading__price'), 'text', None)
    feature = getattr(soup.find('div', class_='tab-accordion__content active'), 'text', None)
    sum = {
        'name': name,
        'price': price,
        'feature': feature
    }
    print(sum)
```
This is the output. It shows only None values:
```
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
{'name': None, 'price': None, 'feature': None}
```
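Note that getattr is behaving exactly as documented here: find() returns None when no tag matches, and getattr(None, 'text', None) hands back the default instead of raising, so the error is masked rather than fixed. A minimal illustration with a toy document that, like the real response, lacks the targeted tag:

```
from bs4 import BeautifulSoup

# A toy document that, like the real response, has no matching markup.
soup = BeautifulSoup("<html><body><p>no product markup here</p></body></html>", "html.parser")

tag = soup.find('h1', class_='item-heading__name')
print(tag)                          # None -- nothing matched

# getattr(None, 'text', None) returns the default instead of raising,
# which is why every field in the output dict is None.
print(getattr(tag, 'text', None))   # None
```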
First of all, always turn JS off for the page you're scraping. Then you'll realize that the tag classes change, and those are the ones you want to target. Also, when looping through the pages, don't forget that Python's range() stop value is not inclusive. Meaning, range(1, 28) will stop on page 27.
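Here is a minimal diagnostic sketch showing how to check what a plain (JS-free) request actually returns for a listing page. It assumes the site responds to a bare request; if it doesn't, add the headers and cookies from the script below. The class-based selector from the question is expected to find nothing here, while itemprop-tagged elements should be present:

```
import requests
from bs4 import BeautifulSoup

# requests never executes JavaScript, so this is exactly the HTML the scraper sees.
url = "https://www.birdsnest.com.au/womens/dresses?page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The class targeted in the question is likely absent from the server-rendered HTML...
print(soup.find('h1', class_='item-heading__name'))    # None

# ...but itemprop-tagged elements are there, so those are the ones to target.
print(len(soup.find_all("h2", {"itemprop": "name"})))  # count of product names on the page

# And the range() point: the stop value is excluded, so 29 covers pages 1 through 28.
print(list(range(1, 29))[-1])                          # 28
```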
Here's how I would go about it:
```
import json

import requests
from bs4 import BeautifulSoup

cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):
    print(f"Scraping data from page {page}...")
    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')

    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")

    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

print(f"{all_pages}\nFound: {len(all_pages)} dresses.")

with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)
```
This gets you a JSON file with all the dresses:
```
{
    "brand": "boho bird",
    "name": "Prissy Dress",
    "price": "$189.95"
},
{
    "brand": "boho bird",
    "name": "Dandelion Dress",
    "price": "$139.95"
},
{
    "brand": "Lula Soul",
    "name": "Dandelion Dress",
    "price": "$179.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Cotton V-Neck A-Line Splice Dress",
    "price": "$149.95"
},
{
    "brand": "Honeysuckle Beach",
    "name": "Lenny Pinafore",
    "price": "$139.95"
},
```
and so on for the next 28 pages ...
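If it helps, here is a minimal sketch (assuming the script above has already written all_the_dresses2.json) to load the file back and sanity-check the result:

```
import json

# Read back the file written by the scraper above and sanity-check it.
with open("all_the_dresses2.json") as jf:
    dresses = json.load(jf)

print(f"Loaded {len(dresses)} dresses")
print(dresses[0])  # e.g. {'brand': 'boho bird', 'name': 'Prissy Dress', 'price': '$189.95'}
```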