Scraping a website with BeautifulSoup

Question

I'm getting an AttributeError while scraping.

import urllib2
from bs4 import BeautifulSoup

quote_page ='https://www.bloomberg.com/quote/SPX:IND'
page = urllib2.urlopen(quote_page)

soup = BeautifulSoup(page,'html.parser')

name_box = soup.find('h1', attires ={'class': 'name'})

name = name_box.text.strip()
print name

Traceback (most recent call last):
  File "word1.py", line 11, in <module>
    name = name_box.text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

Viveks-MacBook-Pro:py vivek$

Answer 1

When you do this:

print(name_box)

you will get:

 None
Traceback (most recent call last):
  File "C:/Users/devsurya/python/demo programs/b4s.py", line 13, in <module>
    name = name_box.text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

And when you do this:

print(soup)    # prints odd-looking HTML and CSS that contains the following message

We've detected unusual activity from your computer network

So the site is serving a block page rather than the quote page. Also, soup.find('h1', attires ={'class': 'name'}) should be soup.find('h1', {'class': 'companyName__99a4824b'}).
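To see both cases without hitting the site, here is a minimal offline sketch. The two HTML strings and the helper name extract_name are made up for illustration; only the class name comes from the answer above. It shows that find() returns None when nothing matches, so checking for None before touching .text avoids the AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two responses discussed above.
BLOCKED_HTML = (
    "<html><body><p>We've detected unusual activity "
    "from your computer network</p></body></html>"
)
QUOTE_HTML = (
    '<html><body><h1 class="companyName__99a4824b"> Example Index </h1>'
    "</body></html>"
)


def extract_name(html, class_name):
    """Return the stripped <h1> text, or None when no matching tag exists."""
    soup = BeautifulSoup(html, "html.parser")
    name_box = soup.find("h1", attrs={"class": class_name})
    if name_box is None:
        # Nothing matched: wrong class name, or a block page was served.
        return None
    return name_box.text.strip()


print(extract_name(BLOCKED_HTML, "companyName__99a4824b"))
print(extract_name(QUOTE_HTML, "companyName__99a4824b"))
```

If the helper returns None for the live page, printing the soup as shown above is a quick way to tell a block page from a changed class name.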

Answer 2

Assuming you want the company name, I would use requests, and a couple of headers are required (you will need to test whether this keeps working consistently over time). I use a CSS attribute = value selector to get the appropriate element, with the starts-with operator ^ in case the value is dynamic, i.e. I assume the class always starts with the constant string companyName. This makes it more versatile for other requests.

import requests
from bs4 import BeautifulSoup as bs

quote_page ='https://www.bloomberg.com/quote/SPX:IND'
page = requests.get(quote_page, headers = {'User-Agent':'Mozilla/5.0', 'accept-language':'en-US,en;q=0.9'})
soup = bs(page.content,'lxml')
name_box = soup.select_one('[class^=companyName]')
name = name_box.text.strip()
print(name)
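The starts-with selector can be tried offline too. This is only a sketch: the markup, the second suffix, and the helper name select_name are invented, on the assumption that only the part after companyName changes between page builds.

```python
from bs4 import BeautifulSoup

# Made-up markup: the same heading with two different generated suffixes.
SAMPLES = [
    '<h1 class="companyName__99a4824b">Example Index</h1>',
    '<h1 class="companyName__0f3c1d7e">Example Index</h1>',
]


def select_name(html):
    """Return the text of the first element whose class attribute starts with companyName."""
    soup = BeautifulSoup(html, "html.parser")
    name_box = soup.select_one("[class^=companyName]")
    return name_box.text.strip() if name_box is not None else None


names = [select_name(html) for html in SAMPLES]
print(names)
```

Note that ^= compares against the start of the whole class attribute value, so it would miss an element that lists another class before companyName; [class*=companyName] is a looser alternative in that case.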

2 solutions

Solution 1
1 2019-08-07 20:33:43

Solution 2
0 2019-08-07 20:35:53
