Scrape html data using beautifulsoup and Python
I am trying to scrape school names from the following url: https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1 . I want to scrape 10 pages, hence the for loop. I have never used beautifulsoup before and the documentation hasn't solved my problem. Ultimately, I want to scrape the `<h2>` tags, since that's where the school names reside. Below is the small amount of code I have. Any help would be extremely helpful! Thanks!
import bs4 as bs
import requests

numbers = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
names = []
for number in numbers:
    resp = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=' + number)
    soup = bs.BeautifulSoup(resp.text, "lxml")
    # Note: the class name must be 'search-results', not '"search-results"'
    search_results = soup.find('div', {'class': 'search-results'})
    if search_results:
        for school in search_results.find_all('h2'):
            names.append(school.text)
            print(school.text)
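The parsing step in the loop above can be exercised offline against a stand-in snippet. The sample HTML below is invented for illustration; the "search-results" class and the <h2> tags are taken from the code above, and the built-in html.parser is used so lxml isn't needed for this check:

```python
import bs4 as bs

# Invented sample markup mirroring the assumed page structure:
# school names in <h2> tags inside a div with class "search-results".
html = """
<div class="search-results">
  <a href="#"><h2>Carmel High School</h2></a>
  <a href="#"><h2>Zionsville Community High School</h2></a>
</div>
"""

# html.parser avoids the lxml dependency for an offline sanity check.
soup = bs.BeautifulSoup(html, "html.parser")
results = soup.find("div", {"class": "search-results"})
names = [h2.get_text(strip=True) for h2 in results.find_all("h2")]
print(names)  # ['Carmel High School', 'Zionsville Community High School']
```

If this works but the live site returns nothing, the problem is the HTTP response (or markup that differs from this assumed structure), not the parsing.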
Try this with passing the headers. Using https://curl.trillworks.com/ as a helper, I get:
import requests

headers = {
    'authority': 'fonts.gstatic.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'sec-fetch-dest': 'font',
    'accept': '*/*',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-user': '?1',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '_pxhd=120bcbd3ded2e33c1496a0ff505f52a169b1f9c1db59a881c1cd00495b9442ee:62dfdf81-5341-11ea-95d7-e144631f0943; xid=6fef7398-e61d-46d2-be72-ee8e8fecc13d; navigation=%7B%22location%22%3A%7s%22%3A%7B%22colleges%22%3A%22%2Fs%2Findiana%2F%22%2C%22graduate-schools%22%3A%22%2Fs%2Findiana%2F%22%2C%22k12%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-live%22%3A%22%2Fs%2Findiana%2F%22%2C%22places-to-work%22%3A%22%2Fs%2Findiana%2F%22%7D%7D; experiments=%5E%5E%5E%24%5D; recentlyViewed=entityHistory%7CsearchHistory%7CentityName%7CIndiana%7CentityGuid%7Cad8b4b4c-f8d2-4015-8b22-c0f002a720bb%7CentityType%7CState%7CentityFragment%7Cindiana%5E%5E%5E%240%7C%40%5D%7C1%7C%40%242%7C3%7C4%7C5%7C6%7C7%7C8%7C9%5D%5D%5D; hintSeenLately=second_hint',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
    'Sec-Fetch-Dest': 'image',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'no-cors',
    'Referer': 'https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1',
    'Accept-Language': 'en-US,en;q=0.9',
    'x-client-data': 'CI+2yQEIorbJAQjBtskBCKmdygEIy67KAQi8sMoBCJa1ygEIm7XKAQjstcoBCI66ygEIsL3KARirpMoB',
    'referer': 'https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,600,700',
    'origin': 'https://www.niche.com',
    'Origin': 'https://www.niche.com',
}

params = (
    ('page', '1'),
)

response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/', headers=headers, params=params)

# NB. Original query string below. It seems impossible to parse and
# reproduce query strings 100% accurately so the one below is given
# in case the reproduced version is not "correct".
# response = requests.get('https://www.niche.com/k12/search/best-public-high-schools/s/indiana/?page=1', headers=headers)
This gives me a 200 now and not a 403. The above headers are verbose of course (I copied this from my browser); you could probably use trial-and-error to see which headers are actually required (I'm guessing it's only a handful) to guarantee a 200 OK.
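That trial-and-error can be automated by trying progressively larger subsets of the copied headers until the site returns 200. A sketch of the subset generator, exercised offline with a toy header dict (whether a small subset actually satisfies this particular site is an assumption):

```python
from itertools import combinations

def header_subsets(headers, max_size=3):
    """Yield candidate header dicts, smallest subsets first, so the
    first subset that yields a 200 is a (near-)minimal one."""
    keys = list(headers)
    for size in range(1, max_size + 1):
        for combo in combinations(keys, size):
            yield {k: headers[k] for k in combo}

# Offline demonstration with a toy header dict.
toy = {'User-Agent': 'UA', 'Referer': 'R', 'Cookie': 'C'}
candidates = list(header_subsets(toy, max_size=2))
print(len(candidates))  # 3 single-header subsets + 3 pairs = 6
```

Against the real site you would call requests.get(url, headers=subset) for each candidate and stop at the first response with status_code == 200.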
The webpage you are trying to scrape has CAPTCHA, which makes it difficult to collect data. Take a look at this link:
https://sqa.stackexchange.com/questions/17022/how-to-fill-captcha-using-test-automation