
Python - BeautifulSoup fails to web-scrape bloomberg info upon frequent requests

I was trying to use BeautifulSoup to get the sector, industry and sub-industry of a company from Bloomberg using the code below:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Randomise the User-Agent header on each run
headers = {'User-Agent': UserAgent().random}

response = requests.get('https://www.bloomberg.com/profile/company/VRTU:US', headers=headers)
parser = BeautifulSoup(response.content, 'html.parser')

# The matching divs appear in page order: sector, industry, sub-industry
values = parser.findAll('div', class_='infoTableItemValue__e188b0cb')
sector = values[0].text
industry = values[1].text
sub_industry = values[2].text

The code runs fine for a single stock. But when I put it in a loop to extract a list of stocks, Bloomberg blocks my IP and returns a blocked page instead of the content:

Bloomberg - Are you a robot?

..............

document.getElementById("block_uuid").innerText = "Block reference ID: " + window._pxUuid;

Even when I used fake_useragent, Bloomberg still blocked my IP. Is there anything I can do to extract a list of stocks from Bloomberg?

You need to use proxies so that each request comes from a different IP. The best way I have found is to use Tor; the Python library stem can be used to control it. Tor builds circuits that relay your request, so the website sees the IP of a Tor exit relay rather than yours. You can build new circuits for every request, and thus each request comes from a different IP. I have done something similar with Selenium and Twitter: I destroy all circuits and create a new one for each request.
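A minimal sketch of that circuit-rotation idea using stem and requests (requests needs the SOCKS extra, i.e. `pip install requests[socks] stem`). The ports, control password, and ticker list below are assumptions for illustration, not values from the question:

```python
import time
import requests

# Assumptions for this sketch: a local Tor client listening on
# SocksPort 9050 and ControlPort 9051, with the control password
# "my_password" configured in torrc.
TOR_PROXIES = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

def renew_tor_circuit(password='my_password', control_port=9051):
    """Signal Tor to build fresh circuits so the next request exits from a new relay."""
    # stem is imported here so the rest of the sketch works without it installed
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
        # Tor rate-limits NEWNYM; wait until a new circuit is actually usable
        time.sleep(controller.get_newnym_wait())

def fetch(url, headers=None):
    """GET a URL through the local Tor SOCKS proxy."""
    return requests.get(url, headers=headers, proxies=TOR_PROXIES, timeout=30)

if __name__ == '__main__':
    for ticker in ['VRTU:US', 'IBM:US']:  # hypothetical ticker list
        renew_tor_circuit()  # new exit IP for every request
        r = fetch('https://www.bloomberg.com/profile/company/' + ticker,
                  headers={'User-Agent': 'Mozilla/5.0'})
        print(ticker, r.status_code)
```

Note that `socks5h://` (rather than `socks5://`) makes requests resolve DNS through Tor as well, so lookups do not leak your IP, and `get_newnym_wait()` respects Tor's built-in rate limit on circuit rebuilds.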

