
Problem extracting data from Bloomberg using bs4

I am using the code below to extract text from a Bloomberg article:

import requests
from bs4 import BeautifulSoup

url = 'https://www.bloomberg.com/news/articles/2020-01-19/welcome-to-peak-decade-from-globalization-to-central-banks'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

# Collect the text of every <p> tag on the page
p_tags = soup.find_all('p')
sent_list = []
for p in p_tags:
    if p.string:
        sent_list.append(p.string)

sent = ' '.join(sent_list)

print(sent)

The output I get is:

To continue, please click the box below to let us know you're not a robot.

Is there any way I can get around this issue and extract the text from the website?

You got a captcha. The Bloomberg site is very strict about crawlers.

A second important note: the site is behind a paywall, so you can only see the full text of a few pages.
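As a hedged aside (not something the answer guarantees will work): one thing commonly tried is sending browser-like request headers so the request looks less like a bare script. The header values below are illustrative assumptions, and given the captcha and paywall described above, the response may still be the robot-check page rather than the article.

import requests
from bs4 import BeautifulSoup

url = 'https://www.bloomberg.com/news/articles/2020-01-19/welcome-to-peak-decade-from-globalization-to-central-banks'

# Browser-like headers; these values are illustrative and not guaranteed to bypass the captcha.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/80.0.3987.122 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.9',
}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# If the captcha page still comes back, the <p> tags will contain the robot-check text instead of the article.
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
print(' '.join(paragraphs))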
