
How to Extract Data from Tags Using Beautiful Soup

I'm trying to retrieve data from a website. My code is as follows:

import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=weekly'
html = urlopen(url)

soup = BeautifulSoup(html)

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # finds desired data by looking up <td> elements with class names

    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text
    previous = tr.find('td', {'class': 'ec_td_previous'}).text
    time = tr.find('td', {'class': 'ec_td_time'}).text
    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')

    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, importance, currency, event, actual, forecast, previous))

The first few records in the result set are as follows:

05:00   High    EUR CPI (YoY)                                   1.3%        1.3%        1.3%    
10:00   High    USD Pending Home Sales (MoM)                    1.5%        0.7%        -0.7%   
21:45   High    CNY Caixin Manufacturing PMI                    51.1        50.4        50.4    
00:30   High    AUD RBA Interest Rate Decision                  1.50%       1.50%       1.50%   
00:30   High    AUD RBA Rate Statement                                                          
03:55   High    EUR German Manufacturing PMI                    58.1        58.3        58.3    
03:55   High    EUR German Unemployment Change                  -9K         -5K         6K      

I'm now trying to retrieve similar data from the following website:

https://www.fxstreet.com/economic-calendar

To do so, I revised the above-mentioned code as follows:

import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'https://www.fxstreet.com/economic-calendar'
html = urlopen(url)

soup = BeautifulSoup(html)


for tr in soup.find_all('tr', {'class': re.compile('fxst-tr-event fxst-oddRow  fxit-eventrow fxst-evenRow ')}):
    # finds desired data by looking up <div> elements with class names

    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': ' fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
#    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')

    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, currency, event, actual, forecast, previous))

This code does not return any results (presumably because I'm referencing incorrect tags and/or classes). Does anyone see where my error is?

Thanks!

The calendar on that page is created dynamically by JavaScript, and urllib2 only downloads the initial HTML, so the rows you are looking for never reach Beautiful Soup. Use Selenium with ChromeDriver or PhantomJS to render the page first. I also don't think a regex makes much sense here: you can pass the class names as a list instead, and a row matches if it has any of them. Below is an example using those tools with the lxml parser:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.fxstreet.com/economic-calendar'

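driver = webdriver.Chrome()  # launches Chrome through ChromeDriver
driver.get(url)
html = driver.page_source    # the page source as rendered by the browser
driver.quit()                # close the browser once the source is captured
soup = BeautifulSoup(html, 'lxml')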

for tr in soup.find_all('tr', {'class': ['fxst-tr-event', 'fxst-oddRow', 'fxit-eventrow', 'fxst-evenRow', 'fxs_cal_nextEvent']}):
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': 'fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text

    print(time, currency, event, actual, forecast, previous)

Note that lxml is a separate library that has to be installed. You can handle multiple classes with the standard html.parser as well, but it's not as intuitive in my opinion. This code prints:

14:00 
CAD                                     14:00 None 59.2 
61.6                                    
14:00 
CAD                                     14:00 52.9  
63.9                                    
17:00 
USD                                     17:00 765 
...
...
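
To illustrate the point about class lists versus a regex, here is a minimal sketch on a small piece of made-up markup (the HTML below is hypothetical and only borrows a few class names from the code above; it is not the real page). It uses the standard html.parser so it runs without lxml. As far as I understand Beautiful Soup's matching, a list matches a tag that has any of the listed classes, while a regex is tried against each class and against the whole attribute string, which is why a pattern listing every class name with spaces finds nothing.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used above.
html = """
<table>
  <tr class="fxst-tr-event fxst-oddRow"><td>odd row</td></tr>
  <tr class="fxst-tr-event fxst-evenRow"><td>even row</td></tr>
  <tr class="fxst-dateRow"><td>date header</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A list matches every row that carries at least one of the listed classes.
by_list = soup.find_all("tr", {"class": ["fxst-oddRow", "fxst-evenRow"]})

# A regex that targets a single class name also works, as in the first script.
by_token_regex = soup.find_all("tr", {"class": re.compile(r"fxst-(odd|even)Row")})

# A regex that spells out all class names separated by spaces needs that exact
# character sequence in the attribute, which none of these rows has.
long_pattern = re.compile("fxst-tr-event fxst-oddRow  fxit-eventrow fxst-evenRow ")
by_long_regex = soup.find_all("tr", {"class": long_pattern})

print(len(by_list), len(by_token_regex), len(by_long_regex))
```

Even with a working selector, the original urllib2 version would still come back empty on the real page, because the rows are not in the HTML that urllib2 downloads.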

I haven't altered any of the variables because I'm not really sure what you want them to be, so you will probably want to adjust them and format the output further.
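
For the output formatting, one possible approach is to collapse the stray whitespace and newlines in each cell before printing, then reuse the column widths from the question's format string. This is only a sketch: `clean` and `format_row` are helper names I made up, and the sample values are invented rather than taken from the site.

```python
def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def format_row(time, currency, event, actual, forecast, previous):
    """Return one tab-separated line with the column widths used in the question."""
    fields = [clean(f) for f in (time, currency, event, actual, forecast, previous)]
    return u"{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}".format(*fields)


# Invented sample values, padded the way scraped cell text often is.
row = format_row("14:00\n", " CAD ", "  Example event\n  (MoM) ", "59.2 ", "\n61.6", "")
print(row)
```

In the scraping loop you would call `format_row` with the `.text` values instead of printing them directly.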
