I'm trying to retrieve data from a website. My code is as follows:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup
# gets a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=weekly'
html = urlopen(url)
soup = BeautifulSoup(html)
# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # finds desired data by looking up <td> elements with class names
    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text
    previous = tr.find('td', {'class': 'ec_td_previous'}).text
    time = tr.find('td', {'class': 'ec_td_time'}).text
    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')
    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, importance, currency, event, actual, forecast, previous))
The first few records in the result set are as follows:
05:00 High EUR CPI (YoY) 1.3% 1.3% 1.3%
10:00 High USD Pending Home Sales (MoM) 1.5% 0.7% -0.7%
21:45 High CNY Caixin Manufacturing PMI 51.1 50.4 50.4
00:30 High AUD RBA Interest Rate Decision 1.50% 1.50% 1.50%
00:30 High AUD RBA Rate Statement
03:55 High EUR German Manufacturing PMI 58.1 58.3 58.3
03:55 High EUR German Unemployment Change -9K -5K 6K
I'm trying to now retrieve similar data from the following website:
https://www.fxstreet.com/economic-calendar
To do so, I revised the above-mentioned code as follows:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup
# gets a file-like object using urllib2.urlopen
url = 'https://www.fxstreet.com/economic-calendar'
html = urlopen(url)
soup = BeautifulSoup(html)
for tr in soup.find_all('tr', {'class': re.compile('fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow ')}):
    # finds desired data by looking up <div> elements with class names
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': ' fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    # importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')
    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, currency, event, actual, forecast, previous))
This code does not return any results (presumably because I'm referencing incorrect tags and/or classes). Does anyone see where my error is?
Thanks!
You should use selenium + Chromedriver / PhantomJS to parse dynamically created JavaScript content. urllib2 only downloads the initial HTML and does not run JavaScript, so it can't handle that. I don't think it makes much sense to use regex here: you can use the lxml parser and pass the class names as a list, which matches any row carrying at least one of them. Below is an example using the already mentioned tools:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.fxstreet.com/economic-calendar'

# let a real browser load the page so the JavaScript-built table exists
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
# a list of classes matches every <tr> that has at least one of them
for tr in soup.find_all('tr', {'class': ['fxst-tr-event', 'fxst-oddRow', 'fxit-eventrow', 'fxst-evenRow', 'fxs_cal_nextEvent']}):
    # variable names are kept exactly as in the question
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': 'fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    print(time, currency, event, actual, forecast, previous)
Note that lxml is a separate library that has to be installed. You can also handle multiple classes using the standard html.parser, but it's not as intuitive in my opinion. This code prints:
14:00
CAD 14:00 None 59.2
61.6
14:00
CAD 14:00 52.9
63.9
17:00
USD 17:00 765
...
...
I haven't altered any of the variables because I'm not really sure what you want them to be, so you will probably want to adjust them and format the output further.
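To make the class-matching point concrete, here is a minimal sketch that runs without a browser. It assumes a hypothetical static snippet (SAMPLE_HTML, the 'fxst-dateRow' class, and the event names are made up for illustration, not taken from the live page) and compares the long literal regex from the question with a list of classes and with a CSS selector:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical static markup that mimics the row classes discussed above;
# the real page builds its rows with JavaScript and may differ.
SAMPLE_HTML = """
<table>
  <tr class="fxst-tr-event fxst-oddRow fxit-eventrow"><td><div class="fxit-event-name">Event A</div></td></tr>
  <tr class="fxst-tr-event fxst-evenRow fxit-eventrow"><td><div class="fxit-event-name">Event B</div></td></tr>
  <tr class="fxst-dateRow"><td>Header row</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The pattern from the question is one long literal string, so no row matches it.
literal = re.compile('fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow ')
rows_literal = soup.find_all('tr', {'class': literal})

# A list matches any row that carries at least one of the listed classes.
rows_list = soup.find_all('tr', {'class': ['fxst-oddRow', 'fxst-evenRow']})

# A CSS selector requires all of the listed classes, in any order.
rows_css = soup.select('tr.fxst-tr-event.fxit-eventrow')


def event_names(rows):
    """Return the stripped event-name text of each row, skipping rows without one."""
    names = []
    for row in rows:
        cell = row.find('div', class_='fxit-event-name')
        if cell is not None:
            names.append(cell.get_text(strip=True))
    return names


print(len(rows_literal), event_names(rows_list), event_names(rows_css))
```

Checking for None before reading the text also avoids an AttributeError on rows that lack one of the cells. Whether the live page still uses these class names is something you would have to confirm in the browser's inspector.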