
Python: parsing CDATA in XML

I'm trying to scrape news data from the Forex Factory calendar, but I have a small problem: the XML file wraps some values in CDATA sections, and those values come out empty.

import requests
from bs4 import BeautifulSoup

def get_news_calendar():
    r = requests.get('http://www.forexfactory.com/ffcal_week_this.xml')
    soup = BeautifulSoup(r.text , 'lxml')
    events = soup.find_all('event')
    for event in events:
        print(event.find('title').text, event.find('country').text,
              event.find('date'), event.find('time').text,
              event.find('impact').text, event.find('forecast').text,
              event.find('previous').text)

Output:

Current Account EUR <date></date>    
Retail Sales m/m GBP <date></date>    
MPC Member Saunders Speaks GBP <date></date>    
Core CPI m/m CAD <date></date>    
CPI m/m CAD <date></date>    
Trimmed CPI y/y CAD <date></date>    
Median CPI y/y CAD <date></date>    
Common CPI y/y CAD <date></date>    
FOMC Member Kashkari Speaks USD <date></date>    
Flash Manufacturing PMI USD <date></date>    
Flash Services PMI USD <date></date>    
Existing Home Sales USD <date></date>    
IMF Meetings ALL <date></date>    
IMF Meetings ALL <date></date>    
Treasury Sec Mnuchin Speaks USD <date></date>    
French Presidential Election EUR <date></date>

Example from the XML file:

<event>
    <title>German Flash Manufacturing PMI</title>
    <country>EUR</country>
    <date><![CDATA[04-21-2017]]></date>
    <time><![CDATA[7:30am]]></time>
    <impact><![CDATA[Medium]]></impact>
    <forecast><![CDATA[58.1]]></forecast>
    <previous><![CDATA[58.3]]></previous>
</event> 

How can I print the values inside the CDATA sections?

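You appear to have got the name of the parser wrong. You are parsing an XML document, so you need to use lxml-xml instead of lxml. With 'lxml', Beautiful Soup uses lxml's HTML parser, which does not keep the CDATA content here; that is why those fields come out empty in your output.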

Try replacing

soup = BeautifulSoup(r.text , 'lxml')

with

soup = BeautifulSoup(r.text , 'lxml-xml')

After making this change to your get_news_calendar function I get the following output running it on your example XML file:

German Flash Manufacturing PMI EUR <date>04-21-2017</date> 7:30am Medium 58.1 58.3
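
The parser change can also be checked offline, without calling the feed. The following is only a minimal sketch: it assumes bs4 and lxml are installed, reuses the single <event> from the question as the whole document, and the names SAMPLE_XML, FIELDS and parse_events are made up for illustration.

```python
from bs4 import BeautifulSoup

# The single <event> from the question, used here as the whole document
# so that no network request is needed.
SAMPLE_XML = """<event>
    <title>German Flash Manufacturing PMI</title>
    <country>EUR</country>
    <date><![CDATA[04-21-2017]]></date>
    <time><![CDATA[7:30am]]></time>
    <impact><![CDATA[Medium]]></impact>
    <forecast><![CDATA[58.1]]></forecast>
    <previous><![CDATA[58.3]]></previous>
</event>"""

# Hypothetical helper names; only the child tag names come from the question.
FIELDS = ('title', 'country', 'date', 'time', 'impact', 'forecast', 'previous')

def parse_events(xml_text):
    """Return one dict per <event>, parsed with Beautiful Soup's XML parser."""
    soup = BeautifulSoup(xml_text, 'lxml-xml')
    rows = []
    for event in soup.find_all('event'):
        # get_text() returns the text of each child, including CDATA content.
        rows.append({name: event.find(name).get_text(strip=True)
                     for name in FIELDS})
    return rows

rows = parse_events(SAMPLE_XML)
print(rows[0]['date'], rows[0]['time'])
```

Calling .text (or get_text()) on every field, including date, avoids printing the <date> tag itself as in the output above.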

Consider using lxml directly and running XPath over all <event> nodes, since an element's .text attribute returns the CDATA content.

import requests
import lxml.etree as et

def get_news_calendar():        
    r = requests.get('http://www.forexfactory.com/ffcal_week_this.xml')
    data = et.fromstring(r.text.encode("utf-8"))

    events = data.xpath('//event')
    for event in events:
        print(event.find('title').text, event.find('country').text,
              event.find('date').text, event.find('time').text, 
              event.find('impact').text, event.find('forecast').text, 
              event.find('previous').text)

get_news_calendar()

# Bank Holiday NZD 04-16-2017 9:00pm Holiday None None
# Bank Holiday AUD 04-16-2017 10:00pm Holiday None None
# GDP q/y CNY 04-17-2017 2:00am High 6.8% 6.8%
# Industrial Production y/y CNY 04-17-2017 2:00am High 6.2% 6.3%
# Fixed Asset Investment ytd/y CNY 04-17-2017 2:00am Medium 8.8% 8.9%
# NBS Press Conference CNY 04-17-2017 2:00am Medium None None
# Retail Sales y/y CNY 04-17-2017 2:00am Low 9.7% 9.5%
# Bank Holiday CHF 04-17-2017 6:00am Holiday None None
# BOJ Gov Kuroda Speaks JPY 04-17-2017 6:15am High None None
# Bank Holiday GBP 04-17-2017 7:00am Holiday None None
# French Bank Holiday EUR 04-17-2017 7:00am Holiday None None
# ...
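
The None values in the output above come from events whose forecast or previous element is empty. As a rough sketch of one way to avoid them, the example below uses findtext with a default. Note the assumptions: it swaps lxml.etree for the standard library's xml.etree.ElementTree so it runs without extra packages, the <events> wrapper and the second sample event are made up for illustration, and event_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's XML. The <events> wrapper
# is an assumption; the second event has an empty <forecast /> and no
# <previous> element at all.
SAMPLE_XML = """<events>
<event>
    <title>German Flash Manufacturing PMI</title>
    <country>EUR</country>
    <date><![CDATA[04-21-2017]]></date>
    <time><![CDATA[7:30am]]></time>
    <impact><![CDATA[Medium]]></impact>
    <forecast><![CDATA[58.1]]></forecast>
    <previous><![CDATA[58.3]]></previous>
</event>
<event>
    <title>Bank Holiday</title>
    <country>NZD</country>
    <date><![CDATA[04-16-2017]]></date>
    <time><![CDATA[9:00pm]]></time>
    <impact><![CDATA[Holiday]]></impact>
    <forecast />
</event>
</events>"""

FIELDS = ('title', 'country', 'date', 'time', 'impact', 'forecast', 'previous')

def event_rows(xml_text):
    """Return one list of field strings per <event>, never containing None."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.iter('event'):
        # findtext gives the default for a missing child and '' for an
        # empty one, so every field is a plain string.
        rows.append([event.findtext(name, default='').strip()
                     for name in FIELDS])
    return rows

for row in event_rows(SAMPLE_XML):
    print(' '.join(row))
```

The XML parser merges CDATA into the element text, so no special handling is needed to read it.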


 