
Decoding nonbreaking space in pandas read_html

By default, pandas.read_html appears to convert &nbsp; entities to the Unicode character \xa0 (the non-breaking space):

import pandas as pd

url = 'http://www.reuters.com/finance/stocks/company-officers/IBM'
ibm = pd.read_html(url, header=0)[0]
ibm.iloc[0,0]

'Virginia\xa0Rometty'
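For context, the \xa0 in that output is only how Python's repr displays the single character U+00A0 (NO-BREAK SPACE); the string contains no literal backslash. A minimal sketch using only the standard library, with a hard-coded sample string standing in for the scraped cell (an assumption, nothing is fetched from the page):

```python
import unicodedata

# Sample value standing in for the scraped cell (hard-coded, not fetched).
name = "Virginia\xa0Rometty"

# The escape sequence denotes one character, U+00A0.
nbsp = name[8]
char_name = unicodedata.name(nbsp)
length = len(name)

# NFKC normalization maps U+00A0 to an ordinary space.
# Note that it also folds other compatibility characters (e.g. ligatures).
normalized = unicodedata.normalize("NFKC", name)

print(char_name, length, repr(normalized))
```

Normalization is broader than a plain replace, so it is worth using only if folding other compatibility characters is also acceptable.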

I know I can use a converter to convert these to spaces as follows:

spacer = lambda s: s.replace(u'\xa0', ' ')
ibm = pd.read_html(url, header=0, converters={'Name':spacer})[0]
ibm.iloc[0,0]

'Virginia Rometty'

This seems unnecessarily complicated for something that must be a pretty common problem. Is there another way? Perhaps an encoding option?

I don't think an encoding option will fix this, but you can simply replace the characters after reading the table. Using str.replace, you can replace every non-ASCII character with a space.

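# ASCII ends at \x7F; regex=True is required on pandas 2.0+, where str.replace matches literally by default
ibm['Name'] = ibm['Name'].str.replace(r'[^\x00-\x7F]', ' ', regex=True)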

Or replace just the non-breaking space:

ibm['Name'] = ibm['Name'].str.replace('\xa0', ' ')
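If several columns contain non-breaking spaces, the same replacement can be applied to the whole frame at once. This is a sketch rather than a definitive recipe: the DataFrame is a small hand-built stand-in for the read_html result, and the "Rank" column and second name are hypothetical sample values. It relies on DataFrame.replace with regex=True, which substitutes inside string cells and leaves numeric columns alone.

```python
import pandas as pd

# Hand-built stand-in for the table returned by read_html (sample values only).
df = pd.DataFrame({
    "Name": ["Virginia\xa0Rometty", "Jane\xa0Doe"],
    "Rank": [1, 2],
})

# With regex=True, the pattern is substituted inside every string cell;
# non-string columns such as "Rank" are left unchanged.
clean = df.replace("\xa0", " ", regex=True)

print(clean["Name"].tolist())
```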
