Parsing: How do I strip out Unicode Characters?

Question

I wrote some code to grab the text in between the break elements on this webpage http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478

I think I am on the right track, but right now I am getting some bad values. Below are my results:

[u'2133 Craigs Store Road', u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920', u'Contact Person:', u'Email Address:', u'Website:', u'Phone: 434-882-3150', u'']

I need to figure out how to strip the Unicode characters out of my result values. Can anyone help?

r=requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478')
soup=BeautifulSoup(r.content,'lxml')
tbl=soup.findAll('table')[2]

Contact=tbl.findAll('p')[0]

list=[]
for br in Contact.findAll('br'):
    next = br.nextSibling
    text=next.strip()
    list.append(text)
print list

Answer 1

from bs4 import BeautifulSoup, NavigableString, Tag
import requests
import re

r=requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478')
soup=BeautifulSoup(r.content,'lxml')
tbl=soup.findAll('table')[2]

Contact=tbl.findAll('p')[0]

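# Compile the pattern once, outside the loop.
regex = re.compile(r'[\n\r\t\xa0]')

list=[]
for br in Contact.findAll('br'):
    # The text node that follows each <br>.
    next = br.nextSibling
    # Replace each newline, carriage return, tab and non-breaking space
    # with a space. This runs on `next` itself, so the result is not stripped.
    text=regex.sub(' ', next)
    list.append(text)
print list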

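I looked into it some more and figured out I could use regular expressions to take out those values. I still have an issue with spacing:

[u' 2133 Craigs Store Road', u'Afton, VA 22920', u'Contact Person: ', u'Email Address: ', u'Website: ', u'Phone: 434-882-3150', u' ']

But at least the characters are gone. The leftover spaces most likely appear because each matched character is replaced with a space, and the substitution runs on the unstripped next rather than on a stripped copy.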
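One way to deal with the remaining spacing is to collapse every run of whitespace into a single space and strip the ends afterwards. The snippet below is only a sketch: clean_text and WHITESPACE_RUN are names made up for illustration, and the sample values are copied from the question's output instead of being fetched from the page.

```python
import re

# One or more whitespace characters. With re.UNICODE, \s also covers the
# non-breaking space (\xa0) when matching unicode text.
WHITESPACE_RUN = re.compile(r'\s+', re.UNICODE)


def clean_text(raw):
    """Collapse each run of whitespace to one space, then trim both ends."""
    return WHITESPACE_RUN.sub(u' ', raw).strip()


# Sample values copied from the question's output (not fetched live).
raw_values = [
    u'2133 Craigs Store Road',
    u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920',
    u'Phone: 434-882-3150',
    u'',
]

cleaned = [clean_text(value) for value in raw_values]
# Drop entries that are empty once the whitespace is gone.
cleaned = [value for value in cleaned if value]
print(cleaned)
```

Stripping after the substitution, not before it, is what removes the leading and trailing spaces seen in the output above.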

Answer 2

You can use the built-in replace method of the str type.

text = next.strip().replace("\n", "").replace("\t", "").replace("\r", "")

That way the \n, \t and \r characters are replaced with nothing, which removes them from the string.
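As a rough illustration of what the replace chain does to one of the values from the question, and of a regex-free alternative, consider the sketch below. The sample string is copied from the question's output, and the variable names are made up for this example. Note that the chain does not touch the non-breaking space (\xa0), and that deleting the newlines outright joins "Afton," and "VA" together.

```python
# Sample value copied from the question's output (not fetched live).
raw = u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920'

# The replace chain: \n, \t and \r are deleted, but the non-breaking
# space (\xa0) is left in place.
chained = raw.strip().replace(u'\n', u'').replace(u'\t', u'').replace(u'\r', u'')

# Alternative: split() with no arguments splits on any run of whitespace
# (including \xa0 for unicode text), and join() puts single spaces back.
normalized = u' '.join(raw.split())

print(repr(chained))
print(repr(normalized))
```

The split and join approach keeps one space between the words, so it may be the better fit when the pieces should stay readable as an address.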

2 answers

solution1
0 2015-07-22 16:52:56

solution2
0 ACCEPTED 2015-07-22 17:01:19
