
Parsing: How do I strip out Unicode Characters?

I wrote some code to grab the text in between the break elements on this webpage http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478

I think I am on the right track, but right now I am getting some bad values. Below are my results: [u'2133 Craigs Store Road', u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920', u'Contact Person:', u'Email Address:', u'Website:', u'Phone: 434-882-3150', u'']

I need to figure out how to strip out the unicode from my result values. Can anyone help?

r=requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478')
soup=BeautifulSoup(r.content,'lxml')
tbl=soup.findAll('table')[2]

Contact=tbl.findAll('p')[0]

list=[]
for br in Contact.findAll('br'):
    next = br.nextSibling
    text=next.strip()
    list.append(text)
print list

from bs4 import BeautifulSoup, NavigableString, Tag
import requests
import re

r=requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478')
soup=BeautifulSoup(r.content,'lxml')
tbl=soup.findAll('table')[2]

Contact=tbl.findAll('p')[0]

list=[]
for br in Contact.findAll('br'):
    next = br.nextSibling
    regex = re.compile(r'[\n\r\t\xa0]')
    text=next.strip()
    text=regex.sub(' ', next)
    list.append(text)
print list          

I looked into it some more and figured out I could use regular expressions to take out those values. I still have an issue with spacing: [u' 2133 Craigs Store Road', u'Afton, VA 22920', u'Contact Person: ', u'Email Address: ', u'Website: ', u'Phone: 434-882-3150', u' '] But at least the characters are gone.

You can use the built-in replace method of the str type.

text = next.strip().replace("\n", "").replace("\t", "").replace("\r", "")

That replaces each \n, \t and \r with an empty string, which removes them. It does not touch the \xa0 (non-breaking space), which would need its own replace call.
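As a rough sketch of an alternative to chaining replace calls or substituting single characters with a regex, whitespace can be collapsed in one step. The sample strings below are copied from the question's output; the helper names collapse_whitespace and collapse_whitespace_re are made up for illustration. This assumes Python 3, where str.split() and \s both treat \xa0 as whitespace; on Python 2, as used in the question, the regex variant would likely need flags=re.UNICODE to catch \xa0.

```python
import re

# Sample values taken from the question's printed results.
raw_values = [
    "2133 Craigs Store Road",
    "Afton,\r\n\t\tVA \xa0\r\n\t\t22920",
    "Phone: 434-882-3150",
    "",
]


def collapse_whitespace(text):
    # split() with no arguments splits on any run of Unicode whitespace
    # (including the non-breaking space) and drops leading/trailing runs,
    # so joining with a single space normalizes the spacing.
    return " ".join(text.split())


def collapse_whitespace_re(text):
    # Regex variant: replace each run of whitespace with one space,
    # then trim the ends. Assumes Python 3 str patterns (Unicode-aware \s).
    return re.sub(r"\s+", " ", text).strip()


# Skip entries that are empty or whitespace-only.
cleaned = [collapse_whitespace(v) for v in raw_values if v.strip()]
print(cleaned)
```

Replacing a whole run of whitespace at once, rather than each character separately, is what avoids the leftover leading and trailing spaces mentioned in the question.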
