I wrote some code to grab the text between the <br> elements on this webpage: http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478
I think I am on the right track, but right now I am getting some bad values. Below are my results: [u'2133 Craigs Store Road', u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920', u'Contact Person:', u'Email Address:', u'Website:', u'Phone: 434-882-3150', u'']
I need to figure out how to strip the whitespace escape sequences (\r, \n, \t, \xa0) out of my result values. Can anyone help?
from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478')
soup = BeautifulSoup(r.content, 'lxml')
tbl = soup.findAll('table')[2]
Contact = tbl.findAll('p')[0]
list = []
for br in Contact.findAll('br'):
    # The text node that follows each <br>
    next = br.nextSibling
    text = next.strip()
    list.append(text)
print list
from bs4 import BeautifulSoup, NavigableString, Tag
import requests
import re

r = requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10478')
soup = BeautifulSoup(r.content, 'lxml')
tbl = soup.findAll('table')[2]
Contact = tbl.findAll('p')[0]
list = []
for br in Contact.findAll('br'):
    next = br.nextSibling
    regex = re.compile(r'[\n\r\t\xa0]')
    text = next.strip()
    text = regex.sub(' ', next)
    list.append(text)
print list
I looked into it some more and figured out I could use regular expressions to take out those values. I still have an issue with spacing: [u' 2133 Craigs Store Road', u'Afton, VA 22920', u'Contact Person: ', u'Email Address: ', u'Website: ', u'Phone: 434-882-3150', u' ']. But at least the escape characters are gone.
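A likely cause of the leftover spacing is visible in the code above: the result of next.strip() is overwritten on the following line, because regex.sub(' ', next) runs on the unstripped next rather than on text, and each matched character becomes its own space. Below is a minimal sketch of one way around that, collapsing whole runs of whitespace and stripping afterwards. The helper name clean_whitespace and the first sample string are assumptions for illustration, not taken from the page.

```python
import re

# One pattern for any run of whitespace. \xa0 (non-breaking space) is listed
# explicitly because \s does not match it in every Python version/mode.
WHITESPACE_RUN = re.compile(r'[\s\xa0]+')


def clean_whitespace(text):
    """Hypothetical helper: collapse whitespace runs, then strip the ends."""
    return WHITESPACE_RUN.sub(u' ', text).strip()


# Sample inputs: the second value is copied from the question's output,
# the first is an assumed example of a value with leading whitespace.
samples = [
    u'\n2133 Craigs Store Road',
    u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920',
]

for sample in samples:
    print(repr(clean_whitespace(sample)))
```

Stripping after the substitution matters here: a value that starts or ends with a newline would otherwise keep a leading or trailing space.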
You can use the built-in replace method of the str type.
text = next.strip().replace("\n", "").replace("\t", "").replace("\r", "")
That way each \n, \t and \r character is replaced with an empty string, i.e. removed.
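One caveat with chained replace calls, shown in the sketch below: removing the characters outright can glue neighbouring words together, and an interior \xa0 is left untouched. A possible alternative is to split on whitespace and re-join with single spaces. The function names chained_replace and normalize are made up for this example, and the sample values are copied from the question's output rather than fetched live.

```python
# Sample values copied from the question's printed result (not fetched live).
raw_values = [
    u'2133 Craigs Store Road',
    u'Afton,\r\n\t\tVA \xa0\r\n\t\t22920',
    u'Contact Person:',
    u'Phone: 434-882-3150',
    u'',
]


def chained_replace(text):
    """The replace-based approach from the answer above."""
    return text.strip().replace("\n", "").replace("\t", "").replace("\r", "")


def normalize(text):
    """Hypothetical alternative: split on any whitespace, re-join with spaces."""
    return u" ".join(text.split())


# Normalize every value and drop the ones that end up empty.
cleaned = [value for value in map(normalize, raw_values) if value]

print(repr(chained_replace(raw_values[1])))  # words joined, \xa0 still present
print(cleaned)
```

With no arguments, split() treats the non-breaking space as whitespace for Unicode strings, so no separate handling of \xa0 is needed in normalize.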