Using Python to extract name and phone number from a webpage
On this site:
http://www.yellowpages.com/memphis-tn/gift-shops
I want to extract the name of each shop and its associated phone number into a CSV. For example, the first entry should be:
Babcock Gifts, (901) 763-0700
and so on.
I am using Python. After calling urllib2.urlopen(), I have the entire page source. How do I process this text to achieve my goal?
I would suggest using regular expressions that key on content unique to the lines you want.
For example, the first listing appears in the page source as:
<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113?lid=187490699" class="url " data-analytics="{"click_id":1600,"rank":1,"act":1,"FL":"list","position":0}" title="Babcock Gifts">Babcock Gifts</a>
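You would use something like:
import re

# The class attribute in the page source is "url " (trailing space), so allow it
re_name = re.compile(r'<a href=.*class="url ?".*')
re_front = re.compile(r'^.*title="')
re_back = re.compile(r'".*')
for line in page:  # page: the object returned by urllib2.urlopen()
    if re_name.search(line):
        out = re_front.sub('', line)  # drop everything up to and including title="
        out = re_back.sub('', out)    # drop the closing quote and the rest of the line
        print(out.strip())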
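The three substitutions above can also be collapsed into a single pattern with a capture group. The following is only a sketch in Python 3: the names LINE, TITLE_RE and shop_name are made up for illustration, and LINE is the listing line quoted earlier with its long data-analytics attribute left out for brevity. The real page markup may have changed since.

```python
import re

# The listing line quoted above, minus the data-analytics attribute (shortened for illustration).
LINE = ('<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113'
        '?lid=187490699" class="url " title="Babcock Gifts">Babcock Gifts</a>')

# class="url " has a trailing space in the page source, hence the optional space.
# The capture group grabs whatever sits between the quotes of the title attribute.
TITLE_RE = re.compile(r'<a\s[^>]*class="url ?"[^>]*title="([^"]*)"')

def shop_name(line):
    """Return the title of a listing link, or None if the line is not a listing."""
    match = TITLE_RE.search(line)
    return match.group(1) if match else None

name = shop_name(LINE)
print(name)
```

Using [^>]* instead of .* keeps the match inside a single tag, which makes the pattern a little less fragile when several tags share one line.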
I tried BeautifulSoup:
import urllib
import re
from BeautifulSoup import BeautifulSoup

url = 'http://www.yellowpages.com/memphis-tn/gift-shops'
u = urllib.urlopen(url)
soup = BeautifulSoup(u)
# Each listing sits in a <div class="info">
test = soup.findAll('div', {'class': "info"})
for each in test:
    aref = each.findAll('a', {'class': "url "})
    phone = each.findAll('span', {'class': "business-phone phone"})
    # str(phone) still contains the tag markup, so keep only the digits
    x = re.sub(r'[^0-9]', "", str(phone))
    print aref[0]['title'] + " - " + x
I derived this script by looking at the HTML source of the page. I found the 'div' section that contains the listings. Each company is then listed in an 'a' tag, which I collected in 'aref'.
Strangely, when I picked up 'phone', the text contained the whole string including the tags. I am not sure why. So I used a regex to strip everything except the digits that make up the phone number.
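A likely reason for the stray markup is that findAll returns a list of tag objects, and calling str() on that list renders each tag with its markup rather than just its text. Below is a sketch of one way to keep only the text node and write the result as CSV, which is what the question asked for. It uses Python 3 and only the standard library (html.parser and csv) instead of BeautifulSoup. The SAMPLE markup, the ListingParser class and the listings_to_csv function are hypothetical; only the class names "info", "url " and "business-phone phone" come from the script above.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup; only the class names are taken from the script above.
SAMPLE = (
    '<div class="info">'
    '<a href="#" class="url " title="Babcock Gifts">Babcock Gifts</a>'
    '<span class="business-phone phone">(901) 763-0700</span>'
    '</div>'
)

class ListingParser(HTMLParser):
    """Collect [name, phone] pairs from listing markup."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._name = None       # title of the most recent listing link
        self._in_phone = False  # True while inside the phone <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "url" in classes:
            self._name = attrs.get("title")
        elif tag == "span" and "business-phone" in classes:
            self._in_phone = True

    def handle_data(self, data):
        # Only the text inside the phone span is kept, never the tag itself.
        if self._in_phone and self._name is not None:
            self.rows.append([self._name, data.strip()])
            self._name = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_phone = False

def listings_to_csv(html):
    """Parse the markup and return the listings as CSV text."""
    parser = ListingParser()
    parser.feed(html)
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(parser.rows)
    return buf.getvalue()

csv_text = listings_to_csv(SAMPLE)
print(csv_text)
```

This keeps the phone number in its original "(901) 763-0700" form. With BeautifulSoup itself, reading the text of the first matching tag instead of converting the whole result list to a string should give the same effect.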
Here is the documentation for BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html