Extracting names and phone numbers from a web page using Python

Question

What I want to do is, on this site:

http://www.yellowpages.com/memphis-tn/gift-shops

I want to extract the name of each shop and its associated phone number into a CSV. For example, the first entry should be:

Babcock Gifts, (901) 763-0700

etc.

I am using Python. After calling urllib2.urlopen(), I have the entire page source. How do I process this text to achieve my goal?

Answer 1

I would suggest using regular expressions that key on content unique to the lines you want.

For example, given a line like this:

<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113?lid=187490699" class="url " data-analytics="{&quot;click_id&quot;:1600,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;position&quot;:0}" title="Babcock Gifts">Babcock Gifts</a>

You would use something like:

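import re

# page: the object returned by urllib2.urlopen(...), iterated line by line
# The class attribute in the page source is "url " (note the trailing space)
re_name = re.compile(r'<a href=.*class="url ".*')
re_front = re.compile(r'^.*title="')
re_back = re.compile(r'".*')
for line in page:
    if re_name.search(line):
        out = re_front.sub('', line)  # drop everything up to and including title="
        out = re_back.sub('', out)    # drop the closing quote and everything after it
        print(out.strip())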
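The same idea can be expressed with a single pattern and a capture group instead of two substitutions. This is only a sketch: the names `NAME_RE` and `extract_names` are made up for illustration, the sample anchor is a shortened copy of the line quoted above, and the pattern assumes `class` appears before `title` inside the tag, as it does in that line.

```python
import re

# Capture the title attribute of anchors whose class is "url"
# (optionally followed by whitespace, as in the page source).
NAME_RE = re.compile(r'<a\s[^>]*class="url\s*"[^>]*title="([^"]*)"')


def extract_names(lines):
    """Return the title of every listing anchor found in the given lines."""
    names = []
    for line in lines:
        match = NAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


# Shortened version of the anchor quoted in the answer, plus a non-matching line.
sample = [
    '<div class="info">',
    '<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113?lid=187490699" '
    'class="url " title="Babcock Gifts">Babcock Gifts</a>',
]
print(extract_names(sample))
```

Regex matching of this kind depends on the exact attribute order and spacing in the markup, so it may need adjusting if the page layout changes.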

Answer 2

I tried BeautifulSoup:

import re
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

url = 'http://www.yellowpages.com/memphis-tn/gift-shops'

u = urllib.urlopen(url)
soup = BeautifulSoup(u)

# Each listing sits inside a <div class="info">
test = soup.findAll('div', {'class': "info"})

for each in test:
    aref = each.findAll('a', {'class': "url "})
    phone = each.findAll('span', {'class': "business-phone phone"})
    # str(phone) includes the tag markup, so keep only the digits
    x = re.sub(r'[^0-9]', "", str(phone))
    print(aref[0]['title'] + " - " + x)

I derived this script by looking at the HTML source of the page. I found the 'div' section that contains the listings. Each company is then listed in <a> tags, which I collected in 'aref'.
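For readers without BeautifulSoup 3, the same walk over the markup can be sketched with the standard library's `html.parser` (swapped in here for BeautifulSoup, on Python 3). The HTML snippet below is a hypothetical stand-in that reuses only the class names mentioned in the script; the real page structure may differ, and `ListingParser` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical snippet mirroring the class names used in the script above.
SAMPLE_HTML = """
<div class="info">
  <a href="/mip/babcock-gifts" class="url " title="Babcock Gifts">Babcock Gifts</a>
  <span class="business-phone phone">(901) 763-0700</span>
</div>
"""


class ListingParser(HTMLParser):
    """Collect (name, phone) pairs from listing markup."""

    def __init__(self):
        super().__init__()
        self.listings = []      # finished (name, phone) pairs
        self._name = None       # title of the most recent listing anchor
        self._in_phone = False  # True while inside the phone <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "url" in classes:
            self._name = attrs.get("title")
        elif tag == "span" and "business-phone" in classes:
            self._in_phone = True

    def handle_data(self, data):
        # Text inside the phone span is the number itself, without tag markup.
        if self._in_phone and self._name is not None:
            self.listings.append((self._name, data.strip()))
            self._name = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_phone = False


parser = ListingParser()
parser.feed(SAMPLE_HTML)
print(parser.listings)
```

This sketch pairs each phone number with the most recent preceding anchor, so it assumes the name appears before the phone within a listing.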

Strangely, when I picked up 'phone', the text contained the whole string, including the tag. I am not sure why. So I used a regex to strip everything except the digits that make up the phone number.
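Since the question asks for CSV output in the form "Babcock Gifts, (901) 763-0700", the digit-stripping step can be followed by reformatting and a `csv.writer`. The following is a minimal sketch in Python 3; `format_phone` and `write_rows` are hypothetical helper names, and the 10-digit US number layout is an assumption based on the example in the question.

```python
import csv
import io
import re


def format_phone(raw):
    """Strip non-digits, then format a 10-digit number as (AAA) PPP-NNNN."""
    digits = re.sub(r'[^0-9]', '', raw)
    if len(digits) != 10:
        return digits  # leave anything unexpected as bare digits
    return "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])


def write_rows(rows, out):
    """Write (name, raw phone) pairs to a file-like object as CSV."""
    writer = csv.writer(out)
    for name, phone in rows:
        writer.writerow([name, format_phone(phone)])


# A raw value shaped like str(phone) in the script: tag markup around the number.
raw_phone = '[<span class="business-phone phone">(901) 763-0700</span>]'

buffer = io.StringIO()
write_rows([("Babcock Gifts", raw_phone)], buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("shops.csv", "w", newline="")`, where the filename is only an example.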

Here is the documentation for BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
