Using Python to extract name and phone number from a webpage

What I want to do is, on this site:

http://www.yellowpages.com/memphis-tn/gift-shops

I want to extract the name of each shop and its associated phone number into a CSV. For example, the first entry should be:

Babcock Gifts, (901) 763-0700

etc.

I am using Python. After performing a urllib2.urlopen(), I have the entire page content. How do I process this text to achieve my goal?

I would suggest using regular expressions and matching on content that is unique to the lines you want.

For example:

<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113?lid=187490699" class="url " data-analytics="{&quot;click_id&quot;:1600,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;position&quot;:0}" title="Babcock Gifts">Babcock Gifts</a>

You would use something like:

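import re

# Listing anchors carry class="url " (note the trailing space in the page source)
re_name = re.compile(r'<a href=.*class="url ".*')
re_front = re.compile(r'^.*title="')  # everything up to and including title="
re_back = re.compile(r'".*')          # the closing quote and everything after it

# page is the file-like object returned by urllib2.urlopen()
for line in page:
    if re_name.search(line):
        out = re_front.sub('', line)
        out = re_back.sub('', out)
        print(out.strip())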
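
The three patterns above can also be collapsed into a single pattern with a capture group. This is only a minimal sketch, assuming Python 3 and that each listing anchor carries class="url " followed later by a title attribute, as in the sample line (whose data-analytics attribute is shortened here). The name extract_names is a hypothetical helper, not part of the original answer, and regex matching on HTML will break if the attribute order changes.

```python
import re

# Assumption: listing anchors look like the sample line, with class="url "
# (trailing space included) appearing before the title attribute.
NAME_RE = re.compile(r'<a\s[^>]*class="url\s*"[^>]*title="([^"]*)"')

def extract_names(html):
    """Return the title attribute of every listing anchor found in html."""
    return NAME_RE.findall(html)

sample = (
    '<a href="http://www.yellowpages.com/memphis-tn/mip/babcock-gifts-14131113'
    '?lid=187490699" class="url " data-analytics="{&quot;click_id&quot;:1600}" '
    'title="Babcock Gifts">Babcock Gifts</a>'
)
print(extract_names(sample))  # ['Babcock Gifts']
```

Because findall runs over the whole string, this version does not depend on each anchor sitting on its own line.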

I tried BeautifulSoup:

import re
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

url = 'http://www.yellowpages.com/memphis-tn/gift-shops'

u = urllib.urlopen(url)
soup = BeautifulSoup(u)

# Each listing sits in a <div class="info">
test = soup.findAll('div', {'class': "info"})

for each in test:
    aref = each.findAll('a', {'class': "url "})
    phone = each.findAll('span', {'class': "business-phone phone"})
    # str(phone) includes the tag markup, so keep only the digits
    x = re.sub(r'[^0-9]', "", str(phone))
    print(aref[0]['title'] + " - " + x)

I derived this script by looking at the HTML source of the page. I found the 'div' section that contains the listings. Each company is then listed in <a> tags, which I collected in 'aref'.

Strangely, I picked up 'phone', but the text contained the whole string, including the tag. I am not sure why. So I used a regex to strip everything except the digits that make up the phone number.
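
Stripping the non-digits leaves a bare string such as 9017630700, while the question asked for entries like "Babcock Gifts, (901) 763-0700" in a CSV. Here is a minimal sketch of that last step, assuming Python 3, ten-digit US numbers, and (name, digits) pairs already collected by a loop like the one above. The names format_phone and write_csv are hypothetical helpers introduced only for illustration.

```python
import csv
import io

def format_phone(digits):
    """Format a 10-digit string as (XXX) XXX-XXXX; return other input unchanged."""
    if len(digits) == 10 and digits.isdigit():
        return "({}) {}-{}".format(digits[:3], digits[3:6], digits[6:])
    return digits

def write_csv(rows, out):
    """Write (name, digits) pairs to the file-like object out as CSV rows."""
    writer = csv.writer(out)
    for name, digits in rows:
        writer.writerow([name, format_phone(digits)])

# Assumption: rows were collected by a scraping loop like the one above.
rows = [("Babcock Gifts", "9017630700")]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open('shops.csv', 'w', newline=''), where shops.csv is a placeholder file name.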

Here is the documentation for BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
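
The div / a / span structure described in this answer can also be walked with only the standard library, which may help where BeautifulSoup is not installed. The following is a rough sketch, assuming Python 3 and markup shaped like the listing described above (an anchor with class "url " and a title, followed by a span with class "business-phone phone"). ListingParser and the sample fragment are hypothetical; this swaps html.parser in for BeautifulSoup and is not the method used in the script above.

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collect (name, phone) pairs from listing markup.

    Assumption: each listing has an <a class="url " title="..."> followed by
    a <span class="business-phone phone"> that holds the number.
    """

    def __init__(self):
        super().__init__()
        self.listings = []      # finished (name, phone) pairs
        self._name = None       # title of the most recent listing anchor
        self._in_phone = False  # True while inside the phone span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "url" in classes:
            self._name = attrs.get("title")
        elif tag == "span" and "business-phone" in classes:
            self._in_phone = True

    def handle_data(self, data):
        # Text inside the phone span is the number itself, without any tags.
        text = data.strip()
        if self._in_phone and self._name is not None and text:
            self.listings.append((self._name, text))
            self._name = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_phone = False

# Hypothetical fragment modelled on the markup described above.
sample = (
    '<div class="info"><a href="#" class="url " title="Babcock Gifts">'
    'Babcock Gifts</a><span class="business-phone phone">(901) 763-0700</span></div>'
)
parser = ListingParser()
parser.feed(sample)
print(parser.listings)
```

Here the phone text arrives through handle_data as plain text, so no regex clean-up of tag markup is needed.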
