How to get span value using Python and BeautifulSoup

I am using BeautifulSoup for the first time and am trying to collect several pieces of data, such as email, phone number, and mailing address, from a soup object.

Using regular expressions, I can identify the email address. My code to find the email is:

import re

def get_email(link):
    mail_list = []
    for i in link:
        a = str(i)
        email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\">", re.IGNORECASE)
        ik = re.findall(email_pattern, a)
        if (len(ik) == 1):
            mail_list.append(i)
        else:
            pass
    s_email = str(mail_list[0]).split('<a href="')
    t_email = str(s_email[1]).split('">')
    print t_email[0]

Now I also need to collect the phone number, mailing address, and website URL. I think BeautifulSoup must have an easy way to find this kind of data.

A sample HTML page is shown below:

<ul>
    <li>
        <span>Email:</span>
        <a href="mailto:abc@gmail.com">Message Us</a>
    </li>
    <li>
        <span>Website:</span>
        <a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
    </li>
    <li>
        <span>Phone:</span>
        (123)456-789
    </li>
</ul>

Using BeautifulSoup, I am trying to collect the values for the Email, Website, and Phone spans.

Thanks in advance.

The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again, which ignores much of the point of using BeautifulSoup in the first place. You might need a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary; you can leave it out entirely.
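
To illustrate the point about handling only the contents of the href attribute, here is a minimal sketch in Python 3 using just the standard library. The helper name email_from_href is hypothetical, and dropping a trailing "?subject=..." part is an assumption about what real links may contain, not something the sample page requires.

```python
import re

# Hypothetical helper: pull the address out of an href value such as
# "mailto:abc@gmail.com". Anything after a "?" (e.g. "?subject=Hi") is dropped.
MAILTO_RE = re.compile(r'^mailto:([^?]+)', re.IGNORECASE)

def email_from_href(href):
    """Return the email address for a mailto href, otherwise None."""
    if not href:
        return None
    m = MAILTO_RE.match(href.strip())
    return m.group(1) if m else None

print(email_from_href("mailto:abc@gmail.com"))  # abc@gmail.com
print(email_from_href("http://www.abcl.com"))   # None
```

The function works on the attribute value alone, so the parsed element never has to be turned back into an HTML string.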

Here's some code that does something like what you want, and might be a useful starting point:

from BeautifulSoup import BeautifulSoup
import re

# Assuming that html is your input as a string:
soup = BeautifulSoup(html)

all_contacts = []

def mailto_link(e):
    '''Return the email address if the element is a mailto link,
    otherwise return None'''
    if e.name != 'a':
        return None
    for key, value in e.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)',value)
            if m:
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = {}
    # Search only within this <ul>, not the whole document
    for li in ul.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

print all_contacts

That produces a list with one dictionary per contact found; in this case it will just be:

[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc@gmail.com'}]
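
The code above targets Python 2 and the older BeautifulSoup 3 package. For readers on Python 3, the following is a rough equivalent, assuming the bs4 package (BeautifulSoup 4) is installed. The function name extract_contacts and the inline html string are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample input copied from the question.
html = """
<ul>
    <li>
        <span>Email:</span>
        <a href="mailto:abc@gmail.com">Message Us</a>
    </li>
    <li>
        <span>Website:</span>
        <a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
    </li>
    <li>
        <span>Phone:</span>
        (123)456-789
    </li>
</ul>
"""

def extract_contacts(html):
    """Return one dict per <ul>, with 'email', 'website' and 'phone' keys
    for whichever of those entries are present."""
    soup = BeautifulSoup(html, "html.parser")
    contacts = []
    for ul in soup.find_all("ul"):
        contact = {}
        for li in ul.find_all("li"):
            span = li.find("span")
            if span is None:
                continue
            label = span.get_text(strip=True)
            link = li.find("a", href=True)
            if label == "Email:":
                if link and link["href"].lower().startswith("mailto:"):
                    contact["email"] = link["href"][len("mailto:"):]
            elif label == "Website:":
                if link:
                    contact["website"] = link["href"]
            elif label == "Phone:":
                # The number is the bare text node right after the <span>.
                sibling = span.next_sibling
                if sibling is not None:
                    contact["phone"] = str(sibling).strip()
        contacts.append(contact)
    return contacts

print(extract_contacts(html))
```

As in the original, each label is matched by its exact text, so a page that writes "E-mail:" or omits the colon would need the comparisons adjusted.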
