Can't figure out how to do web scraping with BeautifulSoup

Question

I am trying to scrape the info below from a web page. This is the full HTML:

<tr class="owner">
   <td id="P184" class="ownerP" colspan="4">
      <ul>
         <li><span class="detailType">name:</span><span class="detail">merry/span></li>
         <li><a title="sendmessage" class="sendMessageLink" onclick="return openSendMessage('/sendMessage.php',20205" href="" tabindex="0"><span></span>sendmessage</a>&nbsp;<span class="remark_soft">(by pm system)</span></li>
         <li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li>
         <li><span class="detailType"></span></li>
      </ul>
   </td>
</tr>

I only want to get this info (the phone number):

 <a class="detail" href="tel:0387362531">0387362531</a>

Here is my code, but it doesn't work:

 for details in soup.find_all(attrs= {"class": "detail"}):
    re_res = re.search(r"tel:\('.*?',(\d+)\)", details['href'])
    print(re_res)

Answer 1

You are pretty close. Here you go:

import re
from bs4 import BeautifulSoup

html_doc = """
<tr class="owner"><td id="P184" class="ownerP" colspan="4"><ul>
            <li><span class="detailType">name:</span><span class="detail">merry/span></li>
            <li><a title="sendmessage" class="sendMessageLink" onclick="return openSendMessage('/sendMessage.php',20205" href="" tabindex="0"><span></span>sendmessage</a>&nbsp;<span class="remark_soft">(by pm system)</span></li><li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li><li><span class="detailType"></span></li>
</ul></td></tr>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for details in soup.find_all(attrs= {"class": "detail"}):
    if "href" in details.attrs and re.search("^tel:", details.attrs["href"]):
        print(details.text)

Output:

0387362531

I'm simply looping through the details list you built: if an element has an href and that href starts with tel:, I print its text.
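
As a variation on the loop above, the same filtering can be pushed into find_all itself, since BeautifulSoup accepts a compiled regular expression as an attribute filter. This is only a sketch: the markup is a trimmed-down, well-formed version of the snippet from the question, and the names phone_links and phones are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down, well-formed version of the markup from the question (an assumption).
html_doc = """
<ul>
  <li><span class="detailType">name:</span><span class="detail">merry</span></li>
  <li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only <a class="detail"> tags whose href starts with "tel:".
# Tags without an href (such as the span) never match the href filter.
phone_links = soup.find_all("a", class_="detail", href=re.compile(r"^tel:"))
phones = [link.get_text(strip=True) for link in phone_links]
print(phones)
```

With this form there is no need to check for the href key by hand.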

Answer 2

You should replace soup.find_all(attrs= {"class": "detail"}) with soup.find_all('a', attrs= {"class": "detail"})[0] so that the span with the same class is not included in details. Note that the [0] index gives you a single tag rather than a list, so no loop is needed.

Moreover, your regex does not work; tel:(\d+) should. But rather than using a regex, why not just get the text of the a tag with details.text?
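
A minimal sketch of that suggestion follows. It assumes a trimmed-down, well-formed copy of the markup from the question, and it uses find (which returns the first match or None) instead of indexing find_all with [0], so a page without a phone link does not raise an IndexError. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed-down, well-formed version of the markup from the question (an assumption).
html_doc = """
<ul>
  <li><span class="detailType">name:</span><span class="detail">merry</span></li>
  <li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Without a tag name, both the <span> and the <a> match the class filter.
all_details = soup.find_all(attrs={"class": "detail"})
print(len(all_details))

# Restricting the search to <a> leaves only the phone link.
link = soup.find("a", attrs={"class": "detail"})

# find() returns None when nothing matches, so guard before reading .text.
phone = link.text if link is not None else None
print(phone)
```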

Answer 3

You have to add the element type a to find_all. Also, your regex tel:\('.*?',(\d+)\) tries to match literal opening and closing parentheses, \( and \), along with a quoted string and a comma, none of which appear in the href.

You could update your regex to tel:(\d+), which matches tel: followed by one or more digits in a capturing group (group 1) that you can retrieve with re_res.group(1).

For example:

for details in soup.find_all('a', attrs= {"class": "detail"}):
    re_res = re.search(r"tel:(\d+)", details['href'])
    print(re_res.group(1))  # 0387362531
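
To see the difference between the two patterns in isolation, here is a small standard-library sketch that applies both to the href value from the question. It also guards against re.search returning None, which would otherwise make .group(1) raise an AttributeError on a link that does not match. The variable names are illustrative only.

```python
import re

href = "tel:0387362531"  # the href value from the question

# Original pattern: expects "tel:('...',digits)", which the href does not contain.
original = re.search(r"tel:\('.*?',(\d+)\)", href)

# Corrected pattern: "tel:" followed by one or more digits in group 1.
fixed = re.search(r"tel:(\d+)", href)

print(original)

# re.search returns None on no match, so check before calling .group().
number = fixed.group(1) if fixed else None
print(number)
```

The number stays a string, so the leading zero is preserved.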

Answer 4

You can get the same result without using a regex. In that case, try the approach below:

from bs4 import BeautifulSoup

html_doc = """
<tr class="owner"><td id="P184" class="ownerP" colspan="4"><ul>
            <li><span class="detailType">name:</span><span class="detail">merry/span></li>
            <li><a title="sendmessage" class="sendMessageLink" onclick="return openSendMessage('/sendMessage.php',20205" href="" tabindex="0"><span></span>sendmessage</a>&nbsp;<span class="remark_soft">(by pm system)</span></li><li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li><li><span class="detailType"></span></li>
</ul></td></tr>
"""

Using .select():

soup = BeautifulSoup(html_doc, 'html.parser')
for telephone in soup.select("a[href^='tel:']"):
    if "detail" in telephone['class']:
        print(telephone.text)

Or with .find_all():

soup = BeautifulSoup(html_doc, 'html.parser')
for telephone in soup.find_all("a",class_="detail"):
    if telephone['href'].startswith('tel:'):
        print(telephone.text)

They both produce the same output:

0387362531
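
As a possible shortening of the .select() variant, the class check can be folded into the CSS selector itself, so no if statement is needed. This sketch assumes a reasonably recent BeautifulSoup (one that handles combined class and attribute selectors) and a trimmed-down, well-formed copy of the markup. It reads the number from the href rather than from the link text, which is a design choice made here, not something taken from the answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down, well-formed version of the markup from the question (an assumption).
html_doc = """
<ul>
  <li><span class="detailType">name:</span><span class="detail">merry</span></li>
  <li><span class="detailType">phone 1</span><a class="detail" href="tel:0387362531">0387362531</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "a.detail" requires the class; "[href^='tel:']" requires the href prefix.
links = soup.select("a.detail[href^='tel:']")

# Drop the "tel:" prefix from each href to get the bare number.
numbers = [a["href"][len("tel:"):] for a in links]
print(numbers)
```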

4 answers

Answer 1
1, accepted, 2018-06-20 14:39:51

Answer 2
0, 2018-06-20 14:41:29

Answer 3
0, 2018-06-20 14:48:42

Answer 4
0, 2018-06-20 18:31:19
