Python: extracting a number from an HTML tag using Beautiful Soup

Question

I am working on a web scraper using Beautiful Soup. Here is my function:

    journalist_result = soup.find_all("h4",class_="slab")
    if len(journalist_result)>0:
        journalist_share = int(re.match(r'\d+', journalist_result[0].get_text()).group())
    else:
        journalist_share=0

Basically, what I want to do is extract the number of journalists who shared a link. In this case it is 221 (see the example below):

CASE1:

<h4 class="slab">221 journalists shared this link.
      <a href="/pros">Join</a> or <a href="/account/login?next=/whosharedmylink/?url=http://www.cnn.com/">sign in</a> to Muck Rack to view their names.</h3>

My code works fine for cases where there are journalist shares or where a URL is not found. However, my code breaks on the following case:

CASE2:

<h4 class="slab" style="margin-bottom:5px">

      This link hasn't yet been shared by any journalists.<br /><a href="/pros">Learn about using Muck Rack Pro</a> to connect with journalists.
</h4>

This is because in case 2 no journalists are found. The error I am getting is:

    Traceback (most recent call last):
      File "muckrackscraper.py", line 65, in
        journalist_share = int(re.match(r'\d+', journalist_result[0].get_text()).group())
    AttributeError: 'NoneType' object has no attribute 'group'

Thanks in advance for any help!

Answer 1

It seems you've misunderstood why your code is failing. The problem is not in case 2 but in case 1, where you don't check the return value of re.match and then attempt a method call on None.

From the re.match documentation:

Return None if the string does not match the pattern; note that this is different from a zero-length match.
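To make the quoted behaviour concrete, here is a small sketch using only the standard library. The sample strings are shortened versions of the two cases in the question, and the variable names are just for illustration.

```python
import re

# Text that starts with digits: re.match returns a match object.
found = re.match(r'\d+', '221 journalists shared this link.')
print(found.group())  # 221

# Text with no leading digits: re.match returns None, so calling
# .group() on the result would raise AttributeError.
not_found = re.match(r'\d+', "This link hasn't yet been shared by any journalists.")
print(not_found)  # None

# A zero-length match is different: \d* can match the empty string,
# so a match object is returned even though no digits were consumed.
empty = re.match(r'\d*', "This link hasn't yet been shared by any journalists.")
print(repr(empty.group()))  # ''
```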

So your pattern is not matching whatever is in journalist_result[0].get_text(); try inspecting this value, and also add a check for None.
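One possible way to add that check is sketched below. The helper name parse_share_count is hypothetical (it is not part of the question's code), and stripping leading whitespace before matching is an added assumption, made so that text beginning with newlines or spaces is handled as well.

```python
import re

def parse_share_count(text):
    """Return the integer at the start of text, or 0 if there is none.

    Hypothetical helper: the name and the strip() call are assumptions,
    not part of the original scraper.
    """
    match = re.match(r'\d+', text.strip())
    if match is None:
        # No leading number (as in CASE2): treat it as zero shares.
        return 0
    return int(match.group())

# Sample strings standing in for journalist_result[0].get_text()
print(parse_share_count("221 journalists shared this link."))  # 221
print(parse_share_count("\n\n      This link hasn't yet been shared by any journalists."))  # 0
```

In the scraper itself, the call would presumably replace the int(re.match(...).group()) expression, for example journalist_share = parse_share_count(journalist_result[0].get_text()) inside the existing if branch.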

Answer 1
1 2013-11-26 20:46:43
