Python: extracting a number from an HTML tag using Beautiful Soup

I am working on a web scraper using Beautiful Soup. Here is my function:

    journalist_result = soup.find_all("h4",class_="slab")
    if len(journalist_result)>0:
        journalist_share = int(re.match(r'\d+', journalist_result[0].get_text()).group())
    else:
        journalist_share=0

Basically, I want to extract the number of journalists who shared a link. In this case it is 221 (see the example below):

CASE 1:

<h4 class="slab">221 journalists shared this link.
      <a href="/pros">Join</a> or <a href="/account/login?next=/whosharedmylink/?url=http://www.cnn.com/">sign in</a> to Muck Rack to view their names.</h3>

My code works fine when there are journalist shares or when a URL is not found. However, it breaks in the following case:

CASE 2:

<h4 class="slab" style="margin-bottom:5px">

      This link hasn't yet been shared by any journalists.<br /><a href="/pros">Learn about using Muck Rack Pro</a> to connect with journalists.
</h4>

This is because in case 2 no journalists are found. The error I am getting is:

    Traceback (most recent call last):
      File "muckrackscraper.py", line 65, in
        journalist_share = int(re.match(r'\\d+', journalist_result[0].get_text()).group())
    AttributeError: 'NoneType' object has no attribute 'group'

Thanks in advance for any help!

It seems like you've misunderstood why your code is failing. The failure is not in the second branch of your code but in the first, where you don't check the return value of re.match and then call a method on None.

From the re.match documentation:

Return None if the string does not match the pattern; note that this is different from a zero-length match.
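To illustrate the distinction drawn in that quote, here is a small sketch. The sample string is only an assumption, loosely based on the text in case 2.

```python
import re

sample = "This link hasn't yet been shared by any journalists."

# \d+ needs at least one digit at the start of the string.
# There is none, so re.match returns None (no match at all).
no_match = re.match(r'\d+', sample)
print(no_match)

# \d* is allowed to match zero digits, so this succeeds,
# but the matched text is the empty string (a zero-length match).
empty_match = re.match(r'\d*', sample)
print(repr(empty_match.group()))
```

Calling .group() on the first result raises the AttributeError from the question; the second result is a real match object whose text happens to be empty.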

So your pattern is not matching whatever is in journalist_result[0].get_text(); try inspecting this value, and also add a check for None.
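One way to add that check is sketched below. The helper name extract_share_count is hypothetical, and the two sample strings only approximate what get_text() might return for the two cases in the question. Note that the sketch swaps re.match for re.search, on the assumption that the number may be preceded by whitespace; re.match only tries the start of the string.

```python
import re

def extract_share_count(text):
    """Return the first integer found in text, or 0 if there is none.

    Hypothetical helper; not part of the original scraper.
    """
    # re.search scans the whole string, whereas re.match only
    # attempts a match at the very beginning.
    match = re.search(r'\d+', text)
    # A failed search returns None, so check before calling .group().
    return int(match.group()) if match else 0

# Approximations of get_text() output for the two cases (assumed).
case1 = "221 journalists shared this link.\n      Join or sign in to Muck Rack to view their names."
case2 = "\n\n      This link hasn't yet been shared by any journalists.\n"

print(extract_share_count(case1))
print(extract_share_count(case2))
```

In the scraper this would be called as extract_share_count(journalist_result[0].get_text()) inside the existing len(journalist_result) > 0 branch.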
