
Extracting a link from HTML using Python and BeautifulSoup: 'NoneType' object has no attribute 'attrs'

Hi there, I am using Python 3 and BeautifulSoup to try to extract the schema link. It works most of the time, but every now and then it can't find the schema.

The code I have looks like this (part of a larger body):

self.schema = self.soup.find(['link:schemaRef', 'schemaRef']).get('xlink:href')

self.namespaces = {}

for k in self.soup.find('html').attrs:
    if k.startswith("xmlns") or ":" in k:
        self.namespaces[k] = self.soup.find('html')[k].split(" ")

It has no issue finding the schema in this kind of markup:

<ix:references>
    <link:schemaRef xlink:type="simple" xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" />
</ix:references>

but it can't find xlink:href in markup like this:

<references>
    <schemaRef xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" xlink:type="simple" xmlns="http://www.xbrl.org/2003/linkbase"/>
</references>

The error I get is:

AttributeError                            Traceback (most recent call last)
<ipython-input-8-da0992ab9ae8> in <module>
     96 
     97         with open(filename,encoding="utf8") as a:
---> 98             x = Parser(a)
     99             r = json.dumps(x.to_table(), indent=4)
    100             jsondata = json.loads(r)

~\OneDrive\Desktop\parser\core.py in __init__(self, f, raise_on_error)
     21         self.errors = []
     22 
---> 23         self._get_schema()
     24 
     25         self._get_contexts()

~\OneDrive\Desktop\parser\core.py in _get_schema(self)
     47         self.schema = self.soup.find(
     48 
---> 49             ['link:schemaRef', 'schemaRef']).get('xlink:href')
     50 
     51         self.namespaces = {}

AttributeError: 'NoneType' object has no attribute 'get'

Any help would be much appreciated.

Thank you.

From your traceback, the call

self.soup.find(['link:schemaRef', 'schemaRef'])

is returning None. To protect against this, you should test the result before calling get, i.e.:

data = self.soup.find(['link:schemaRef', 'schemaRef'])
if data is not None:
    self.schema = data.get('xlink:href')
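
A slightly fuller sketch of the same guard, in case it helps. The helper name `find_schema_href` is hypothetical (it is not part of your Parser class), and `html.parser` is only an assumption here, since the question does not show which parser the class uses:

```python
from bs4 import BeautifulSoup


# Hypothetical helper, not part of the original Parser class.
def find_schema_href(soup):
    """Return the xlink:href of the first schemaRef tag, or None if absent."""
    tag = soup.find(["link:schemaRef", "schemaRef"])
    if tag is None:
        # find() found no matching tag, so there is nothing to call .get() on.
        return None
    return tag.get("xlink:href")


# A document with no schemaRef at all: the helper returns None instead of raising.
empty = BeautifulSoup("<html><body></body></html>", "html.parser")
print(find_schema_href(empty))
```

The caller can then decide what a missing schema should mean, for example recording it in self.errors rather than crashing in __init__.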

@dspencer So this returns the correct schema:

from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not treated as escape sequences.
with open(r"F:\ErrorFolder\06647909.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')
    resources = soup.find(['ix:references', 'references'])
    #print(resources)
    for s in resources.find_all(['link:schemaRef', 'schemaRef', 'schemaref']):
        x = s.get('xlink:href')
        print(x)

So I just need to rearrange my code. It seems the real issue might be schemaref vs. schemaRef.
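
A minimal sketch of why the case matters, assuming `html.parser` as in the snippet above: that parser converts tag names to lowercase, so a search for the mixed-case name finds nothing. The helper `is_schema_ref` is hypothetical and only illustrates one way to match regardless of case:

```python
from bs4 import BeautifulSoup

html = (
    '<references>'
    '<schemaRef xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" '
    'xlink:type="simple" xmlns="http://www.xbrl.org/2003/linkbase"/>'
    '</references>'
)
soup = BeautifulSoup(html, "html.parser")

# html.parser stores the tag as "schemaref", so the mixed-case name does not match.
missing = soup.find("schemaRef")
print(missing)


# Hypothetical helper: compare lowercased names, with or without the link: prefix.
def is_schema_ref(tag):
    return tag.name.lower() in ("schemaref", "link:schemaref")


tag = soup.find(is_schema_ref)
href = tag.get("xlink:href") if tag is not None else None
print(href)
```

Passing a function to find() avoids listing every spelling of the tag name by hand. With a case-preserving parser the same function should still match, since it lowercases the name before comparing.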

Thank you so much, you've been really helpful.
