Hi there, I am using Python 3 and BeautifulSoup to extract the schema link from a document. It works most of the time, but every now and then it can't find the schema.
The code I have looks like this (part of a larger body):
self.schema = self.soup.find(['link:schemaRef', 'schemaRef']).get('xlink:href')
self.namespaces = {}
for k in self.soup.find('html').attrs:
    if k.startswith("xmlns") or ":" in k:
        self.namespaces[k] = self.soup.find('html')[k].split(" ")
It has no issue finding the schema in markup like this:
<ix:references>
<link:schemaRef xlink:type="simple" xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" />
</ix:references>
but it can't find xlink:href in markup like this:
<references>
<schemaRef xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" xlink:type="simple" xmlns="http://www.xbrl.org/2003/linkbase"/>
</references>
The error I get is:
AttributeError                            Traceback (most recent call last)
<ipython-input-8-da0992ab9ae8> in <module>
     96
     97 with open(filename,encoding="utf8") as a:
---> 98     x = Parser(a)
     99     r = json.dumps(x.to_table(), indent=4)
    100     jsondata = json.loads(r)

~\OneDrive\Desktop\parser\core.py in __init__(self, f, raise_on_error)
     21         self.errors = []
     22
---> 23         self._get_schema()
     24
     25         self._get_contexts()

~\OneDrive\Desktop\parser\core.py in _get_schema(self)
     47         self.schema = self.soup.find(
     48
---> 49             ['link:schemaRef', 'schemaRef']).get('xlink:href')
     50
     51         self.namespaces = {}

AttributeError: 'NoneType' object has no attribute 'get'
Any help would be much appreciated.
Thank you.
From your traceback, the call
self.soup.find(['link:schemaRef', 'schemaRef'])
is returning None, so the chained .get raises the AttributeError. To protect against this, test the result before calling get, i.e.:
data = self.soup.find(['link:schemaRef', 'schemaRef'])
if data is not None:
    self.schema = data.get('xlink:href')
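In case it helps, here is a minimal sketch of that guard as a standalone helper. The helper names (is_schema_ref, extract_schema_href), the name set, and the inline sample documents are hypothetical and not from your code. It assumes the soup is built with 'html.parser'; comparing the lowercased tag name is my own assumption to cover both spellings, so adjust it if your parser behaves differently.

```python
from bs4 import BeautifulSoup

# Hypothetical set of accepted tag names, compared in lowercase.
SCHEMA_TAG_NAMES = {"link:schemaref", "schemaref"}


def is_schema_ref(tag):
    """Return True for a schemaRef element, whatever its letter case."""
    return tag.name.lower() in SCHEMA_TAG_NAMES


def extract_schema_href(soup):
    """Return the xlink:href of the first schemaRef element, or None."""
    node = soup.find(is_schema_ref)
    if node is None:
        # No schemaRef in this document: report that instead of crashing.
        return None
    return node.get("xlink:href")


# Sample markup based on the two variants in the question.
with_schema = (
    "<references>"
    '<schemaRef xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" '
    'xlink:type="simple" xmlns="http://www.xbrl.org/2003/linkbase"/>'
    "</references>"
)
without_schema = "<references></references>"

print(extract_schema_href(BeautifulSoup(with_schema, "html.parser")))
print(extract_schema_href(BeautifulSoup(without_schema, "html.parser")))
```

Returning None leaves it to the caller to decide whether a missing schema is an error or something to log and skip.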
@dspencer So this returns the correct schema.
from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not read as escapes
with open(r"F:\ErrorFolder\06647909.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')

resources = soup.find(['ix:references', 'references'])
#print(resources)
# Try the lowercase 'schemaref' as well as the mixed-case names
for s in resources.find_all(['link:schemaRef', 'schemaRef', 'schemaref']):
    x = s.get('xlink:href')
    print(x)
So I just need to change things around. It seems the real issue might be schemaref vs schemaRef.
Thank you so much, you've been really helpful.
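A small sketch that may explain the schemaref vs schemaRef difference, assuming the soup is built with 'html.parser' as in the snippet above: Python's standard-library HTMLParser, which that BeautifulSoup backend builds on, reports tag names in lowercase. The TagNameRecorder class and the sample string are hypothetical and only there for illustration.

```python
from html.parser import HTMLParser


class TagNameRecorder(HTMLParser):
    """Collect every start tag name and its attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <schemaRef ... /> also arrive here by default.
        self.seen.append((tag, dict(attrs)))


sample = (
    "<references>"
    '<schemaRef xlink:href="https://xbrl.frc.org.uk/FRS-102/2014-09-01/FRS-102-2014-09-01.xsd" '
    'xlink:type="simple"/>'
    "</references>"
)

recorder = TagNameRecorder()
recorder.feed(sample)
recorder.close()

for name, attrs in recorder.seen:
    # The mixed-case 'schemaRef' from the markup shows up as 'schemaref'.
    print(name, attrs.get("xlink:href"))
```

If that is what is happening in your files, searching for the lowercase name (or comparing names case-insensitively) should be enough with this parser.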