BeautifulSoup, maximum recursion depth reached

Question

This is a BeautifulSoup procedure that grabs the content within all <p> HTML tags. After grabbing content from some web pages, I get an error saying that the maximum recursion depth was exceeded.

def printText(tags):
    for tag in tags:
        if tag.__class__ == NavigableString:
            print tag,
        else:
            printText(tag)
    print ""
# loop over URLs, send soup to printText procedure

The bottom of the traceback:

  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 13, in printText
    if tag.__class__ == NavigableString:
RuntimeError: maximum recursion depth exceeded in cmp

Answer 1

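Your printText() calls itself recursively whenever it encounters anything whose class is not exactly NavigableString. That includes instances of NavigableString subclasses, such as Comment. Calling printText() on a Comment iterates over the text of the comment character by character; each character is a plain string, which fails the class comparison as well and is recursed into again. That is the infinite recursion you see.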

I recommend using isinstance() in your if statement instead of comparing class objects:

if isinstance(tag, basestring):

I diagnosed this problem by inserting a print statement before the recursion:

print "recursing on", tag, type(tag)
printText(tag)
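
To illustrate the difference between the two checks, here is a minimal sketch in Python 3 (the snippets above are Python 2). The NavigableString and Comment classes below are stand-ins defined only for this example, assumed to mirror the fact that BeautifulSoup's Comment is a subclass of its string class; collect_text is a hypothetical helper, not part of BeautifulSoup.

```python
# Stand-in classes (assumptions for this sketch, not the real bs4 classes).
class NavigableString(str):
    """Plays the role of BeautifulSoup's text node class."""


class Comment(NavigableString):
    """Plays the role of BeautifulSoup's Comment, a NavigableString subclass."""


def collect_text(nodes, out=None):
    """Recursively collect every string found in a nested structure."""
    if out is None:
        out = []
    for node in nodes:
        # isinstance() accepts subclasses, so a Comment stops the recursion here.
        if isinstance(node, str):
            out.append(str(node))
        else:
            collect_text(node, out)
    return out


comment = Comment("a comment")
exact_match = comment.__class__ == NavigableString      # False: the class is Comment
subclass_match = isinstance(comment, NavigableString)   # True: subclasses count

# Nested lists stand in for nested tags.
tree = [NavigableString("Hello"), [comment, NavigableString("world")]]
texts = collect_text(tree)
print(exact_match, subclass_match, texts)
```

Note that with this check the comment text is collected along with the ordinary text; if comments should be skipped, they would need to be filtered out separately.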

Answer 2

You probably hit a string. Iterating over a string yields 1-length strings. Iterating over that 1-length string yields a 1-length string. Iterating over THAT 1-length string...
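
A small sketch of that effect, written for Python 3 (where the error is reported as RecursionError, a subclass of RuntimeError). The walk function is a hypothetical stand-in for a recursive traversal like printText that never stops at strings:

```python
def walk(node):
    """Recurse into every child without ever treating strings as leaves."""
    for child in node:
        walk(child)


single = "a"
# Iterating over a 1-length string yields that same 1-length string again.
same_again = next(iter(single))

try:
    walk(single)
    hit_limit = False
except RecursionError:
    # The traversal never bottoms out, so the interpreter's limit is reached.
    hit_limit = True

print(same_again == single, hit_limit)
```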

Answer 3

I had the same problem. If you have nested tags with a depth of about 480 levels and you want to convert the tag to a string/unicode, you get "RuntimeError: maximum recursion depth exceeded". Every level needs two nested method calls, so you soon hit the default limit of 1000 nested Python calls. You can raise this limit (see sys.setrecursionlimit), or you can use this helper. It extracts all text from the HTML and displays it in a <pre> element:

def beautiful_soup_tag_to_unicode(tag):
    """Return tag as unicode; fall back to its plain text if it is nested too deeply."""
    try:
        return unicode(tag)
    except RuntimeError as e:
        if not str(e).startswith('maximum recursion'):
            raise
        # With more than about 480 levels of nested tags, unicode(tag) can hit
        # the maximum recursion depth, so collect the text nodes instead.
        out = []
        for mystring in tag.findAll(text=True):
            mystring = mystring.strip()
            if not mystring:
                continue
            out.append(mystring)
        return u'<pre>%s</pre>' % u'\n'.join(out)
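
For the other option mentioned above, raising the limit, here is a minimal Python 3 sketch. The recursion_limit context manager and the depth function are hypothetical helpers written for this example; the only library calls used are sys.getrecursionlimit() and sys.setrecursionlimit(). The nested list is just a stand-in for deeply nested tags, and the value 5000 is an arbitrary assumption.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring the old value afterwards."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)


def depth(nested):
    """Recursively measure how deeply a list is nested."""
    if not nested:
        return 0
    return 1 + depth(nested[0])


# Build a structure 1500 levels deep, more than the default limit of 1000 allows.
nested = []
for _ in range(1500):
    nested = [nested]

limit_before = sys.getrecursionlimit()
with recursion_limit(5000):
    measured = depth(nested)
limit_after = sys.getrecursionlimit()

print(measured, limit_before == limit_after)
```

Setting the limit too high can crash the interpreter instead of raising an error, so it seems safer to raise it only around the call that needs it and to keep a fallback like the helper above.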

3 answers

solution1 5 2012-04-12 13:58:21
solution2 1 ACCEPTED 2012-04-12 06:06:00
solution3 0 2012-08-28 09:31:26
