Python BeautifulSoup: can't call prettify()

I seem to be doing something wrong. I have an HTML source that I pull using urllib. From this HTML I use BeautifulSoup's findAll to collect the elements whose ID is in a specified list. This works, but the output is messy and includes literal line breaks ("\n").

  • Python: 2.7.12
  • BeautifulSoup: bs4

I have tried to use prettify() to correct the output but always get an error:

AttributeError: 'ResultSet' object has no attribute 'prettify'

import urllib
import re
from bs4 import BeautifulSoup

cfile = open("test.txt")
clist = cfile.read()
clist = clist.split('\n')

i=0

while i<len (clist):
    url = "https://example.com/"+clist[i]
    htmlfile = urllib.urlopen (url)
    htmltext = htmlfile.read()

    soup = BeautifulSoup (htmltext, "html.parser")
    soup = soup.findAll (id=["id1", "id2", "id3"])

    print soup.prettify()
    i+=1

I'm sure there is something simple I am overlooking with this line:

soup = soup.findAll (id=["id1", "id2", "id3"])

I'm just not sure what. Sorry if this is a stupid question. I've only been using Python and Beautiful Soup for a few days.

You are reassigning the soup variable to the result of .findAll(), which is a ResultSet object (basically a list of tags). A ResultSet does not have a prettify() method.

The solution is to keep the soup variable pointing to the BeautifulSoup instance.

You can call prettify() on the top-level BeautifulSoup object or on any of its Tag objects.
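A minimal sketch of that fix, under a few assumptions: the html string is made-up sample markup standing in for the page fetched with urllib, the names matches, pretty_tags and whole_document are only illustrative, and find_all is the current bs4 spelling of findAll. It uses print() as a function so it also runs on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the page fetched with urllib.
html = (
    '<html><body>'
    '<div id="id1"><p>first</p></div>'
    '<span id="id2">second</span>'
    '<p id="other">skip</p>'
    '</body></html>'
)

soup = BeautifulSoup(html, "html.parser")

# Store the matches under a new name so soup still refers to the parsed document.
matches = soup.find_all(id=["id1", "id2", "id3"])

# Each item in the result is a Tag, and Tag has a prettify() method.
pretty_tags = [tag.prettify() for tag in matches]
for text in pretty_tags:
    print(text)

# The whole document can still be prettified because soup was not overwritten.
whole_document = soup.prettify()
```

In the loop from the question, the same idea would mean assigning the findAll result to a separate variable and calling prettify() on each tag in it.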

findAll returns a list of matching tags, so your code is equivalent to [tag1, tag2, ...].prettify(), which will not work.
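A quick way to check this for yourself, as a rough sketch: the markup below is invented for illustration, and the variable names are arbitrary. It only inspects the object that find_all (the current name for findAll) returns.

```python
from bs4 import BeautifulSoup

# Invented two-element document, just enough to get a non-empty result.
html = '<div id="id1">a</div><div id="id2">b</div>'
soup = BeautifulSoup(html, "html.parser")
result = soup.find_all(id=["id1", "id2"])

# The result behaves like a list: it is a list subclass with no prettify().
is_list = isinstance(result, list)
has_prettify = hasattr(result, "prettify")

# The individual items are Tag objects, and those do have prettify().
first_has_prettify = hasattr(result[0], "prettify")

print(is_list, has_prettify, first_has_prettify)
```

So prettify() has to be called on the items, not on the container.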

