Unicode object error in parsing XML using BeautifulSoup

Parsing the contents of the 'name' tag in the XML output below with BeautifulSoup gives me the following error:

AttributeError: 'unicode' object has no attribute 'get_text'

XML Output:

<show>
  <stud>
    <__readonly__>
      <TABLE_stud>
        <ROW_stud>
          <name>rice</name>
          <dept>chem</dept>
          .
          .
          .
        </ROW_stud>
      </TABLE_stud>
    </__readonly__>
  </stud>
</show>

However, if I access the contents of other tags such as 'dept', it works fine.

stud_info = output_xml.find_all('row_stud')
for eachStud in range(len(stud_info)):
    print stud_info[eachStud].dept.get_text()   # Gives 'chem'
    print stud_info[eachStud].name.get_text()   # Raises the AttributeError shown above

Can any Python/BeautifulSoup experts help me resolve this? (I know BeautifulSoup is not ideal for parsing XML, but let's just say I'm compelled to use it.)

Tag.name is an attribute containing the tag name; its value here is row_stud. That value is a plain unicode string, not a Tag, which is why calling .get_text() on it fails.

Attribute access to contained tags is a shortcut for .find(attributename), but it only works if the API does not already have an attribute with the same name. Use .find() instead:

print stud_info[eachStud].find('name').get_text()

You can loop over the stud_info result list directly; there is no need to use range() here:

stud_info = output_xml.find_all('row_stud')
for eachStud in stud_info:
    print eachStud.dept.get_text()
    print eachStud.find('name').get_text()
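The name collision described above can be reproduced with a short, self-contained sketch. It is written for Python 3 and uses the built-in html.parser; the sample string is a trimmed version of the XML from the question (the __readonly__ wrapper and the elided fields are left out), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the XML in the question (assumption: one row only).
sample = (
    "<show><stud><TABLE_stud><ROW_stud>"
    "<name>rice</name><dept>chem</dept>"
    "</ROW_stud></TABLE_stud></stud></show>"
)

# html.parser lowercases tag names, so the row is found as 'row_stud'.
soup = BeautifulSoup(sample, "html.parser")
rows = soup.find_all("row_stud")

results = []
for row in rows:
    tag_name = row.name                      # the tag's own name, a plain string
    student = row.find("name").get_text()    # the contents of the <name> child
    dept = row.dept.get_text()               # the shortcut works: no clash for 'dept'
    results.append((tag_name, student, dept))

print(results)
```

Here row.name is a string, so it has no get_text() method, while row.find("name") returns the child Tag.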

I notice that you are searching for row_stud in lower case. If you are parsing XML with BeautifulSoup, make sure that you have lxml installed and tell BeautifulSoup that it is XML you are processing, so that it won't HTML-ize your tags (that is, lowercase them):

soup = BeautifulSoup(source, 'xml')
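As a rough illustration of the difference, the sketch below parses the sample from the question with the 'xml' feature. It assumes Python 3 with lxml installed (BeautifulSoup raises an error for 'xml' otherwise); the elided fields from the question are omitted, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample based on the XML in the question, without the elided fields.
source = """<show>
  <stud>
    <__readonly__>
      <TABLE_stud>
        <ROW_stud>
          <name>rice</name>
          <dept>chem</dept>
        </ROW_stud>
      </TABLE_stud>
    </__readonly__>
  </stud>
</show>"""

# The 'xml' feature needs lxml and keeps tag names exactly as written.
soup = BeautifulSoup(source, "xml")

# The search is now case-sensitive: use the original spelling.
rows = soup.find_all("ROW_stud")
names = [row.find("name").get_text() for row in rows]

# The lowercase spelling no longer matches anything.
lowercase_rows = soup.find_all("row_stud")

print(names, len(lowercase_rows))
```

With the XML parser, find_all('row_stud') finds nothing, so the search string has to be changed to 'ROW_stud' as well.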
