python 2.7, xml, beautifulsoup4: only return matching parent tag

I'm trying to parse some XML, but I'm running into issues forcing it to select the requested tag only if it's a parent tag. For example, part of my XML is:

<Messages>
    <Message ChainCode="LI" HotelCode="5501" ConfirmationID="5501">
      <MessageContent>
        <OTA_HotelResNotifRQ TimeStamp="2014-01-24T21:02:43.9318703Z" Version="4" ResStatus="Book">
          <HotelReservations>
            <HotelReservation>
              <RoomStays>
                <RoomStay MarketCode="CC" SourceOfBusiness="CRS">
                  <RoomRates>
                    <RoomRate EffectiveDate="2014-02-04" ExpireDate="2014-02-06" RoomTypeCode="12112" NumberOfUnits="1" RatePlanCode="RAC">
                      <Rates>
                        <Rate EffectiveDate="2014-02-04" ExpireDate="2014-02-06" RateTimeUnit="Day" UnitMultiplier="3">
                          <Base AmountBeforeTax="749.25" CurrencyCode="USD" />
                          <Total AmountBeforeTax="749.25" CurrencyCode="USD" />
                        </Rate>
                      </Rates>
                    </RoomRate>
                  </RoomRates>
                  <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
                    <Taxes Amount="0.00" />
                  </Total>
                </RoomStay>
              </RoomStays>
            </HotelReservation>
          </HotelReservations>
        </OTA_HotelResNotifRQ>
      </MessageContent>
    </Message>
  </Messages>

I've gotten the whole thing parsed how I need it with the exception of the "Total" tag.

The total tag I'm trying to get is:

 <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
     <Taxes Amount="0.00" />
 </Total>

What's happening is that it's returning the "Total" tag that is a child of RoomRates\RoomRate\Rates\Rate. I'm trying to figure out how to make it return only the RoomStays\RoomStay\Total tag. What I currently have is:

from bs4 import BeautifulSoup as bs

# response holds the raw XML text shown above
soup = bs(response, "xml")

messages = soup.find_all('Message')

for message in messages:
    hotel_code = message.get('HotelCode')

    reservations = message.find_all('HotelReservation')
    for reservation in reservations:
        uniqueid_id = reservation.UniqueID.get('ID')
        uniqueid_idcontext = reservation.UniqueID.get('ID_Context')

        roomstays = reservation.find_all('RoomStay')
        for roomstay in roomstays:

            total = roomstay.Total

Any ideas on how to specify the exact tag I'm trying to pull? If anyone is wondering about the for loops, it's because there are normally multiple "Message", "HotelReservation", "RoomStay", etc. tags, but I've removed them to show only one. There can also sometimes be multiple Rates\Rate tags, so I can't just ask it to give me the 2nd "Total" tag.

Hopefully I've explained this okay.

"There can also sometimes be multiple Rates\Rate tags, so I can't just ask it to give me the 2nd "Total" tag."

Why not just iterate over all the Total tags and skip the ones that have no Taxes child?

reservations = message.find_all('HotelReservation')
for reservation in reservations:
    totals = reservation.find_all('Total')
    for total in totals:
        if total.find('Taxes') is not None:
            # do stuff with this total
            pass
        else:
            # these aren't the totals you're looking for
            pass

If you more generally want to eliminate those that have no child nodes, you could do either of these:

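# Option 1: check whether the tag has at least one child
if next(iter(total.children), None) is not None:
    # it's a parent of something
    pass

# Option 2: an empty contents list is falsy
if total.contents:
    # it's a parent of something
    pass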

Or you could use a function instead of a string as your filter:

total = reservation.find(lambda node: node.name == 'Total' and node.contents)

Or you could look at other ways to locate this tag: it's a direct child of RoomStay rather than just a descendant; it's not a descendant of Rate; it's the last Total descendant under a RoomStay; etc. All of these can be done just as easily.
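
Two of those alternatives could be sketched roughly as follows. This is only an illustration on a trimmed-down copy of the XML from the question, and it assumes lxml is installed so that the "xml" parser is available. The names stay_totals and not_in_rate are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample based on the XML in the question.
xml = """<RoomStays>
  <RoomStay MarketCode="CC" SourceOfBusiness="CRS">
    <RoomRates>
      <RoomRate RoomTypeCode="12112">
        <Rates>
          <Rate RateTimeUnit="Day" UnitMultiplier="3">
            <Base AmountBeforeTax="749.25" CurrencyCode="USD" />
            <Total AmountBeforeTax="749.25" CurrencyCode="USD" />
          </Rate>
        </Rates>
      </RoomRate>
    </RoomRates>
    <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
      <Taxes Amount="0.00" />
    </Total>
  </RoomStay>
</RoomStays>"""

soup = BeautifulSoup(xml, "xml")  # the "xml" parser needs lxml

# Direct child only: recursive=False stops find() from descending
# into RoomRates, so the per-rate Total is never considered.
stay_totals = []
for roomstay in soup.find_all("RoomStay"):
    total = roomstay.find("Total", recursive=False)
    if total is not None:
        stay_totals.append(total.get("AmountBeforeTax"))

# Not a descendant of Rate: keep every Total that has no Rate ancestor.
not_in_rate = [t.get("AmountBeforeTax")
               for t in soup.find_all("Total")
               if t.find_parent("Rate") is None]
```

In the loop from the question, the first variant would amount to replacing roomstay.Total with roomstay.find('Total', recursive=False), since roomstay.Total searches all descendants and returns the first match.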


That being said, this seems like a perfect job for XPath, which BeautifulSoup doesn't support, but ElementTree and lxml do…
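
As a rough idea of what that might look like, here is a sketch using the standard library's ElementTree, which supports only a limited subset of XPath. It runs on a trimmed-down copy of the XML from the question, and the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample based on the XML in the question.
xml = """<RoomStays>
  <RoomStay MarketCode="CC" SourceOfBusiness="CRS">
    <RoomRates>
      <RoomRate RoomTypeCode="12112">
        <Rates>
          <Rate RateTimeUnit="Day" UnitMultiplier="3">
            <Base AmountBeforeTax="749.25" CurrencyCode="USD" />
            <Total AmountBeforeTax="749.25" CurrencyCode="USD" />
          </Rate>
        </Rates>
      </RoomRate>
    </RoomRates>
    <Total AmountBeforeTax="2247.75" CurrencyCode="USD">
      <Taxes Amount="0.00" />
    </Total>
  </RoomStay>
</RoomStays>"""

root = ET.fromstring(xml)

# "./Total" matches only Total elements that are direct children
# of the RoomStay being searched.
stay_totals = [total.get("AmountBeforeTax")
               for roomstay in root.iter("RoomStay")
               for total in roomstay.findall("./Total")]

# The same selection written as a single path from the root.
same_totals = [t.get("AmountBeforeTax")
               for t in root.findall(".//RoomStay/Total")]

# For comparison, the per-rate totals nested under Rate.
rate_totals = [t.get("AmountBeforeTax")
               for t in root.findall(".//Rate/Total")]
```

With lxml, the equivalent query would go through its xpath() method, which accepts full XPath 1.0 expressions.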
