
Python Beautiful Soup Print exact TD tag

I am trying to use BS4, and I want to print the exact TD tag AUD/AED from the example below. I understand that I could use some kind of indexing like [-1] to always get the last one, but in some of the other data the TD tag I want will be in the middle. Is there a way I can select the AUD/AED tag specifically?

Example:

<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>

Code I am using to get this:

from bs4 import BeautifulSoup

# r holds the HTML of the page
soup = BeautifulSoup(r, "html.parser")
table = soup.find(attrs={"class": "RESULTS"})
print(table)
days = table.find_all('tr')

This gets every TR tag, and days[-1] would give me the last one, but I need to find the TR tag that contains the TD tag AUD/AED.

I am looking for something like:

if td[2] == <td>AUD/AED</td>:
    print(tr[-1])

This sort of thing is much (much) cleaner if you have a CSS selector to go off of, but it looks like we can't do that here.

The next-best alternative is just to explicitly find the tag you want:

soup.find(class_='RESULTS').find(text='AUD/AED')

And then navigate from there using the bs4 API.

import re

tr = soup.find(class_='RESULTS').find(text='AUD/AED').parent.parent
tr.find(text=re.compile(r'\w+ \d{1,2} \w+ \d{4}'))
# 'Wednesday 23 APR 2014'

This approach makes no assumptions about the layout of tr's children; it just searches the row that contains AUD/AED for text that looks like a date (according to the regex).
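
As a self-contained sketch of the steps above, the following runs the lookup against the example table from the question. It assumes the html.parser backend and uses string=, the newer name for the text= argument (Beautiful Soup 4.4.0 and later). find_parent('tr') stands in for .parent.parent, and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Example table copied from the question.
html = """<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Find the text node "AUD/AED" inside the RESULTS table.
label = soup.find(class_="RESULTS").find(string="AUD/AED")

# Walk up to the enclosing row without counting .parent hops.
row = label.find_parent("tr")

# Search that row for text shaped like "Weekday DD MON YYYY".
spot_date = row.find(string=re.compile(r"\w+ \d{1,2} \w+ \d{4}"))
print(spot_date)
```

Using find_parent('tr') keeps working if the cell text is later wrapped in another tag such as a span, which would break a fixed .parent.parent chain.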

Something like this? Assuming soup is your table.

desiredIndex = None
cells = soup.find_all('td')
for cellIndex, cell in enumerate(cells):
    if cell.text == u'AUD/AED':
        # the wanted value is in the cell right after the match
        desiredIndex = cellIndex + 1
        break
if desiredIndex is not None and desiredIndex < len(cells):
    # desiredIndex was found and is inside the list
    print(cells[desiredIndex].text)
else:
    print("cell not found")
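
A row-by-row variant of the same idea, closer to the "if td[2] == ..." check sketched in the question, might look like the following. This is only a sketch: find_row_cells is a hypothetical helper name, the column index 2 is taken from the example table, and the html.parser backend is an assumption.

```python
from bs4 import BeautifulSoup

# Example table copied from the question.
HTML = """<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>"""


def find_row_cells(soup, wanted, column=2):
    """Return the cell texts of the first row whose cell at `column`
    equals `wanted`, or None if no row matches (hypothetical helper)."""
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Header rows contain only <th>, so their cell list is empty and skipped.
        if len(cells) > column and cells[column] == wanted:
            return cells
    return None


soup = BeautifulSoup(HTML, "html.parser")
cells = find_row_cells(soup, "AUD/AED")
print(cells[3] if cells else "cell not found")
```

Checking one row at a time keeps the match and the value in the same TR, so a match in the last cell of one row cannot spill over into the first cell of the next.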

I'd probably use lxml and XPath:

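from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
from lxml import etree

# table must be the HTML as a string (use str(table) if it is a BeautifulSoup tag)
tree = etree.parse(StringIO(table), etree.HTMLParser())
# pick the row whose third cell is AUD/AED, then take the text of its fourth cell
d = tree.xpath("//table[@class='RESULTS']/tr[./td[3][text()='AUD/AED']]/td[4]/text()")[0]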

The variable d should contain the string "Wednesday 23 APR 2014".

If you really want BeautifulSoup, you can mix lxml and BeautifulSoup, no problem.
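
If lxml is not installed, the same path-based lookup can be tried with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is a swapped-in technique, not the lxml approach itself, and it only works here on the assumption that the table snippet is well-formed XML, as the example in the question happens to be; real-world HTML often is not. Note also that the predicate below matches a row with any td equal to AUD/AED, not specifically the third one.

```python
import xml.etree.ElementTree as ET

# Example table copied from the question; it is well-formed, so an XML parser accepts it.
table_xml = """<table class="RESULTS" width="100%">
<tr>
<th align="left">Base Currency</th>
<th align="left">Quote Currency</th>
<th align="left">Instrument</th>
<th align="left">Spot Date</th>
</tr>
<tr>
<td>AUD</td>
<td>AED</td>
<td>AUD/AED</td>
<td>Wednesday 23 APR 2014</td>
</tr>
</table>"""

root = ET.fromstring(table_xml)  # root is the <table> element

# First <tr> that has a <td> child whose text is exactly "AUD/AED".
row = root.find("./tr[td='AUD/AED']")

# Fourth <td> of that row (positions are 1-based in this XPath subset).
spot_date = row.find("td[4]").text if row is not None else None
print(spot_date)
```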
