
Find a List of Tags Based on Text Value of Children in Beautiful Soup

I have a question about selecting a list of tags (or a single tag) using a condition on the text of one of its children. Specifically, given the HTML code:

<tbody>
<tr class="" data-row="0">
<tr class="" data-row="1">
<tr class="" data-row="2">
    <td align="right" csk="13">13</td>
    <td align="left" csk="Jones,Andre"><a href="/players/andre-jones-2.html">Andre Jones</a>       
    </td>
<tr class="" data-row="3">
    <td align="right" csk="7">7</td>
    <td align="left" csk="Jones,DeAndre"><a href="/players/deandre-jones-1.html">DeAndre Jones</a>
    </td>
 <tr class="" data-row="4">
 <tr class="" data-row="5">

I have a unicode variable Player coming from an external loop, and I am trying to look through each row of the table to extract the <tr> tags where Player == Table.tr.a.text, so that I can identify duplicate player names in Table. For instance, if more than one player has Player = Andre Jones, MyRow should contain every <tr> tag with that player's name; if only one row has Player = Andre Jones, MyRow should contain just the single <tr> whose anchor text equals Andre Jones. I've been trying things like

Table = soup.find('tbody')
MyRow = Table.find_all(lambda X: X.name=='tr' and Player == X.text)

But this returns [] for MyRow. If I instead use

MyRow = Table.find_all(lambda X: X.name=='tr' and Player in X.text)

This picks up any <tr> that has Player as a substring of X.text. In the example above, it extracts both the <tr> with Table.tr.td.a.text == 'Andre Jones' and the one with Table.tr.td.a.text == 'DeAndre Jones'. Any help would be appreciated.

You could do this easily with XPath and lxml:

import lxml.html

root = lxml.html.fromstring('''...''')
# xpath() returns a list of every matching <tr> element
rows = root.xpath('//tr[.//a[text() = "FooName"]]')

The BeautifulSoup "equivalent" would be something like:

rows = soup.find('tbody').find_all('tr')
# First row containing an <a> whose text is 'FooName', or None if there is none
row = next((row for row in rows if row.find('a', text='FooName')), None)

Or if you think about it backwards:

row = soup.find('a', text='FooName').find_parent('tr')

Whatever you desire. :)
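Since the question asks for every matching row rather than only the first, here is a minimal sketch of the same idea that collects all of them. The sample markup is modelled on the question's table with closing tags added, and `rows_for_player` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's table (closing tags added).
HTML = """
<table><tbody>
<tr class="" data-row="2">
    <td align="right" csk="13">13</td>
    <td align="left" csk="Jones,Andre"><a href="/players/andre-jones-2.html">Andre Jones</a>
    </td>
</tr>
<tr class="" data-row="3">
    <td align="right" csk="7">7</td>
    <td align="left" csk="Jones,DeAndre"><a href="/players/deandre-jones-1.html">DeAndre Jones</a>
    </td>
</tr>
</tbody></table>
"""


def rows_for_player(table, player):
    """Return every <tr> containing an <a> whose stripped text equals player."""
    return [
        row for row in table.find_all("tr")
        if any(a.get_text(strip=True) == player for a in row.find_all("a"))
    ]


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("tbody")
matches = rows_for_player(table, "Andre Jones")
print([row["data-row"] for row in matches])
```

Because the comparison is made against the anchor's own text, 'DeAndre Jones' is not picked up, and the result is a list whether there are zero, one, or several matching rows.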

Solution1

Logic: find the first tr tag whose text, including the text of its descendants, matches 'FooName'.

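# Exact match: compare the anchor's text, since tr.text also contains the other cells and whitespace
print(Table.find(lambda tag: tag.name == 'tr' and tag.a is not None and tag.a.get_text(strip=True) == 'FooName'))
# Fuzzy match: substring test against the whole row's text
# print(Table.find(lambda tag: tag.name == 'tr' and 'FooName' in tag.text))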

Output:

<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
<a href="/players/Foo-Name-1.html">FooName</a>
</td>
</tr>
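One thing worth checking with row-level comparisons: a tr's text is the concatenation of all its descendant strings, so it rarely equals the player name alone. This is probably why `Player == X.text` returned [] in the question. A minimal sketch, using a row modelled on the question's markup (closing tags added as an assumption):

```python
from bs4 import BeautifulSoup

# One row modelled on the question's markup (closing tags added).
row_html = """<table><tr class="" data-row="2">
<td align="right" csk="13">13</td>
<td align="left" csk="Jones,Andre"><a href="/players/andre-jones-2.html">Andre Jones</a>
</td>
</tr></table>"""

row = BeautifulSoup(row_html, "html.parser").find("tr")

# The row's text joins every descendant string: the number cell,
# the player name, and the whitespace between the tags.
row_text = row.get_text()
print(repr(row_text))

# Comparing the whole row fails; comparing the anchor's text succeeds.
print(row_text == "Andre Jones")
print(row.a.get_text(strip=True) == "Andre Jones")
```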

Solution2

Logic: find the element whose text matches FooName (here, the string inside the anchor tag), then walk up the tree to its nearest ancestor whose tag name is tr.

# Exact match
print(Table.find(text='FooName').find_parent('tr'))
# Fuzzy match
# import re
# print(Table.find(text=re.compile('FooName')).find_parent('tr'))

Output

<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
<a href="/players/Foo-Name-1.html">FooName</a>
</td>
</tr>
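The bottom-up approach can also be extended to return all rows for a duplicated name, which is what the question is after. The sketch below assumes Beautiful Soup 4.4.0 or later, where the `string=` argument is available (earlier versions call it `text=`). The table data, including the second Andre Jones row and its URL, is made up for illustration, and `rows_by_anchor_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Illustrative data: row 2 and its URL are invented to show a duplicate name.
HTML = """
<table><tbody>
<tr data-row="0"><td>13</td><td><a href="/players/andre-jones-2.html">Andre Jones</a></td></tr>
<tr data-row="1"><td>7</td><td><a href="/players/deandre-jones-1.html">DeAndre Jones</a></td></tr>
<tr data-row="2"><td>21</td><td><a href="/players/andre-jones-3.html">Andre Jones</a></td></tr>
</tbody></table>
"""


def rows_by_anchor_text(table, player):
    """Find every <a> whose string equals player, then climb to its <tr>."""
    rows = []
    seen = set()  # ids of rows already collected, to avoid repeats
    for anchor in table.find_all("a", string=player):
        row = anchor.find_parent("tr")
        if row is not None and id(row) not in seen:
            seen.add(id(row))
            rows.append(row)
    return rows


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("tbody")
duplicates = rows_by_anchor_text(table, "Andre Jones")
print([row["data-row"] for row in duplicates])
```

A name is duplicated when the returned list has more than one element; an empty list means the player is not in the table.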


 