How to print out all cells of a table row in Beautiful Soup

I'm just starting to learn how to use Beautiful Soup.

As an exercise, I picked this page from ESPN.

There's a table in there with NBA players and their fantasy ranks. I was able to print the whole row out and it shows everything I see in my browser.

However, when I try to print each cell by itself, it prints "None" for the first cell. For some reason, it can't parse a cell that contains an anchor (<a>) tag.

Here's my code below:

from bs4 import BeautifulSoup

import urllib2
import re

if __name__ == '__main__':
    url = "http://www.espn.com/espn/print?id=20443164"
    resp = urllib2.urlopen(url)
    soup = BeautifulSoup(resp.read())

    table = soup.find_all("table")
    mytable = table[2]
    rows = mytable.findChildren(['th','tr'])
    print rows
    for row in rows:
        cells = row.findChildren('td')
        for cell in cells:
            # print cell.string  # line in question
            print cell  # line in question

If I use

print cell

I get the following output:

<td>1. <a href="http://www.espn.com/nba/player/_/id/3032977/giannis-antetokounmpo">Giannis Antetokounmpo</a>, SF/PF</td>
<td>PHI</td>
<td>C24</td>

If I use

print cell.string

I get the following output:

None
MIL
SF1

So how can I make everything print out without the "td" tags but recognize everything in the first cell without printing "None"?

Try this in your last loop: change cell.string to cell.text.

for cell in cells:
    print cell.text

You can do something like this:

print(cell.text)

This gives you the text inside the cell, skipping all the tags in it.
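As a minimal sketch of that change, here is the same kind of loop run against a small inline table rather than the live ESPN page. The sample HTML is a hypothetical stand-in modeled on the row shown in the question, the helper name row_texts is made up for illustration, and the code uses Python 3 syntax with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one row of the ESPN table (not fetched from the site).
SAMPLE_HTML = """
<table>
  <tr>
    <td>1. <a href="http://www.espn.com/nba/player/_/id/3032977/giannis-antetokounmpo">Giannis Antetokounmpo</a>, SF/PF</td>
    <td>MIL</td>
    <td>SF1</td>
  </tr>
</table>
"""


def row_texts(html):
    """Return one list per <tr>, holding the text of each <td> in that row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # .text concatenates every string inside the cell, including
        # the text nested in the <a> tag.
        rows.append([cell.text for cell in row.find_all("td")])
    return rows


rows = row_texts(SAMPLE_HTML)
for row in rows:
    for cell_text in row:
        print(cell_text)
```

With this sample, the first cell should come out as plain text with the player name included, instead of None.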

From the official documentation regarding .string:

.string

  • If a tag has only one child, and that child is a NavigableString, the child is made available as .string

  • If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child

  • If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None
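The three rules can be checked with a short sketch like the one below. The HTML snippets and the helper td_string are made up for illustration, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup


def td_string(html):
    """Parse a snippet and return the .string of its first <td>."""
    return BeautifulSoup(html, "html.parser").td.string


# Rule 1: the only child is a NavigableString.
only_text = td_string("<td>MIL</td>")

# Rule 2: the only child is a tag that itself has a .string.
only_tag = td_string('<td><a href="#">Giannis Antetokounmpo</a></td>')

# Rule 3: several children (text, <a>, text), so .string is None.
mixed = td_string('<td>1. <a href="#">Giannis Antetokounmpo</a>, SF/PF</td>')

print(repr(only_text), repr(only_tag), repr(mixed))
```

The third snippet has the same shape as the first cell in the question, which is why that cell alone prints None.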

By "If a tag contains more than one thing" they mean that when a tag has more than one child, for example text alongside another tag, tag.string evaluates to None. That's why you are getting None for the first <td> tag in your code: it contains an <a> tag in addition to plain text.

So, to get the complete text of a tag, you can use get_text(). In your code, use cell.get_text().

Or, for this case, you could also use cell.text. .text is the same as get_text(), which you can see in the source code:

text = property(get_text) 
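A quick way to convince yourself of the equivalence is a sketch like this one. The HTML snippet is a made-up copy of the first cell from the question. The last call goes slightly beyond the answer: get_text() also accepts a separator and a strip flag, which the .text property cannot take.

```python
from bs4 import BeautifulSoup

# Hypothetical cell modeled on the first <td> in the question.
html = '<td>1. <a href="#">Giannis Antetokounmpo</a>, SF/PF</td>'
cell = BeautifulSoup(html, "html.parser").td

via_property = cell.text       # property access
via_method = cell.get_text()   # explicit method call, same result

# Optional: join the individual strings with a separator and strip
# the whitespace around each piece.
pieces = cell.get_text("|", strip=True)

print(via_property)
print(pieces)
```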
