I'm just starting to learn how to use Beautiful Soup.
As an exercise, I picked this page from ESPN.
There's a table in there with NBA players and their fantasy ranks. I was able to print each whole row, and it shows everything I see in my browser.
However, when I print each cell by itself, the first cell prints "None": for some reason, cell.string can't handle a cell that contains an anchor.
Here's my code below:
    from bs4 import BeautifulSoup
    import urllib2
    import re

    if __name__ == '__main__':
        url = "http://www.espn.com/espn/print?id=20443164"
        resp = urllib2.urlopen(url)
        soup = BeautifulSoup(resp.read())
        table = soup.find_all("table")
        mytable = table[2]
        rows = mytable.findChildren(['th','tr'])
        print rows
        for row in rows:
            cells = row.findChildren('td')
            for cell in cells:
                # print cell.string # line in question
                print cell # line in question
If I use
print cell
I get the following output:
<td>1. <a href="http://www.espn.com/nba/player/_/id/3032977/giannis-antetokounmpo">Giannis Antetokounmpo</a>, SF/PF</td>
<td>PHI</td>
<td>C24</td>
If I use
print cell.string
I get the following output:
None
MIL
SF1
So how can I make everything print out without the "td" tags but recognize everything in the first cell without printing "None"?
Try this in your last loop: change cell.string to cell.text.
    for cell in cells:
        print cell.text
You can do something like this:
    print (cell.text)
This will get you the text inside the cell, skipping all the tags in it.
From the official documentation regarding .string (emphasis mine):
    .string
    If a tag has only one child, and that child is a NavigableString, the child is made available as .string
    If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child
    If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None
What they mean by "If a tag contains more than one thing" is that if a tag has more than one child (for example, some text plus another tag), tag.string evaluates to None. That's the reason you are getting None for the first <td> tag in your code: it contains the text "1. ", an <a> tag, and the text ", SF/PF".
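The three cases from the quoted documentation can be checked on a small snippet. This is a minimal sketch, assuming Python 3 and the built-in "html.parser"; the HTML fragment is made up for illustration and modeled on the cell from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three cells, one for each case in the docs.
html = (
    "<table><tr>"
    "<td>MIL</td>"                                              # only child is a string
    '<td><a href="#">Giannis Antetokounmpo</a></td>'            # only child is a tag
    '<td>1. <a href="#">Giannis Antetokounmpo</a>, SF/PF</td>'  # text + tag + text
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
only_text, only_tag, mixed = soup.find_all("td")

# .string works for the first two cells and gives None for the mixed one.
string_values = [only_text.string, only_tag.string, mixed.string]
print(string_values)

# get_text() concatenates every string inside the tag, so it never hits that ambiguity.
mixed_text = mixed.get_text()
print(mixed_text)
```

Only the third cell, the one with several children, yields None from .string.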
So, to get the complete text of a tag, you can use get_text(). In your code, that means cell.get_text().
Or, for this case, you could also use cell.text. .text is the same as get_text(), which you can see in the source code:
    text = property(get_text)
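Putting this together, the inner loop from the question could look roughly like the sketch below. It assumes Python 3 and parses a hard-coded fragment instead of fetching the ESPN page; the header labels and the variable names are hypothetical, and the data row is copied from the output shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the ESPN table; header labels are invented.
html = """<table>
<tr><th>Player</th><th>Team</th><th>Pos rank</th></tr>
<tr><td>1. <a href="http://www.espn.com/nba/player/_/id/3032977/giannis-antetokounmpo">Giannis Antetokounmpo</a>, SF/PF</td><td>MIL</td><td>SF1</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # get_text() returns all text in the cell, including text inside <a>.
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

for cells in rows:
    print(cells)

# .text is a property backed by get_text(), so both return the same string.
first_cell = soup.find("td")
same = first_cell.text == first_cell.get_text()
```

Calling get_text() with no arguments and then strip() keeps the spacing inside the cell ("1. Giannis ...") while trimming whitespace at the ends.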