I'm trying to get the text from a webpage with Python 3.3 and then search through that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy. I need to save the text after each category (card text, rarity, etc.) in the card info. Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code:
urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
).read()
htmlRaw = BeautifulSoup(urlStr)
htmlText = htmlRaw.get_text
for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"
This is incorrect:
htmlText = htmlRaw.get_text
As get_text is a method of the BeautifulSoup class, you're assigning the method to htmlText and not its result. There is a property variant of it that will do what you want here:
htmlText = htmlRaw.text
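To make the difference concrete, here is a minimal sketch using a tiny hand-written HTML string (a stand-in, not the real page). It also shows that the text you get back is a single str, so looping over it yields characters rather than lines; stripped_strings is one way to get one entry per text node instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, not the real markup
html = "<div><p>Card Text:</p><p>Draw a card.</p></div>"
soup = BeautifulSoup(html, "html.parser")

bound_method = soup.get_text   # the method object itself, no text yet
called = soup.get_text()       # calling it returns a str
via_property = soup.text       # property that returns the same str

print(callable(bound_method))  # the method can be called, but not iterated
print(called == via_property)

# A str iterates character by character, so collect the text nodes instead
lines = list(soup.stripped_strings)
print(lines)
```

With a list like this, the value after a label such as "Card Text:" is simply the next element.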
You're also using an HTML parser simply to strip tags, when you could use it to target the data you want:
# unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'
# grab the html section with the card info
card_data = htmlRaw.find(id=card_id)
# create a generator to iterate over the rows
card_rows = ( row for row in card_data.find_all('div', 'row') )
# create a generator to produce functions for retrieving the values
# (row=row binds each getter to its own row, even if it is called later)
card_rows_getters = ( lambda x, row=row: row.find('div', x).text.strip() for row in card_rows )
# create a generator to get the values
card_values = ( (get('label'), get('value')) for get in card_rows_getters )
# dump them into a dictionary
cards = dict( card_values )
print(cards)
{u'Artist:': u'Scott Chou',
u'Card Name:': u'Dark Prophecy',
u'Card Number:': u'93',
u'Card Text:': u'Whenever a creature you control dies, you draw a card and lose 1 life.',
u'Community Rating:': u'Community Rating: 3.617 / 5\xa0\xa0(64 votes)',
u'Converted Mana Cost:': u'3',
u'Expansion:': u'Magic 2014 Core Set',
u'Flavor Text:': u'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
u'Mana Cost:': u'',
u'Rarity:': u'Rare',
u'Types:': u'Enchantment'}
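If you want to try the extraction without hitting the network, the sketch below runs the same idea against a small hand-written snippet. The snippet and its id are made up; it only imitates the row / label / value layout that the code above relies on, and the real page markup may differ. A dict comprehension replaces the chain of generators.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in that only mimics the assumed layout of the card info
html = """
<div id="rightCol">
  <div class="row">
    <div class="label">Card Name:</div>
    <div class="value">Dark Prophecy</div>
  </div>
  <div class="row">
    <div class="label">Rarity:</div>
    <div class="value"> Rare </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card_data = soup.find(id="rightCol")

# Map each row's label text to its value text, stripping whitespace
cards = {
    row.find("div", "label").text.strip(): row.find("div", "value").text.strip()
    for row in card_data.find_all("div", "row")
}
print(cards)
```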
Now you have a dictionary of the information you want (plus a few extra) which will be a lot easier to deal with.
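For instance, the message from the question could be assembled from that dictionary roughly like this. It is only a sketch: FIELDS and build_message are names made up for illustration, and the sample values are copied from the output above.

```python
# Sample values copied from the output above
cards = {
    'Converted Mana Cost:': '3',
    'Types:': 'Enchantment',
    'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
    'Flavor Text:': 'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
    'Rarity:': 'Rare',
}

# (key in the scraped dictionary, heading used in the message)
FIELDS = [
    ('Converted Mana Cost:', 'Converted Mana Cost'),
    ('Types:', 'Type'),
    ('Card Text:', 'Rules Text'),
    ('Flavor Text:', 'Flavor Text'),
    ('Rarity:', 'Rarity'),
]


def build_message(cards):
    """Join the selected fields using the message format from the question."""
    parts = []
    for key, heading in FIELDS:
        value = cards.get(key)
        if value:  # skip rows that are missing or empty on the page
            parts.append("*{}: {}* \n\n".format(heading, value))
    return "".join(parts)


message = build_message(cards)
print(message)
```

Looking values up by key also removes the need to track which line follows which label.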