
Get text from a webpage as an iterable object in Python 3.3

I'm trying to get the text from a webpage with Python 3.3 and then search through that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy. I need to save the text after each category (card text, rarity, etc.) in the card info. Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code:

urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
    ).read()

htmlRaw = BeautifulSoup(urlStr)

htmlText = htmlRaw.get_text

for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"

This is incorrect:

htmlText = htmlRaw.get_text

get_text is a method of the BeautifulSoup class, so this assigns the method itself to htmlText rather than its result. Either call it, as htmlRaw.get_text(), or use the equivalent property:

htmlText = htmlRaw.text
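Keep in mind that .text (like get_text()) returns one string, so a for loop over it yields single characters, not lines, and a str has no next() method. Below is a minimal sketch of line-wise scanning, assuming each label and its value land on separate lines of the extracted text. The helper name values_after_labels and the sample text are made up for illustration and are not taken from the real page:

```python
def values_after_labels(text, labels):
    """Map each label to the first non-empty line that follows it."""
    # Strip every line and drop blank ones, so "next line" means "next content".
    lines = iter([ln.strip() for ln in text.splitlines() if ln.strip()])
    found = {}
    for line in lines:
        if line in labels:
            # next() consumes the value line; None if the text ends right here.
            found[line] = next(lines, None)
    return found


# Hypothetical stand-in for htmlRaw.get_text(); the real page layout may differ.
sample_text = """
Converted Mana Cost:
3

Types:
Enchantment
Rarity:
Rare
"""

info = values_after_labels(sample_text, {"Converted Mana Cost:", "Types:", "Rarity:"})
print(info["Types:"])
```

Because the for loop and next() share the same iterator, the value line is consumed and never tested as a label itself.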

You're also using an HTML parser merely to strip tags, when you could use it to target the data you want:

# unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'

# grab the html section with the card info
card_data = htmlRaw.find(id=card_id)

# build (label, value) pairs from the 'label' and 'value' divs of each row
card_values = (
    (row.find('div', 'label').text.strip(), row.find('div', 'value').text.strip())
    for row in card_data.find_all('div', 'row')
)

# dump them into a dictionary
cards = dict( card_values )

print(cards)

{'Artist:': 'Scott Chou',
 'Card Name:': 'Dark Prophecy',
 'Card Number:': '93',
 'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
 'Community Rating:': 'Community Rating: 3.617 / 5\xa0\xa0(64 votes)',
 'Converted Mana Cost:': '3',
 'Expansion:': 'Magic 2014 Core Set',
 'Flavor Text:': 'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
 'Mana Cost:': '',
 'Rarity:': 'Rare',
 'Types:': 'Enchantment'}

Now you have a dictionary of the information you want (plus a few extra) which will be a lot easier to deal with.
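For instance, the message from the question can be assembled straight from that dictionary. This is only a sketch, assuming the keys shown in the output above; build_message and the wanted list are illustrative names, and dict.get is used so that a missing or empty field is skipped rather than raising an error:

```python
def build_message(cards, wanted):
    """Format selected card fields as '*Heading: value*' paragraphs."""
    parts = []
    for key, heading in wanted:
        value = cards.get(key)
        if value:  # skip fields that are missing or empty
            parts.append("*{}: {}* \n\n".format(heading, value))
    return "".join(parts)


# Subset of the dictionary printed above (no 'Flavor Text:' entry on purpose).
cards = {
    'Converted Mana Cost:': '3',
    'Types:': 'Enchantment',
    'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
    'Mana Cost:': '',
    'Rarity:': 'Rare',
}

# (dictionary key, heading used in the message), in the order from the question
wanted = [
    ('Converted Mana Cost:', 'Converted Mana Cost'),
    ('Types:', 'Type'),
    ('Card Text:', 'Rules Text'),
    ('Flavor Text:', 'Flavor Text'),
    ('Rarity:', 'Rarity'),
]

message = build_message(cards, wanted)
print(message)
```

Keeping the key-to-heading pairs in a list replaces the chain of if/elif branches and makes the field order explicit.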

