HTML parsing text in Python 3
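I'm trying to use Python 3.3 to get the text from a web page and then search that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, I take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy and I need to save the text after each category (card text, rarity, etc.) in the card info. Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code: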
    urlStr = urllib.request.urlopen('http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName).read()
    htmlRaw = BeautifulSoup(urlStr)
    htmlText = htmlRaw.get_text
    for line in htmlText:
        line = line.strip()
        if "Converted Mana Cost:" in line:
            cmc = line.next()
            message += "*Converted Mana Cost: " + cmc +"* \n\n"
        elif "Types:" in line:
            type = line.next()
            message += "*Type: " + type +"* \n\n"
        elif "Card Text:" in line:
            rulesText = line.next()
            message += "*Rules Text: " + rulesText +"* \n\n"
        elif "Flavor Text:" in line:
            flavor = line.next()
            message += "*Flavor Text: " + flavor +"* \n\n"
        elif "Rarity:" in line:
            rarity == line.next()
            message += "*Rarity: " + rarity +"* \n\n"
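A few things in the loop above can be pinned down without seeing the page. get_text is a method, so it has to be called as htmlRaw.get_text(); called that way it returns a single string, and iterating over a string yields characters rather than lines. Strings also have no next() method, and rarity == line.next() compares instead of assigning. The following is only a minimal sketch of the "find a label, keep the next line" idea on plain text. The helper name extract_fields and the sample text are made up for illustration, and the sketch assumes each value sits on the first non-empty line after its label, which may not match what get_text() actually returns for this page.

```python
def extract_fields(page_text, labels):
    """Map each label to the first non-empty line that follows it."""
    # Drop blank lines so "the next line" means the next line with content.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    fields = {}
    for i, line in enumerate(lines[:-1]):
        for label in labels:
            if label in line and label not in fields:
                fields[label] = lines[i + 1]
    return fields


# Illustrative stand-in for the string returned by soup.get_text();
# the real page text is an assumption and may be laid out differently.
SAMPLE_TEXT = """
Converted Mana Cost:
   3
Types:
   Enchantment

Card Text:
   Whenever a creature you control dies, you draw a card and you lose 1 life.
Rarity:
   Rare
"""

LABELS = ["Converted Mana Cost:", "Types:", "Card Text:", "Flavor Text:", "Rarity:"]

fields = extract_fields(SAMPLE_TEXT, LABELS)

# Build the message in label order, skipping labels that were not found.
message = "".join(
    "*{} {}* \n\n".format(label, fields[label])
    for label in LABELS
    if label in fields
)
```

With real data, page_text would come from BeautifulSoup(urlStr).get_text(). Labels that are missing from the page (here "Flavor Text:") are simply left out of the result instead of raising an error.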
Consider switching to lxml and XPath; then you can do something like this:
>>> from lxml import html
>>> root = html.parse("http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy")
>>> root.xpath('//div[contains(text(), "Flavor Text")]/following-sibling::div/div/i/text()')
['When the bog ran short on small animals, Ekri turned to the surrounding farmlands.']
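The same idea can be extended to several fields at once. The sketch below is an assumption-laden illustration rather than a tested scraper for the live page: the helper field_after and the sample markup are hypothetical, and the markup only imitates a layout in which each label div is followed by a sibling div holding the value. It runs on an inline string so it does not depend on the network.

```python
from lxml import html

# Illustrative markup imitating a label/value layout; the real page may differ.
SAMPLE_HTML = """
<div id="details">
  <div class="row">
    <div class="label">Rarity:</div>
    <div class="value"><span>Rare</span></div>
  </div>
  <div class="row">
    <div class="label">Flavor Text:</div>
    <div class="value"><div><i>When the bog ran short on small animals, Ekri turned to the surrounding farmlands.</i></div></div>
  </div>
</div>
"""


def field_after(root, label):
    """Return the text of the div that follows the label div, or None."""
    # $label is an XPath variable; lxml fills it in from the keyword argument.
    hits = root.xpath(
        '//div[contains(text(), $label)]/following-sibling::div[1]',
        label=label,
    )
    # text_content() flattens nested tags such as <i> or <span> into one string.
    return hits[0].text_content().strip() if hits else None


root = html.fromstring(SAMPLE_HTML)
card = {label: field_after(root, label)
        for label in ("Rarity", "Flavor Text", "Card Text")}
```

Using text_content() on the sibling div avoids hard-coding the inner div/i path for every field, at the cost of also picking up any extra text inside that div. A label that is not present, such as "Card Text" in the sample, comes back as None.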