HTML parsing text in Python 3

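I'm trying to get the text from a web page with Python 3.3 and then search that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, I take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy and I need to save the text after each category (card text, rarity, etc.) in the card info. Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code: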

    urlStr = urllib.request.urlopen('http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName).read()
    htmlRaw = BeautifulSoup(urlStr)
    htmlText = htmlRaw.get_text

    for line in htmlText:
        line = line.strip()
        if "Converted Mana Cost:" in line:
            cmc = line.next()
            message += "*Converted Mana Cost: " + cmc +"* \n\n"
        elif "Types:" in line:
            type = line.next()
            message += "*Type: " + type +"* \n\n"
        elif "Card Text:" in line:
            rulesText = line.next()
            message += "*Rules Text: " + rulesText +"* \n\n"
        elif "Flavor Text:" in line:
            flavor = line.next()
            message += "*Flavor Text: " + flavor +"* \n\n"
        elif "Rarity:" in line:
            rarity == line.next()
            message += "*Rarity: " + rarity +"* \n\n"
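
A note on the loop above: `get_text` is referenced but never called, so `htmlText` is a bound method rather than a string, and Python strings have no `next()` method. The sketch below shows one way the intended "find a label, keep the text that follows" logic could look, assuming the page text is already available as a `str` (for instance from calling `get_text()`). The sample text, the `LABELS` tuple, and the helper name `pair_labels` are illustrative assumptions, not part of the original post, and the real page text may put labels and values on the same line.

```python
# Labels taken from the loop in the question.
LABELS = ("Converted Mana Cost:", "Types:", "Card Text:", "Flavor Text:", "Rarity:")

# Made-up stand-in for the text of a card page; the real layout may differ.
SAMPLE_TEXT = """
Converted Mana Cost:
3

Types:
Enchantment

Rarity:
Rare
"""


def pair_labels(text, labels=LABELS):
    """Map each label found in `text` to the next non-empty line after it."""
    # Drop blank lines so "the next line" means the next line with content.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    found = {}
    it = iter(lines)
    for line in it:
        for label in labels:
            if label in line:
                # next() on the iterator consumes the line after the label;
                # the default "" avoids StopIteration on a trailing label.
                found[label.rstrip(":")] = next(it, "")
                break
    return found


fields = pair_labels(SAMPLE_TEXT)
print(fields)
```

Using an explicit iterator keeps the "take the following line" step in one place, so each label no longer needs its own `elif` branch.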

Consider switching to lxml and XPath; you will be able to do something like this:

>>> from lxml import html
>>> root = html.parse("http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy")
>>> root.xpath('//div[contains(text(), "Flavor Text")]/following-sibling::div/div/i/text()')
['When the bog ran short on small animals, Ekri turned to the surrounding farmlands.']
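
The same idea could be extended to several categories at once. The following is a minimal sketch, assuming each label sits in a `div` whose next sibling `div` holds the value. The inline markup is a simplified, hypothetical stand-in for the real page (which may be structured differently), and `field_text` is an illustrative helper name.

```python
from lxml import html

# Hypothetical, simplified markup; the live page layout may differ.
SAMPLE = """
<html><body>
  <div class="row">
    <div class="label">Rarity:</div>
    <div class="value"><span>Rare</span></div>
  </div>
  <div class="row">
    <div class="label">Converted Mana Cost:</div>
    <div class="value">3</div>
  </div>
</body></html>
"""


def field_text(root, label):
    """Return the text of the div right after the label div, or None."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    nodes = root.xpath(
        '//div[contains(text(), $label)]/following-sibling::div[1]',
        label=label,
    )
    if not nodes:
        return None
    # text_content() gathers text from nested tags; split/join tidies whitespace.
    return " ".join(nodes[0].text_content().split())


root = html.fromstring(SAMPLE)
card = {
    label: field_text(root, label)
    for label in ("Rarity", "Converted Mana Cost", "Flavor Text")
}
print(card)
```

Labels missing from the markup (here "Flavor Text") come back as `None` instead of raising, so the caller can decide which fields to include in the message.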
