
Python pattern matching

I'm currently in the process of converting an old bash script of mine into a Python script with added functionality. I've been able to do most things, but I'm having a lot of trouble with Python pattern matching.

In my previous script, I downloaded a web page and used sed to get the element I wanted. The matching was done like so (for one of the values I wanted):

PM_NUMBER=`cat um.htm | LANG=sv_SE.iso88591 sed -n 's/.*ol.st.*pm.*count..\([0-9]*\).*/\1/p'`

It would match the number wrapped in <span class="count"></span> after the phrase "olästa pm". The markup I'm running this against is:

<td style="padding-left: 11px;">
    <a href="/abuse_list.php">
        <img src="/gfx/abuse_unread.png" width="15" height="12" alt="" title="9  anmälningar" />
    </a>
</td>
<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
        <span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/user_guestbook.php" title="Min gästbok">
        <span class="count">1</span>
</td> 
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/forum.php?view=3" title="Du har 1 ny forumkommentar">
        <span class="count">1</span>
</td> 
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/user_images.php?user_id=162005&func=display_new_comments" title="Du har 1 ny albumkommentar">
        <span class="count">1</span>
</td> 
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/forum_favorites.php" title="Du har 2 uppdaterade trådar i &quot;bevakade trådar&quot;">
        <span class="count">2</span>
</td>

I'm hesitant to post this, because it seems like I'm asking for a lot, but could someone please help me with a way to parse this in Python? I've been pulling my hair out trying to do this, but regular expressions and I just don't match (pardon the pun). I've spent the last couple of hours experimenting and reading the Python manual on regular expressions, but I can't seem to figure it out.

Just to make it clear, what I need are 7 different expressions for matching the number within <span class="count"></span>. I need to, for example, be able to find the number of unread PMs ("olästa pm").

You will not parse HTML yourself. You will use an HTML parser written for Python to parse the HTML.

You can use lxml to pull out the values you are looking for pretty easily with XPath expressions.

Example

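from lxml import html

# Read the file as bytes so lxml can deal with the page's encoding itself
with open("um.htm", "rb") as f:
    page = html.fromstring(f.read())
# <span> children of links whose title contains "pm." or "ol"
matches = page.xpath("//a[contains(@title, 'pm.') or contains(@title, 'ol')]/span")
print([elem.text for elem in matches])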
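
Since the question asks for several separate values, the same idea can collect every counter in one pass instead of writing one expression per value. This is only a sketch under assumptions: `SAMPLE` is a shortened copy of the markup from the question, and `counts_by_href` is a made-up helper name. It keys the result on each link's `href` rather than on the Swedish title text, which sidesteps encoding issues when matching.

```python
from lxml import html

# Assumed excerpt of the page from the question (the <a> tags are unclosed there too).
SAMPLE = """
<table><tr>
<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
        <span class="count">1</span>
</td>
</tr></table>
"""


def counts_by_href(markup):
    """Hypothetical helper: map each link's href to the number in its count span."""
    page = html.fromstring(markup)
    result = {}
    for link in page.xpath("//a[@class='page_login_text']"):
        spans = link.xpath(".//span[@class='count']")
        if not spans or spans[0].text is None:
            continue  # link without a counter
        text = spans[0].text.strip()
        if text.isdigit():
            result[link.get("href")] = int(text)
    return result


counts = counts_by_href(SAMPLE)
print(counts.get("/pm.php"))
```

Looking values up by `href` (for example `counts.get("/pm.php")` for unread PMs) should stay stable even if the site rewords its titles.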

use either:

Parsing HTML with regexes is a recipe for disaster.

It is impossible to reliably match HTML using regular expressions. It is usually possible to cobble something together that works for a specific page, but it is not advisable, as even a subtle tweak to the source HTML can render all your work useless. HTML simply has a more complex structure than regular expressions are capable of describing.
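
To make the "works for a specific page" point concrete, here is a rough Python port of the idea behind the sed expression in the question. It is a sketch, not a recommendation: the `page` string is an assumed excerpt, and `PM_PATTERN` and `pm_count` are names invented for this example.

```python
import re

# Assumed excerpt of the page; real input would come from the downloaded file.
page = '''
<a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
    <span class="count">3</span>
<a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
    <span class="count">1</span>
'''

# "ol", any one character, "sta pm", then the first count span after it.
# re.DOTALL lets ".*?" cross the line break between <a> and <span>.
PM_PATTERN = re.compile(r'ol.sta\s+pm.*?class="count">(\d+)<', re.DOTALL)


def pm_count(markup):
    """Return the unread-PM count as an int, or None if the pattern fails."""
    match = PM_PATTERN.search(markup)
    return int(match.group(1)) if match else None


print(pm_count(page))  # 3
# A harmless markup change (single quotes) silently breaks the match.
print(pm_count(page.replace('class="count"', "class='count'")))  # None
```

The second call shows the fragility described above: the page still means the same thing to a browser, but the pattern no longer matches.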

The proper solution is to use a dedicated HTML parser. Note that even XML parsers won't do what you need, not reliably anyway. Valid XHTML is valid XML, but even valid HTML is not, even though it's quite similar. And valid HTML/XHTML is nearly impossible to find in the wild anyway.
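
As a quick illustration of the XML point, feeding one cell of the question's markup to the standard library's XML parser fails, because the <a> element is never closed. The `cell` string below is copied from the question; the `parsed` flag is just a name used for this sketch.

```python
import xml.etree.ElementTree as ET

# One cell from the question's markup: the <a> element is never closed.
cell = '''<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>'''

try:
    ET.fromstring(cell)
    parsed = True
except ET.ParseError as exc:
    # The closing </td> does not match the still-open <a>, so this is not well-formed XML.
    parsed = False
    print("XML parser rejected the markup:", exc)
```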

There are a few different HTML parsers available:

  • BeautifulSoup is not in the standard library, but it is the most forgiving parser: it can handle almost all real-world HTML, and it's designed to do exactly what you're trying to do.
  • HTMLParser is included in the Python standard library (as html.parser in Python 3), but in Python 2 it is fairly strict about accepting only valid HTML.
  • htmllib is also in the Python 2 standard library, but it is deprecated and was removed in Python 3.
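
For completeness, here is what the standard-library route might look like on Python 3, where the module is `html.parser` and works as an event-based tokenizer. This is a minimal sketch: the `CountCollector` class and the `sample` string are assumptions made up for this example, not part of any library.

```python
from html.parser import HTMLParser


class CountCollector(HTMLParser):
    """Collect (link title, count) pairs from <a title=...> ... <span class="count">N</span>."""

    def __init__(self):
        super().__init__()
        self.current_title = None  # title of the most recently opened <a>
        self.in_count = False      # True while inside <span class="count">
        self.counts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_title = attrs.get("title")
        elif tag == "span" and attrs.get("class") == "count":
            self.in_count = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_count = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_count and text.isdigit():
            self.counts.append((self.current_title, int(text)))


# Assumed excerpt of the markup from the question.
sample = '''
<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
        <span class="count">1</span>
</td>
'''

collector = CountCollector()
collector.feed(sample)
collector.close()
print(collector.counts)
```

The trade-off is that you have to track state yourself (which link you are in, whether you are inside a count span), which is the bookkeeping that tree-building parsers do for you.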

As other people have suggested, BeautifulSoup is almost certainly your best choice.
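
A BeautifulSoup version of the unread-PM lookup could be sketched like this. It assumes the `bs4` package (BeautifulSoup 4) is installed; the `sample` string is an excerpt of the question's markup, and `pm_count` is just a variable name chosen for this example.

```python
from bs4 import BeautifulSoup

# Assumed excerpt of the markup from the question (unclosed <a> tags included).
sample = '''
<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
        <span class="count">1</span>
</td>
'''

# "html.parser" selects the standard-library backend, so no extra parser is needed.
soup = BeautifulSoup(sample, "html.parser")

# Find the link whose title mentions PMs, then the count span inside it.
link = soup.find("a", title=lambda t: t is not None and "pm." in t)
span = link.find("span", class_="count") if link else None
pm_count = int(span.get_text(strip=True)) if span else None
print(pm_count)
```

The other counters can be fetched the same way by changing the test on the title, or by matching on `href` instead.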


 