
Python pattern matching

I'm currently in the process of converting an old bash script of mine into a Python script with added functionality. I've been able to do most things, but I'm having a lot of trouble with Python pattern matching.

In my previous script, I downloaded a web page and used sed to extract the elements I wanted. The matching was done like so (for one of the values I wanted):

PM_NUMBER=`cat um.htm | LANG=sv_SE.iso88591 sed -n 's/.*ol.st.*pm.*count..\([0-9]*\).*/\1/p'`

It would match the number wrapped in <span class="count"></span> after the phrase "olästa pm". The markup I'm running this against is:

<td style="padding-left: 11px;">
    <a href="/abuse_list.php">
        <img src="/gfx/abuse_unread.png" width="15" height="12" alt="" title="9  anmälningar" />
    </a>
</td>
<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
        <span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/user_guestbook.php" title="Min gästbok">
        <span class="count">1</span>
</td> 
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/forum.php?view=3" title="Du har 1 ny forumkommentar">
        <span class="count">1</span>
</td> 
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/user_images.php?user_id=162005&func=display_new_comments" title="Du har 1 ny albumkommentar">
        <span class="count">1</span>
</td> 
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/forum_favorites.php" title="Du har 2 uppdaterade trådar i &quot;bevakade trådar&quot;">
        <span class="count">2</span>
</td>

I'm hesitant to post this, because it seems like I'm asking for a lot, but could someone please help me with a way to parse this in Python? I've been pulling my hair trying to do this, but regular expressions and I just don't match (pardon the pun). I've spent the last couple of hours experimenting and reading the Python manual on regular expressions, but I can't seem to figure it out.

Just to make it clear, what I need are 7 different expressions for matching the number within <span class="count"></span>. For example, I need to be able to find the number of unread PMs ("olästa pm").

Do not parse the HTML yourself. Use an HTML parser available for Python to do it for you.

You can use lxml to pull out the values you are looking for quite easily with XPath expressions.

Example

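from lxml import html
# Read raw bytes and let lxml deal with the page encoding.
with open("um.htm", "rb") as f:
    page = html.fromstring(f.read())
# Spans inside links whose title contains "pm." or "ol" (as in "olästa").
matches = page.xpath("//a[contains(@title, 'pm.') or contains(@title, 'ol')]/span")
print([elem.text for elem in matches])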

use either:

Parsing HTML with regexes is a recipe for disaster.

It is impossible to reliably match HTML using regular expressions. It is usually possible to cobble something together that works for a specific page, but it is not advisable, as even a subtle tweak to the source HTML can render all your work useless. HTML simply has a more complex structure than regular expressions are capable of describing.
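To make the fragility concrete, here is a minimal sketch that ports the sed expression from the question to Python's re module. The port is only an approximation of the original expression, and the "tweaked" markup is a hypothetical change invented for illustration (the span gains a second class); the function name extract_pm_count is likewise made up for this example.

```python
import re

# Rough Python port of the sed expression from the question (an approximation).
# re.DOTALL lets "." cross the line break between the <a> and the <span>.
PM_COUNT = re.compile(r"ol.st.*?pm.*?count..(\d+)", re.DOTALL)


def extract_pm_count(markup):
    """Return the unread-PM count as an int, or None if the pattern fails."""
    match = PM_COUNT.search(markup)
    return int(match.group(1)) if match else None


# Markup as it appears in the question.
original = '''<a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
    <span class="count">3</span>'''

# Hypothetical tweak: the site adds a second class to the span.
tweaked = '''<a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
    <span class="count unread">3</span>'''

print(extract_pm_count(original))  # 3
print(extract_pm_count(tweaked))   # None
```

The page still shows the same number to a visitor after the tweak, but the pattern no longer finds it, because it depended on exactly two characters sitting between "count" and the digits.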

The proper solution is to use a dedicated HTML parser. Note that even XML parsers won't do what you need, not reliably anyway. Valid XHTML is valid XML, but even valid HTML is not, even though it's quite similar. And valid HTML/XHTML is nearly impossible to find in the wild anyway.

There are a few different HTML parsers available:

  • BeautifulSoup is not in the standard library, but it is the most forgiving parser: it can handle almost all real-world HTML, and it's designed to do exactly what you're trying to do.
  • HTMLParser is included in the Python standard library (as html.parser in Python 3), but it is fairly strict about accepting only valid HTML.
  • htmllib is also in the standard library, but is deprecated (and was removed in Python 3).

As other people have suggested, BeautifulSoup is almost certainly your best choice.
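If installing a third-party package is not an option, the standard-library parser can also cope with markup like the sample in the question. The following is a minimal sketch for Python 3, not a definitive implementation: the class name CountParser and the idea of keying each count by the title of the most recent link are assumptions made for this example, and SAMPLE is a shortened copy of the question's markup.

```python
from html.parser import HTMLParser


class CountParser(HTMLParser):
    """Collect each <span class="count"> number, keyed by the enclosing link's title."""

    def __init__(self):
        super().__init__()
        self.counts = {}        # link title -> count
        self._title = None      # title of the most recently opened <a>
        self._in_count = False  # True while inside <span class="count">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._title = attrs.get("title")
        elif tag == "span" and attrs.get("class") == "count":
            self._in_count = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_count = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_count and self._title and text.isdigit():
            self.counts[self._title] = int(text)


# Shortened copy of the markup from the question (the <a> tags are never closed).
SAMPLE = '''
<td align="center">
    <a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
        <span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
    <a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
        <span class="count">1</span>
</td>
'''

parser = CountParser()
parser.feed(SAMPLE)
parser.close()

# Pick out one value by a phrase in the link title, e.g. unread PMs.
unread_pm = next((n for title, n in parser.counts.items() if "olästa pm" in title), None)
print(unread_pm)
```

Because the sketch only tracks the most recent <a> start tag, the missing </a> tags in the sample do not matter. The other counters can be looked up the same way by changing the phrase matched against the title.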


 