
Exclude the beginning from the regex

I need a regular expression which, given something like this:

<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" title="ააგებს">ააგებს</a></li>

Will match:

%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1

So far I got:

<li><a href="/wiki/%.*\d

But I don't know how to exclude the beginning from the result. Any ideas? I'm using Python.

I'm not sure which regex flavour you are using, so my best guess is:

/href="\/wiki\/((?:%[a-f0-9]{2})+)"/ig
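Since the question mentions Python, here is a minimal sketch of the same idea using the `re` module. The variable names (`html`, `pattern`, `lookbehind`) are illustrative and not from the original post. The first variant matches the prefix but captures only the part after `/wiki/`; the second uses a fixed-width lookbehind, so the prefix is required but is not part of the match at all.

```python
import re

html = ('<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" '
        'title="ააგებს">ააგებს</a></li>')

# Variant 1: match the prefix, but capture only the percent-encoded part.
pattern = re.compile(r'href="/wiki/((?:%[a-f0-9]{2})+)"', re.IGNORECASE)
m = pattern.search(html)
captured = m.group(1) if m else None

# Variant 2: a fixed-width lookbehind asserts the prefix without consuming it,
# so the whole match (group 0) is already the wanted text.
lookbehind = re.compile(r'(?<=href="/wiki/)(?:%[a-f0-9]{2})+', re.IGNORECASE)
m2 = lookbehind.search(html)
whole_match = m2.group(0) if m2 else None

print(captured)
print(whole_match)
```

Both variants should print the same percent-encoded string. The capturing-group form is usually easier to read; the lookbehind form is handy when you want `findall` or `group(0)` to return only the wanted text.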

If you are using a .NET language then you could do that more robustly than just using a regex to try to get the value. The HtmlAgilityPack is good for parsing HTML, even if the HTML is a bit malformed.

Here I have a function which tries to extract the href attribute of the first element in a piece of HTML, and then the rest of the program shows two ways you might extract the part of the href after "/wiki/":

Option Infer On

Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module Module1

    ''' <summary>
    ''' Get the value of the href attribute in the first anchor (&lt;a>) element of (a fragment of) an HTML string.
    ''' </summary>
    ''' <param name="s">An HTML fragment.</param>
    ''' <returns>The value of the href attribute in the first anchor (&lt;a>) element.</returns>
    ''' <remarks>Throws a FormatException if the href value cannot be found.</remarks>
    Function GetHref(s As String) As String
        ' Get the value of the href attribute, if it exists, in a reliable fashion. '
        Dim htmlDoc As New HtmlDocument
        htmlDoc.LoadHtml(s)
        Dim link = htmlDoc.DocumentNode.SelectSingleNode("//a")
        Dim hrefValue = String.Empty

        If link IsNot Nothing Then
            If link.Attributes("href") IsNot Nothing Then
                hrefValue = link.Attributes("href").Value
            Else
                ' there was no href '
                Throw New FormatException("No href attribute in the <a> element.")
            End If
        Else
            ' there was no <a> element '
            Throw New FormatException("No <a> element.")
        End If

        Return hrefValue

    End Function

    Sub Main()
        Dim s = "<li><a href=""/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1"" title=""ააგებს"">ააგებს</a></li>"

        Dim dataToCapture = String.Empty

        Dim hrefValue = GetHref(s)

        ' OPTION 1 - using RegEx
        ' Only get a specific pattern of characters
        Dim re = New Regex("^/wiki/((?:%[0-9A-F]{2})+)", RegexOptions.IgnoreCase)
        Dim m = re.Match(hrefValue)

        If m.Success Then
            dataToCapture = m.Groups(1).Value
            Console.WriteLine(dataToCapture)
        Else
            Console.WriteLine("Failed to match with RegEx.")
        End If

        ' OPTION 2 - looking at the string
        ' Just get whatever comes after the required start of the href value.
        Dim mustStartWith = "/wiki/"
        If hrefValue.StartsWith(mustStartWith) Then
            dataToCapture = hrefValue.Substring(mustStartWith.Length)
            Console.WriteLine(dataToCapture)
        Else
            Console.WriteLine("Nothing found with string operations.")
        End If

        ' the percent-encoded data could be decoded with System.Uri.UnescapeDataString(dataToCapture) '

        Console.ReadLine()

    End Sub

End Module

In a regex, parentheses, i.e. ( ), indicate a group to capture. However, we don't need to capture the individual %AA parts, so the inner group has a ?: modifier to mark it as a non-capturing group.
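To see the difference the `?:` makes, here is a small Python sketch (the sample `href` value is shortened for illustration). With a plain capturing inner group, the regex engine numbers it as group 2 and keeps only its last repetition, which is rarely useful here.

```python
import re

href = "/wiki/%E1%83%90"

# Non-capturing inner group: only the outer group is numbered.
non_capturing = re.match(r"/wiki/((?:%[0-9A-F]{2})+)", href)

# Capturing inner group: it becomes group 2 and holds only its last repetition.
capturing = re.match(r"/wiki/((%[0-9A-F]{2})+)", href)

print(non_capturing.groups())  # ('%E1%83%90',)
print(capturing.groups())      # ('%E1%83%90', '%90')
```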

(The spurious ' characters at the ends of the VB comments are only there to help SO colour the code properly.)

Seeing as you are using Python, you can try the pattern out in something like the Python Regular Expression Testing Tool:

>>> import re
>>> regex = re.compile(r'href="/wiki/((?:%[0-9A-F]{2})+)"', re.IGNORECASE)
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xd640db26af2f1d60>
>>> regex.match(string)
None

# List the groups found
>>> r.groups()
(u'%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1',)

# List the named dictionary objects found
>>> r.groupdict()
{}

# Run findall
>>> regex.findall(string)
[u'%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1']

where string is set to your example data.

However, similarly to what I showed for .NET, it would probably be better to parse the HTML with something like BeautifulSoup to get the value of the href and then work on that.
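As a rough illustration of the parse-first approach, the sketch below uses the standard library's `html.parser` rather than the third-party parser suggested above, only so that it runs without extra installs. The names `FirstHrefParser` and `wiki_title` are hypothetical helpers made up for this example. It mirrors the string-operations option from the .NET code, and decodes the percent-encoded result with `urllib.parse.unquote`.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class FirstHrefParser(HTMLParser):
    """Remember the first href found on an <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def wiki_title(html, prefix="/wiki/"):
    """Return the part of the first link's href after the prefix, or None."""
    parser = FirstHrefParser()
    parser.feed(html)
    href = parser.href
    if href is None or not href.startswith(prefix):
        return None
    return href[len(prefix):]


html = ('<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" '
        'title="ააგებს">ააგებს</a></li>')

encoded = wiki_title(html)
print(encoded)           # the percent-encoded text after /wiki/
print(unquote(encoded))  # the decoded title
```

Because the parser extracts the attribute value, this approach does not depend on attribute order, quoting style, or whitespace inside the tag, which a regex over raw HTML would have to account for.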
