
VB.Net Regex to get string

I have this string, and I want to get only the section that contains the actual email text, without HTML tags. The lines I want are marked "(this lines)" below.

    Content-Type: multipart/alternative; boundary=001a11391134f9593b05083dbd67
    X-Antivirus: avast! (VPS 141119-1, 19/11/2014), Inbound message
    X-Antivirus-Status: Clean

    --001a11391134f9593b05083dbd67
    Content-Type: text/plain; charset=UTF-8

    (this lines) lorem ipsum (this lines)
    (this lines) dolor sit amet (this lines)

    --001a11391134f9593b05083dbd67
    Content-Type: text/html; charset=UTF-8

    <div dir="ltr">lorem ipsum dolor sit amet</div>

    --001a11391134f9593b05083dbd67--
    .

I think the regex is something like ^Content-Type: text/plain.*.?$ (matching until it finds two dashes, "--"), but I don't know how to write it.

Thank you!

I'm no regex expert, so I may get the terminology wrong, but this should find the text/plain content up to the next matching boundary (the \1 is a backreference that matches the first capture group):

Dim content As String ' your string
Dim match = Regex.Match(
    content,
    "(\n--[0-9a-f]+)\nContent-Type: text/plain.*?\n\n(.*?)\1",
    RegexOptions.Multiline Or RegexOptions.Singleline
)
Dim textContent = match.Groups(2).Value

You'll probably need some error handling (for example, check match.Success, or use Regex.Matches if more than one part may match) and may need to adjust a few things for the real content.
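
As a minimal sketch of that error handling, the function below checks Match.Success before reading the group, and uses \r?\n so that CRLF line endings also match. The names GetPlainTextPart and PlainTextSketch are hypothetical, the sample message is shortened, and the boundary pattern is widened from hex digits to any run of non-whitespace on the assumption that real boundaries are not always hexadecimal.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PlainTextSketch

    ' Hypothetical helper: returns the text/plain body, or Nothing if no such part is found.
    Function GetPlainTextPart(content As String) As String
        ' Group 1: a line break followed by a "--boundary" delimiter line.
        ' Group 2: everything after the blank line that ends the part headers,
        '          up to the next occurrence of the same delimiter (\1).
        Dim pattern = "(\r?\n--\S+)\r?\nContent-Type: text/plain.*?\r?\n\r?\n(.*?)\1"
        Dim m = Regex.Match(content, pattern, RegexOptions.Singleline)
        Return If(m.Success, m.Groups(2).Value, Nothing)
    End Function

    Sub Main()
        ' Shortened sample message (assumed data), joined with CRLF as mail normally is.
        Dim sample = String.Join(vbCrLf, {
            "Content-Type: multipart/alternative; boundary=abc123",
            "",
            "--abc123",
            "Content-Type: text/plain; charset=UTF-8",
            "",
            "lorem ipsum",
            "",
            "--abc123",
            "Content-Type: text/html; charset=UTF-8",
            "",
            "<div>lorem ipsum</div>",
            "",
            "--abc123--"})

        Dim body = GetPlainTextPart(sample)
        Console.WriteLine(If(body, "(no text/plain part found)"))
    End Sub

End Module
```

Returning Nothing when there is no match leaves it to the caller to decide what to do with a message that has no text/plain part.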

Update

Here's the complete code to paste into LINQPad:

Dim content = <![CDATA[Content-Type: multipart/alternative; boundary=001a11391134f9593b05083dbd67
X-Antivirus: avast! (VPS 141119-1, 19/11/2014), Inbound message
X-Antivirus-Status: Clean

--001a11391134f9593b05083dbd67
Content-Type: text/plain; charset=UTF-8

(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)

--001a11391134f9593b05083dbd67
Content-Type: text/html; charset=UTF-8

<div dir="ltr">lorem ipsum dolor sit amet</div>

--001a11391134f9593b05083dbd67--
.]]>.Value

Dim match = Regex.Match(content, "(\n--[0-9a-f]+)\nContent-Type: text/plain.*?\n\n(.*?)\1", RegexOptions.Multiline Or RegexOptions.Singleline)
Console.WriteLine("** Start **")
match.Groups(2).Value.Dump
Console.WriteLine("** End **")

And here's the output - I added the start and end to show that the blank line is also captured:

** Start **
(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)

** End **

After playing around with the expression I provided in my comment, it looks like non-capturing groups still get included in the match, so:

Dim match As Match = Regex.Match(input, "(Content-Type: text/plain; charset=UTF-8\s+)((?!\s+--).|\n)*")
Dim result As String = match.Groups(0).Value.Replace(match.Groups(1).Value, "")

Unfortunately, this is not as clean as a straight expression match, but it should return the result you're looking for. If you want to retain the left-edge spacing, as seen in your example, use this expression:

(Content-Type: text/plain; charset=UTF-8)((?!\s+--).|\n)*

This is not really something that a regex is good at. What you need to do is find the boundary specifier and, using that, find the section you want.

"Until I find two --" is doomed to failure as "dash dash space return" is used to indicate a signature follows, which the mail client should not include in a reply. Although I suspect that got lost in the '90s. And it would not be unusual for someone to use "--" in an email anyway.

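To make the signature problem concrete, here is a small sketch with an assumed message body that contains a "-- " signature line. It compares a pattern that stops at any line beginning with two dashes against one that stops only at the declared boundary. The module name and the sample text are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SignatureSketch

    Sub Main()
        Dim boundary = "--001a11391134f9593b05083dbd67"

        ' Assumed sample part: the body ends with a "-- " signature delimiter and a name.
        Dim part = String.Join(vbLf, {
            "Content-Type: text/plain; charset=UTF-8",
            "",
            "See you tomorrow.",
            "-- ",
            "Alice",
            "",
            boundary})

        ' Naive: stop at the first line that begins with two dashes.
        Dim naive = Regex.Match(part, "\n\n(.*?)\n--", RegexOptions.Singleline)

        ' Boundary-aware: stop only at the delimiter declared in the header.
        Dim aware = Regex.Match(part, "\n\n(.*?)\n" & Regex.Escape(boundary), RegexOptions.Singleline)

        Console.WriteLine(naive.Groups(1).Value.Contains("Alice")) ' prints False: the signature was cut off
        Console.WriteLine(aware.Groups(1).Value.Contains("Alice")) ' prints True
    End Sub

End Module
```
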
The following unrefined code simply finds the first section. You just need to inspect the first line of the found data and check whether it is what you want (probably Content-Type: text/plain; charset=UTF-8, or some other charset, which you may need to use). If not, try the next section:

Option Infer On

Imports System
Imports System.IO
Imports System.Linq ' needed for Skip and Take

Module Module1

    Function GetBoundarySpecifier(s As String()) As String
        Dim boundarySpecifier = ""

        Dim boundarySpecifierMarker = "Content-Type: multipart/alternative; boundary="
        For i = 0 To s.Length - 1
            If s(i).StartsWith(boundarySpecifierMarker, StringComparison.InvariantCultureIgnoreCase) Then
                ' N.B. the boundary specifier may be enclosed in double-quotes - RFC 2046 section 5.1.1
                boundarySpecifier = s(i).Substring(boundarySpecifierMarker.Length).Trim(""""c)
                ' stop at the first declaration so a nested multipart part cannot override it
                Exit For
            End If
        Next
        Return boundarySpecifier
    End Function

    Function LineIndex(stringToInspect As String(), soughtString As String, startIndex As Integer) As Integer
        ' find the first line starting at startIndex which matches the sought string
        For i = startIndex To stringToInspect.Length - 1
            If stringToInspect(i) = soughtString Then
                Return i
            End If
        Next

        Return -1

    End Function

    Sub Main()
        ' the sample data is stored in a text file for this example:
        Dim srcFile = "C:\temp\sampleEmail.txt"

        ' RFC 2821 section 2.3.7 specifies that lines end with CRLF
        Dim srcData = File.ReadAllLines(srcFile)

        Dim boundarySpecifier = GetBoundarySpecifier(srcData)

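        If boundarySpecifier.Length > 0 Then
            ' each part starts with a line consisting of "--" followed by the boundary
            boundarySpecifier = "--" & boundarySpecifier
            Dim idx1 = LineIndex(srcData, boundarySpecifier, 0)
            Dim idx2 = If(idx1 >= 0, LineIndex(srcData, boundarySpecifier, idx1 + 1), -1)

            ' LineIndex returns -1 when the delimiter is missing, so check before slicing
            If idx1 >= 0 AndAlso idx2 > idx1 Then
                Dim messageData = srcData.Skip(idx1 + 1).Take(idx2 - idx1 - 1)
                Console.WriteLine(String.Join(vbCrLf, messageData))
                Console.WriteLine("--end--")
            Else
                Console.WriteLine("Did not find the part.")
            End If
        Else
            Console.WriteLine("Did not find the boundary specifier.")
        End If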

        Console.ReadLine()

    End Sub

End Module

Outputs:

Content-Type: text/plain; charset=UTF-8

(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)

--end--
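
The "if not, try the next section" step could be sketched as below: walk the lines, collect each part between delimiter lines, and return the body of the first part whose headers declare text/plain. This is only a sketch under assumptions: the helper names (FindPlainTextBody, IsPlainText, BodyOf) are hypothetical, the sample data is shortened and puts the HTML part first on purpose, and nested multipart parts are not handled.

```vbnet
Option Infer On

Imports System
Imports System.Collections.Generic
Imports System.Linq

Module PartSelectionSketch

    ' Hypothetical helper: returns the body lines of the first text/plain part, or Nothing.
    Function FindPlainTextBody(lines As String(), boundary As String) As List(Of String)
        Dim delimiter = "--" & boundary
        Dim closing = delimiter & "--"
        Dim current As List(Of String) = Nothing

        For Each line In lines
            If line = delimiter OrElse line = closing Then
                ' a delimiter line ends the part collected so far
                If current IsNot Nothing AndAlso IsPlainText(current) Then
                    Return BodyOf(current)
                End If
                ' start a new part, unless this was the closing delimiter
                current = If(line = closing, Nothing, New List(Of String))
            ElseIf current IsNot Nothing Then
                current.Add(line)
            End If
        Next

        Return Nothing
    End Function

    Function IsPlainText(part As List(Of String)) As Boolean
        ' the headers are the lines before the first blank line
        Dim headers = part.TakeWhile(Function(l) l.Length > 0)
        Return headers.Any(Function(l) l.StartsWith("Content-Type: text/plain", StringComparison.OrdinalIgnoreCase))
    End Function

    Function BodyOf(part As List(Of String)) As List(Of String)
        ' skip the headers and the blank line that terminates them
        Return part.SkipWhile(Function(l) l.Length > 0).Skip(1).ToList()
    End Function

    Sub Main()
        ' Shortened sample (assumed data) with the HTML part first.
        Dim lines = {
            "Content-Type: multipart/alternative; boundary=abc123",
            "",
            "--abc123",
            "Content-Type: text/html; charset=UTF-8",
            "",
            "<div>lorem ipsum</div>",
            "",
            "--abc123",
            "Content-Type: text/plain; charset=UTF-8",
            "",
            "lorem ipsum",
            "",
            "--abc123--"}

        Dim body = FindPlainTextBody(lines, "abc123")
        If body Is Nothing Then
            Console.WriteLine("No text/plain part found.")
        Else
            Console.WriteLine(String.Join(vbCrLf, body))
        End If
    End Sub

End Module
```

The boundary is passed in as a parameter here, so in practice it would come from something like GetBoundarySpecifier above.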
