VB.Net Regex to get string

I have this string, and I want to get only the section that contains the actual email text, without the HTML tags. The lines I want are marked with "(this lines)".

    Content-Type: multipart/alternative; boundary=001a11391134f9593b05083dbd67
    X-Antivirus: avast! (VPS 141119-1, 19/11/2014), Inbound message
    X-Antivirus-Status: Clean

    --001a11391134f9593b05083dbd67
    Content-Type: text/plain; charset=UTF-8

    (this lines) lorem ipsum (this lines)
    (this lines) dolor sit amet (this lines)

    --001a11391134f9593b05083dbd67
    Content-Type: text/html; charset=UTF-8

    <div dir="ltr">lorem ipsum dolor sit amet</div>

    --001a11391134f9593b05083dbd67--
    .

I think the regex is something like ^Content-Type: text/plain.*.?$ (until it finds two "--"), but I don't know how to do it.

Thank you!

I'm no regex expert, so I may get the terminology wrong, but this should find the text/plain content up to the next matching boundary (the \1 backreference matches the first capture group):

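' Requires: Imports System.Text.RegularExpressions
Dim content As String ' your string
' Group 1 captures the boundary line; group 2 captures the text/plain body
' up to the next occurrence of that same boundary.
Dim match = Regex.Match(
    content,
    "(\n--[0-9a-f]+)\nContent-Type: text/plain.*?\n\n(.*?)\1",
    RegexOptions.Multiline Or RegexOptions.Singleline
)
Dim textContent = match.Groups(2).Value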

You'll probably need some error handling (maybe use Regex.Matches instead), and you may need to adjust a few things for the real content.
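One way to add the error handling mentioned above is to check `Match.Success` and to accept both CRLF and LF line endings. The following is a minimal sketch rather than a tested drop-in. The name `ExtractPlainText` is hypothetical. It assumes that the boundary sits on its own line starting with `--`, and that `Content-Type` is the first header of the text/plain part.

```vbnet
Imports System.Text.RegularExpressions

Module PlainTextExtractor

    ' Hypothetical helper: returns the text/plain body, or Nothing when no part matches.
    Function ExtractPlainText(content As String) As String
        ' Group 1: the boundary line; group 2: the body up to the next occurrence of that boundary.
        ' \r?\n accepts both CRLF and bare LF line endings.
        Dim pattern As String = "(\r?\n--[^\r\n]+)\r?\nContent-Type: text/plain.*?\r?\n\r?\n(.*?)\1"
        Dim m As Match = Regex.Match(content, pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)

        If Not m.Success Then
            Return Nothing
        End If

        Return m.Groups(2).Value
    End Function

End Module
```

A caller can then test the result against `Nothing` before using it, instead of silently getting an empty string from `match.Groups(2).Value`.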

Update

Here's the complete code to paste into LINQPad:

Dim content = <![CDATA[Content-Type: multipart/alternative; boundary=001a11391134f9593b05083dbd67
X-Antivirus: avast! (VPS 141119-1, 19/11/2014), Inbound message
X-Antivirus-Status: Clean

--001a11391134f9593b05083dbd67
Content-Type: text/plain; charset=UTF-8

(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)

--001a11391134f9593b05083dbd67
Content-Type: text/html; charset=UTF-8

<div dir="ltr">lorem ipsum dolor sit amet</div>

--001a11391134f9593b05083dbd67--
.]]>.Value

Dim match = Regex.Match(content, "(\n--[0-9a-f]+)\nContent-Type: text/plain.*?\n\n(.*?)\1", RegexOptions.Multiline Or RegexOptions.Singleline)
Console.WriteLine("** Start **")
match.Groups(2).Value.Dump
Console.WriteLine("** End **")

And here's the output. I added the start and end markers to show that the blank line is also captured:

** Start **
(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)

** End **

After playing around with the expression I provided in my comment, it looks like non-capturing groups still get included in the match, so:

Dim match As Match = Regex.Match(input, "(Content-Type: text/plain; charset=UTF-8\s+)((?!\s+--).|\n)*")
Dim result As String = match.Groups(0).Value.Replace(match.Groups(1).Value, "")

Unfortunately, it is not as clean as a straight expression match, but it should return the result you're looking for. If you want to retain the left-edge spacing, as seen in your example, use this expression:

(Content-Type: text/plain; charset=UTF-8)((?!\s+--).|\n)*

This is not really something a regex is going to be good at. What you need to do is find the boundary specifier and, using that, find the section you want.

"Until I find two --" is doomed to failure, as "dash dash space return" is used to indicate that a signature follows, which the mail client should not include in a reply (although I suspect that convention got lost in the '90s). And it would not be unusual for someone to use "--" in an email anyway.

Following is unrefined code which simply finds the first section. You just need to inspect the first line of the found data and check whether it is what you want (probably Content-Type: text/plain; charset=UTF-8, or some other charset which you may need to handle). If not, try the next section:

Option Infer On

Imports System.IO
Imports System.Linq ' for Skip and Take

Module Module1

    Function GetBoundarySpecifier(s As String()) As String
        Dim boundarySpecifier = ""

        Dim boundarySpecifierMarker = "Content-Type: multipart/alternative; boundary="
        For i = 0 To s.Length - 1
            If s(i).StartsWith(boundarySpecifierMarker, StringComparison.InvariantCultureIgnoreCase) Then
                ' N.B. the boundary specifier may be enclosed in double-quotes - RFC 2046 section 5.1.1
                boundarySpecifier = s(i).Substring(boundarySpecifierMarker.Length).Trim(""""c)
            End If
        Next
        Return boundarySpecifier
    End Function

    Function LineIndex(stringToInspect As String(), soughtString As String, startIndex As Integer) As Integer
        ' find the first line starting at startIndex which matches the sought string
        For i = startIndex To stringToInspect.Length - 1
            If stringToInspect(i) = soughtString Then
                Return i
            End If
        Next

        Return -1

    End Function

    Sub Main()
        ' the sample data is stored in a text file for this example:
        Dim srcFile = "C:\temp\sampleEmail.txt"

        ' RFC 2821 section 2.3.7 specifies that lines end with CRLF
        Dim srcData = File.ReadAllLines(srcFile)

        Dim boundarySpecifier = GetBoundarySpecifier(srcData)

        If boundarySpecifier.Length > 0 Then
            boundarySpecifier = "--" & boundarySpecifier
            Dim idx1 = LineIndex(srcData, boundarySpecifier, 0)
            Dim idx2 = LineIndex(srcData, boundarySpecifier, idx1 + 1)
            Dim messageData = srcData.Skip(idx1 + 1).Take(idx2 - idx1 - 1)

            Console.WriteLine(String.Join(vbCrLf, messageData))
            Console.WriteLine("--end--")
        Else
            Console.WriteLine("Did not find the part.")
        End If

        Console.ReadLine()

    End Sub

End Module

Outputs:

Content-Type: text/plain; charset=UTF-8

(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)

--end--
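The "try the next section" step described earlier can be sketched as a loop over the boundary lines. This is an illustration under assumptions, not a full MIME parser. The name `FindPlainTextPart` is hypothetical. The input is the array of lines plus the boundary without its leading `--`, and each part's headers are assumed to end at the first blank line.

```vbnet
Imports System
Imports System.Collections.Generic

Module PartFinder

    ' Hypothetical helper: returns the body lines of the first text/plain part,
    ' or Nothing when there is none.
    Function FindPlainTextPart(lines As String(), boundary As String) As List(Of String)
        Dim delimiter As String = "--" & boundary
        Dim closingDelimiter As String = delimiter & "--"
        Dim i As Integer = 0

        While i < lines.Length
            If lines(i) <> delimiter Then
                i += 1
                Continue While
            End If

            ' Read this part's headers, which run up to the first blank line.
            Dim isPlainText As Boolean = False
            i += 1
            While i < lines.Length AndAlso lines(i).Length > 0
                If lines(i).StartsWith("Content-Type: text/plain", StringComparison.OrdinalIgnoreCase) Then
                    isPlainText = True
                End If
                i += 1
            End While
            i += 1 ' skip the blank line that ends the headers

            ' Collect the body up to the next delimiter or the closing delimiter.
            Dim body As New List(Of String)
            While i < lines.Length AndAlso lines(i) <> delimiter AndAlso lines(i) <> closingDelimiter
                body.Add(lines(i))
                i += 1
            End While

            If isPlainText Then
                Return body
            End If
            ' Otherwise i points at the next delimiter, so the outer loop carries on from there.
        End While

        Return Nothing
    End Function

End Module
```

With the `srcData` array from the code above, this would be called with the value returned by `GetBoundarySpecifier`, before the `--` prefix is added to it.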
