I have this string, and I want to extract only the section that contains the actual email text, without HTML tags. The lines I want are marked with (this lines):
Content-Type: multipart/alternative; boundary=001a11391134f9593b05083dbd67
X-Antivirus: avast! (VPS 141119-1, 19/11/2014), Inbound message
X-Antivirus-Status: Clean
--001a11391134f9593b05083dbd67
Content-Type: text/plain; charset=UTF-8
(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)
--001a11391134f9593b05083dbd67
Content-Type: text/html; charset=UTF-8
<div dir="ltr">lorem ipsum dolor sit amet</div>
--001a11391134f9593b05083dbd67--
.
I think the regex is something like ^Content-Type: text/plain.*.?$ (matching until it finds two "--"), but I don't know how to write it.
Thank you!
I'm no regex expert, so I may get the terminology wrong, but this should find the text/plain content up to the next matching boundary (the \1 backreference matches the text of the first capture group):
Dim content As String ' your string
Dim match = Regex.Match(
content,
"(\n--[0-9a-f]+)\nContent-Type: text/plain.*?\n\n(.*?)\1",
RegexOptions.Multiline Or RegexOptions.Singleline
)
Dim textContent = match.Groups(2).Value
You'll probably need some error handling (for example, check match.Success, or use Regex.Matches instead) and may need to adjust a few things for the real content.
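As a minimal sketch of the error handling mentioned above, the match can be wrapped in a helper that returns Nothing when no text/plain part is found. The helper name ExtractPlainText and the shortened sample boundary are hypothetical, and the \r?\n in the pattern is an assumption added here so that both LF and CRLF line endings are accepted; the rest follows the expression above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PlainTextSketch
    ' Hypothetical helper: returns the text/plain body, or Nothing if no part matches.
    Function ExtractPlainText(content As String) As String
        ' \r?\n accepts LF or CRLF (an assumption, not in the original pattern);
        ' \1 repeats the captured boundary line, including its leading line break.
        Dim pattern As String = "(\r?\n--[0-9a-f]+)\r?\nContent-Type: text/plain.*?\r?\n\r?\n(.*?)\1"
        Dim m As Match = Regex.Match(content, pattern, RegexOptions.Singleline)
        If Not m.Success Then
            Return Nothing
        End If
        Return m.Groups(2).Value
    End Function

    Sub Main()
        ' Shortened, made-up sample with CRLF line endings.
        Dim sample As String = String.Join(vbCrLf, New String() {
            "Content-Type: multipart/alternative; boundary=001a",
            "",
            "--001a",
            "Content-Type: text/plain; charset=UTF-8",
            "",
            "lorem ipsum",
            "",
            "--001a",
            "Content-Type: text/html; charset=UTF-8",
            "",
            "<div>lorem ipsum</div>",
            "--001a--"
        })

        Dim body As String = ExtractPlainText(sample)
        If body Is Nothing Then
            Console.WriteLine("No text/plain part found.")
        Else
            Console.WriteLine(body)
        End If
    End Sub
End Module
```

Returning Nothing keeps the "not found" case distinct from an empty body, which would come back as an empty string.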
Update
Here's the complete code to paste into LINQPad:
Dim content = <![CDATA[Content-Type: multipart/alternative; boundary=001a11391134f9593b05083dbd67
X-Antivirus: avast! (VPS 141119-1, 19/11/2014), Inbound message
X-Antivirus-Status: Clean
--001a11391134f9593b05083dbd67
Content-Type: text/plain; charset=UTF-8
(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)
--001a11391134f9593b05083dbd67
Content-Type: text/html; charset=UTF-8
<div dir="ltr">lorem ipsum dolor sit amet</div>
--001a11391134f9593b05083dbd67--
.]]>.Value
Dim match = Regex.Match(content, "(\n--[0-9a-f]+)\nContent-Type: text/plain.*?\n\n(.*?)\1", RegexOptions.Multiline Or RegexOptions.Singleline)
Console.WriteLine("** Start **")
match.Groups(2).Value.Dump
Console.WriteLine("** End **")
And here's the output - I added the start and end to show that the blank line is also captured:
** Start **
(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)
** End **
After playing around with the expression I provided in my comment, it looks like non-capturing groups are still included in the overall match, so the header has to be stripped from the result afterwards:
Dim match As Match = Regex.Match(input, "(Content-Type: text/plain; charset=UTF-8\s+)((?!\s+--).|\n)*")
Dim result As String = match.Groups(0).Value.Replace(match.Groups(1).Value, "")
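Unfortunately, this is not as clean as a straight expression match, but it should return the result you're looking for, provided the lines end in CRLF as email normally does; with bare LF line endings, the \n alternative consumes the newline before the boundary and the match runs on past it. If you want to retain the left-edge spacing, as seen in your example, use this expression: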
(Content-Type: text/plain; charset=UTF-8)((?!\s+--).|\n)*
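One possible way to avoid the Replace step is to capture the body in a named group and read that group directly. The following is only a sketch under assumptions: the group name body and the helper ExtractBody are hypothetical, and it swaps the (…|\n)* alternation for RegexOptions.Singleline, which is a change from the expression above (it stops at the first run of whitespace followed by "--" with either LF or CRLF line endings).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    ' Hypothetical helper: returns the text after the text/plain header,
    ' up to (not including) the first whitespace run that is followed by "--".
    Function ExtractBody(input As String) As String
        ' With Singleline, "." also matches line breaks, so no "|\n" alternative is needed.
        Dim pattern As String = "Content-Type: text/plain; charset=UTF-8\s+(?<body>(?:(?!\s+--).)*)"
        Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)
        Return If(m.Success, m.Groups("body").Value, Nothing)
    End Function

    Sub Main()
        ' Shortened, made-up sample; the boundary value is illustrative only.
        Dim sample As String = String.Join(vbCrLf, New String() {
            "--001a",
            "Content-Type: text/plain; charset=UTF-8",
            "",
            "lorem ipsum",
            "dolor sit amet",
            "",
            "--001a",
            "Content-Type: text/html; charset=UTF-8",
            "",
            "<div>lorem ipsum dolor sit amet</div>",
            "--001a--"
        })

        Dim body As String = ExtractBody(sample)
        If body Is Nothing Then
            Console.WriteLine("No text/plain part found.")
        Else
            Console.WriteLine(body)
        End If
    End Sub
End Module
```

Like the expression it is based on, this still treats any whitespace followed by "--" as the end of the text, so it shares that limitation.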
This is not really something which a regex is going to be good at. What you need to do is find the boundary specifier and, using that, find the section which you want.
"Until I find two --" is doomed to failure, as "dash dash space return" is used to indicate that a signature follows, which the mail client should not include in a reply (although I suspect that got lost in the '90s). And it would not be unusual for someone to use "--" in an email anyway.
Following is unrefined code which simply finds the first section. You just need to inspect the first line of the found data and check whether it is what you want (probably Content-Type: text/plain; charset=UTF-8 or some other charset, which you may need to use). If not, try the next section:
Option Infer On
Imports System.IO
Imports System.Linq
Module Module1
    Function GetBoundarySpecifier(s As String()) As String
        Dim boundarySpecifier = ""
        Dim boundarySpecifierMarker = "Content-Type: multipart/alternative; boundary="
        For i = 0 To s.Length - 1
            If s(i).StartsWith(boundarySpecifierMarker, StringComparison.InvariantCultureIgnoreCase) Then
                ' N.B. the boundary specifier may be enclosed in double-quotes - RFC 2046 section 5.1.1
                boundarySpecifier = s(i).Substring(boundarySpecifierMarker.Length).Trim(""""c)
            End If
        Next
        Return boundarySpecifier
    End Function
    Function LineIndex(stringToInspect As String(), soughtString As String, startIndex As Integer) As Integer
        ' find the first line starting at startIndex which matches the sought string
        For i = startIndex To stringToInspect.Length - 1
            If stringToInspect(i) = soughtString Then
                Return i
            End If
        Next
        Return -1
    End Function
    Sub Main()
        ' the sample data is stored in a text file for this example:
        Dim srcFile = "C:\temp\sampleEmail.txt"
        ' RFC 2821 section 2.3.7 specifies that lines end with CRLF;
        ' ReadAllLines splits on CRLF (and also on a bare CR or LF)
        Dim srcData = File.ReadAllLines(srcFile)
        Dim boundarySpecifier = GetBoundarySpecifier(srcData)
        If boundarySpecifier.Length > 0 Then
            boundarySpecifier = "--" & boundarySpecifier
            Dim idx1 = LineIndex(srcData, boundarySpecifier, 0)
            Dim idx2 = LineIndex(srcData, boundarySpecifier, idx1 + 1)
            ' LineIndex returns -1 when a boundary line is missing
            If idx1 >= 0 AndAlso idx2 > idx1 Then
                Dim messageData = srcData.Skip(idx1 + 1).Take(idx2 - idx1 - 1)
                Console.WriteLine(String.Join(vbCrLf, messageData))
                Console.WriteLine("--end--")
            Else
                Console.WriteLine("Did not find the part.")
            End If
        Else
            Console.WriteLine("Did not find the boundary specifier.")
        End If
        Console.ReadLine()
    End Sub
End Module
Outputs:
Content-Type: text/plain; charset=UTF-8
(this lines) lorem ipsum (this lines)
(this lines) dolor sit amet (this lines)
--end--
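The "try the next section" step could be sketched as a loop over consecutive boundary lines, roughly as follows. This is an illustration rather than a complete MIME parser: the helper name FindPart and the shortened sample boundary are hypothetical, and it assumes that Content-Type is the first header line of each part and that the last part ends with the closing delimiter (the boundary followed by "--").

```vbnet
Imports System
Imports System.Linq

Module FindPlainTextPart
    ' Hypothetical helper: returns the lines of the first part whose first line
    ' starts with wantedType, or Nothing if no such part exists.
    Function FindPart(lines As String(), boundary As String, wantedType As String) As String()
        Dim delimiter As String = "--" & boundary
        Dim closing As String = delimiter & "--"
        Dim start As Integer = Array.IndexOf(lines, delimiter)

        While start >= 0
            ' The part ends at the next delimiter or at the closing delimiter.
            Dim finish As Integer = -1
            For i As Integer = start + 1 To lines.Length - 1
                If lines(i) = delimiter OrElse lines(i) = closing Then
                    finish = i
                    Exit For
                End If
            Next
            If finish < 0 Then
                Exit While
            End If

            ' Assumption: the Content-Type header is the first line of the part.
            If finish > start + 1 AndAlso
               lines(start + 1).StartsWith(wantedType, StringComparison.OrdinalIgnoreCase) Then
                Return lines.Skip(start + 1).Take(finish - start - 1).ToArray()
            End If

            ' Stop at the closing delimiter; otherwise move on to the next part.
            start = If(lines(finish) = closing, -1, finish)
        End While

        Return Nothing
    End Function

    Sub Main()
        ' Shortened, made-up sample in which the text/plain part comes second.
        Dim lines As String() = New String() {
            "Content-Type: multipart/alternative; boundary=001a",
            "",
            "--001a",
            "Content-Type: text/html; charset=UTF-8",
            "",
            "<div>lorem ipsum</div>",
            "--001a",
            "Content-Type: text/plain; charset=UTF-8",
            "",
            "lorem ipsum",
            "--001a--"
        }

        Dim part As String() = FindPart(lines, "001a", "Content-Type: text/plain")
        If part Is Nothing Then
            Console.WriteLine("No text/plain part found.")
        Else
            Console.WriteLine(String.Join(vbCrLf, part))
        End If
    End Sub
End Module
```

As in the code above, the returned lines still include the part's own header and the blank line after it, so the caller would skip those before using the text.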