Using RegEx in VB.NET

Question

Here is what I need to do (for clarity): take a PDF file (link at the bottom), then parse only the information under each header into a DataGridView. I couldn't think of a way to do this (seeing as there is no native way to handle PDFs), so my only thought was to convert it to a text document and then (somehow) take the text from that document and put it into the DataGridView.

So, using iTextSharp, I first convert the PDF to a text file, which keeps "most" of its formatting (see link below).

This is the source for that:

    Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
    Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
    Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
    Dim mPageCount As Integer = mPDFreader.NumberOfPages()
    Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
    'Create the text file.
    Dim fs As FileStream = File.Create(mTXT)
    Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    For i As Integer = 1 To mPageCount
        strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
        Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
        fs.Write(info, 0, info.Length)
    Next
    fs.Close()

However, I only need the "lines" of information, so everything should look like this:

63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

To do that, I needed to use RegEx to remove everything I didn't want. Here is the RegEx I used:

The RegEx is 
(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)

Here is the code I used:

Private Sub Fixtext()

        Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
        Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
            While (True)
                Dim line As String = reader.ReadLine()
                If line = Nothing Then
                    Return
                End If
                Dim match As Match = regex.Match(line)
                If match.Success Then
                    Dim value As String = match.Groups(1).Value
                    Console.WriteLine(line)
                End If
            End While
        End Using
End Sub

The results are "close" but not exactly what I need. In some cases the lines are "crammed" together, and there are still leftover parts. An example would be:

90 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS
491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS
Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 
493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS

The format I actually need is (again) one I can use to import the data into a DataGridView later, so each line needs to be:

[number][ID][ID2][Date][Notes] 
[number][ID][ID2][Date][Notes]
[number][ID][ID2][Date][Notes] 
[number][ID][ID2][Date][Notes]

Using this "concept", here is an example of what I need (I know this doesn't work, but something along these lines that does work):

  Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
            Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
                While (True)
                    Dim line As String = reader.ReadLine()
                    If line = Nothing Then
                        Return
                    End If
                    Dim match As Match = regex.Match(line)
                    If match.Success Then
                        Dim value As String = match.Groups(1).Value
                        Dim s As String = value
                        s = s.Replace(" Tracking Id Forms Required Order Id RTS done on Notes", Nothing)
                        s = s.Replace("EXPRESS ", "EXPRESS")
                        s = s.Replace("EXPRESS", "EXPRESS" & vbCrLf)
                        Console.WriteLine(line)
                    End If
                End While
            End Using

Here is a "brief" explanation with the files included.

A copy of the original PDF (this is the PDF being converted to .txt using iText). I am only doing this because I can't think of another way (short of paying for a third-party tool to convert a PDF to XLS):

https://drive.google.com/file/d/1iHMM_G4UBUlKaa44-Wb00F_9ZdG-vYpM/view?usp=sharing

Using the "iText method" mentioned above, this is the converted output file:

https://drive.google.com/file/d/10dgJDFW5XlhsB0_0QAWQvtimsDoMllx-/view?usp=sharing

I then use the Regex mentioned above to strip out what I don't need. However, it isn't working.

So my questions are (for "clarity"):

1. Is this the only or best method to do what I need done (convert the PDF to text, remove what I don't need, then put that information into a DataGridView)? Or is there another, cleaner, better method?
2. (If not 1) How can I make this work? Is something wrong with my RegEx or my logic? Am I missing something better or cleaner that someone can help me see?
3. (If 2 and not 1) What is the best way to take the results and place them in the proper DataGridView column?

Final statement: it doesn't have to be this method. I will take "ANY" method that lets me do what I need; the cleaner the better. However, I have to avoid third-party libraries that are free with limitations, as well as paid third-party libraries. That leaves me with limited options (i.e. PDFBox, iText, iTextSharp). And it has to get me from a PDF (like the sample above) to that table information in a DataGridView or even a ListView.

I will take any help, and I am more than appreciative. I re-asked this question because a mod closed my original one, stating that it wasn't clear what I needed. In both cases I tried to make the question as "thorough" as possible, and I hope this version is "clearer" so it doesn't get closed abruptly.

Answer 1

Try this regex and see if it works for your requirement:

\b[0-9].*(FMPC|OD).*(EXPRESS|Replacement\sOrder)\b
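
A stricter alternative, offered only as a sketch: rather than matching line by line, the whole file could be scanned with `Regex.Matches`, so records that were "crammed" onto one line are still found individually. The pattern below is not the one above; it is a hypothetical variant that assumes every record looks like the samples in the question (a number, an `FMP?` ID with 10 digits, an `OD` ID with 18 digits, a date such as `Aug 28, 2020 03:21 PM`, then `EXPRESS` with an optional `, Replacement order`). The names `RecordExtractor`, `RecordPattern` and `ExtractRecords` are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RecordExtractor
    ' Hypothetical pattern derived from the sample lines in the question.
    ' Named groups map onto the columns [number][ID][ID2][Date][Notes].
    Private ReadOnly RecordPattern As New Regex(
        "(?<num>\d+)\s+(?<id>FMP[A-Z]\d{10})\s+(?<id2>OD\d{18})\s+" &
        "(?<date>[A-Z][a-z]{2}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[AP]M)\s+" &
        "(?<notes>EXPRESS(?:,\s*Replacement\s+[Oo]rder)?)")

    ' Returns one String array per record found anywhere in the text,
    ' regardless of where the line breaks fall.
    Public Function ExtractRecords(text As String) As List(Of String())
        Dim records As New List(Of String())
        For Each m As Match In RecordPattern.Matches(text)
            records.Add(New String() {
                m.Groups("num").Value,
                m.Groups("id").Value,
                m.Groups("id2").Value,
                m.Groups("date").Value,
                m.Groups("notes").Value})
        Next
        Return records
    End Function

    Sub Main()
        ' Two of the "crammed" lines from the question, joined as they appear in the output.
        Dim raw As String =
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS " &
            Environment.NewLine &
            "493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS"

        For Each record In ExtractRecords(raw)
            Console.WriteLine(String.Join(" | ", record))
        Next
    End Sub
End Module
```

Each returned array could then be passed to `DataGridView.Rows.Add` or turned into an object for data binding. If the real file contains IDs in other formats, the pattern would need to be loosened accordingly.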

Answer 2

I cheated a bit by correcting the text file. It goes a little wonky at page breaks and misses starting a new line. Perhaps you can correct that with iTextSharp or the hard-to-maintain regex.

I made a class to hold the data. The property names become the column headers in the DataGridView.

I read all the lines in the text file into an array. I checked the first character of each line to see if it was a digit, then split the line into another array on the space character. Next I created a new Tracking object, fleshing out all its properties with the parameterized constructor.

Finally, I checked whether the line contained a comma and added that bit of text to the notes parameter. The completed object is added to the list.

After the loop, lst is bound to the grid.
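
The manual correction of the text file could probably be automated. The following is only a sketch, assuming every record starts with a number followed by an ID of the form `FMP?` plus 10 digits, as in the question's samples; `LineBreakFixer` and `SplitCrammedRecords` are hypothetical names. It inserts a line break in front of each record start, so the line-based parsing can then be applied to the result.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LineBreakFixer
    ' Matches the (possibly empty) whitespace in front of a record start.
    ' (?<!\d) stops the match from landing in the middle of a record number;
    ' the lookahead checks for "<number> FMP?<10 digits>" without consuming it.
    Private ReadOnly RecordStart As New Regex("\s*(?<!\d)(?=\d+\s+FMP[A-Z]\d{10}\s)")

    ' Puts every record on its own line; header text stays on a separate line.
    Public Function SplitCrammedRecords(text As String) As String
        Return RecordStart.Replace(text, Environment.NewLine).TrimStart()
    End Function

    Sub Main()
        Dim raw As String =
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS " &
            "493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS"

        Console.WriteLine(SplitCrammedRecords(raw))
    End Sub
End Module
```

The output could be written back with `File.WriteAllText`, or split on `Environment.NewLine` and fed straight into the loop in place of `File.ReadAllLines`.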

Public Class Tracking
    Public Property Number As Integer
    Public Property ID As String
    Public Property ID2 As String
    Public Property TrackDate As Date
    Public Property Notes As String
    Public Sub New(TNumber As Integer, TID As String, TID2 As String, TDate As DateTime, TNotes As String)
        Number = TNumber
        ID = TID
        ID2 = TID2
        TrackDate = TDate
        Notes = TNotes
    End Sub
End Class

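' Requires Imports System.IO and System.Globalization at the top of the file.
Private Sub OPCode()
    Dim lst As New List(Of Tracking)
    Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
    For Each line In lines
        ' Skip blank lines (line(0) would throw) and header lines that do not start with a digit.
        If line.Length > 0 AndAlso Char.IsDigit(line(0)) Then
            Dim parts = line.Split(" "c)
            ' parts(3) to parts(7) hold the date, e.g. "Aug 28, 2020 02:18 PM". The month
            ' abbreviations are English, so parse with the invariant culture.
            Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.InvariantCulture), parts(8))
            ' The date always contains a comma, so test the notes field itself: a trailing
            ' comma (as in "EXPRESS, Replacement order") means more note text follows.
            If parts(8).EndsWith(",") Then
                T.Notes &= " " & String.Join(" ", parts, 9, parts.Length - 9)
            End If
            lst.Add(T)
        End If
    Next
    DataGridView1.DataSource = lst
End Sub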

EDIT
To pinpoint the error, let's try...

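Private Sub OPCode()
    Dim lst As New List(Of Tracking)
    Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
    For Each line In lines
        ' Skip blank lines (line(0) would throw) and header lines that do not start with a digit.
        If line.Length > 0 AndAlso Char.IsDigit(line(0)) Then
            Dim parts = line.Split(" "c)
            ' Stop at the first line with fewer than nine fields and print it for inspection.
            If parts.Length < 9 Then
                Debug.Print(line)
                MessageBox.Show("We have a line that does not include all fields.")
                Exit Sub
            End If
            ' The month abbreviations are English, so parse with the invariant culture.
            Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.InvariantCulture), parts(8))
            ' The date always contains a comma, so test the notes field itself: a trailing
            ' comma (as in "EXPRESS, Replacement order") means more note text follows.
            If parts(8).EndsWith(",") Then
                T.Notes &= " " & String.Join(" ", parts, 9, parts.Length - 9)
            End If
            lst.Add(T)
        End If
    Next
    DataGridView1.DataSource = lst
End Sub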
