Using RegEx in vb.net

Here is what I need to do (for clarity): take a PDF file (link at the bottom), then parse only the information under each header into a DataGridView. I couldn't think of a way to do this directly, since there is no native way to handle PDFs, so my only thought was to convert the PDF to a text file and then (somehow) load that text into the DataGridView.

So, using iTextSharp, I first convert the PDF to a text file, which keeps "most" of its formatting (see the link below).

This is the code for that:

    ' Requires: Imports System.IO, System.Text, iTextSharp.text.pdf.parser
    Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
    Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
    Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
    Dim mPageCount As Integer = mPDFreader.NumberOfPages
    Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
    ' Create the text file and append the extracted text of each page to it.
    Using fs As FileStream = File.Create(mTXT)
        Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
        For i As Integer = 1 To mPageCount
            strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
            Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
            fs.Write(info, 0, info.Length)
        Next
    End Using
    mPDFreader.Close()

However, I only need the "lines" of information, so everything should look like this:

63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

To do that, I needed a RegEx to remove everything I didn't want. Here is the RegEx I used:

(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)

Here is the code I used.

Private Sub Fixtext()
    Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
    Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
        While (True)
            Dim line As String = reader.ReadLine()
            If line = Nothing Then
                Return
            End If
            Dim match As Match = regex.Match(line)
            If match.Success Then
                Dim value As String = match.Groups(1).Value
                Console.WriteLine(line)
            End If
        End While
    End Using
End Sub

The results are "close" but not exactly what I need. In some cases records are "crammed" together on one line, and unwanted parts are still left behind. An example would be:

90 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS
491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS
Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 
493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS

The format I actually need is (again) one that I can later import into a DataGridView, so each line needs to be:

[number][ID][ID2][Date][Notes] 
[number][ID][ID2][Date][Notes]
[number][ID][ID2][Date][Notes] 
[number][ID][ID2][Date][Notes] 

Using this "concept", here is an example of what I need (I know this doesn't work, but I need something along these lines that does):

    Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
    Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
        While (True)
            Dim line As String = reader.ReadLine()
            If line = Nothing Then
                Return
            End If
            Dim match As Match = regex.Match(line)
            If match.Success Then
                Dim value As String = match.Groups(1).Value
                Dim s As String = value
                s = s.Replace(" Tracking Id Forms Required Order Id RTS done on Notes", Nothing)
                s = s.Replace("EXPRESS ", "EXPRESS")
                s = s.Replace("EXPRESS", "EXPRESS" & vbCrLf)
                Console.WriteLine(line)
            End If
        End While
    End Using

Here is a "brief" explanation with the files included.

A copy of the original PDF (this is the PDF being converted to .txt using iText). I am only doing the conversion because I can't think of another way, short of paying for a third-party tool that converts PDF to XLS:

https://drive.google.com/file/d/1iHMM_G4UBUlKaa44-Wb00F_9ZdG-vYpM/view?usp=sharing

Using the iText method mentioned above, this is the converted output file:

https://drive.google.com/file/d/10dgJDFW5XlhsB0_0QAWQvtimsDoMllx-/view?usp=sharing

I then use the RegEx mentioned above to strip out what I don't need. However, it isn't working.

So my questions are (for "clarity"):

  1. Is this the only or best method to do what I need (convert the PDF to text, remove what I don't need, then load that information into a DataGridView), or is there another, cleaner, better method?

  2. (If not 1) How can I make this work? Is something wrong with my RegEx or my logic? Am I missing something better or cleaner that someone can help me see?

  3. (If 2 and not 1) What is the best way to take the results and place them in the proper DataGridView columns?

Final statement: it doesn't have to be this method. I will take "ANY" method that lets me do what I need, the cleaner the better. However, I have to avoid third-party libraries that are paid or that are free with limitations, which leaves me with few options (i.e. PDFBox, iText, iTextSharp). And it has to take me from a PDF (like the sample above) to that table information in a DataGridView or even a ListView.

I will take any help, and I am more than appreciative. Also, I re-asked this question because a mod closed my original question, stating "it wasn't clear what I needed". I tried in both cases to make the question as "thorough" as possible, but I hope this one is "clearer" so it doesn't get closed abruptly.

Try this regex and see if this works according to your requirement:

\b[0-9].*(FMPC|OD).*(EXPRESS|Replacement\sOrder)\b
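
A variation on the same idea, as a rough sketch rather than a tested solution: instead of matching line by line, run a pattern over the whole text with Regex.Matches and use named groups, so that records fused onto one line still come out one per match. The field shapes (FMP plus a letter and 10 digits, OD plus 18 digits, an optional ", Replacement order" after EXPRESS) are assumptions read off the sample rows in the question, and the names RowExtractor and ExtractRows are made up for the example. Two details in the question's code are also worth knowing: the pattern passed to New Regex has no capturing group, so match.Groups(1).Value is always empty, and in VB `line = Nothing` is also True for an empty string, so the loop returns at the first blank line (`line Is Nothing` only stops at the end of the file).

```vb
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RowExtractor
    ' Field shapes are inferred from the sample rows only; adjust them if the real data differs.
    Private ReadOnly RowPattern As New Regex(
        "(?<num>\d+)\s+(?<id>FMP[A-Z]\d{10})\s+(?<id2>OD\d{18})\s+" &
        "(?<date>[A-Z][a-z]{2}\s+\d{1,2},\s+\d{4}\s+\d{1,2}:\d{2}\s+[AP]M)\s+" &
        "(?<notes>EXPRESS(?:,\s*(?i:Replacement\s+order))?)")

    ' Returns one String() per record: number, ID, ID2, date text, notes.
    Public Function ExtractRows(text As String) As List(Of String())
        Dim rows As New List(Of String())
        For Each m As Match In RowPattern.Matches(text)
            rows.Add(New String() {
                m.Groups("num").Value,
                m.Groups("id").Value,
                m.Groups("id2").Value,
                Regex.Replace(m.Groups("date").Value, "\s+", " "),
                m.Groups("notes").Value})
        Next
        Return rows
    End Function
End Module
```

You would call it as ExtractRows(File.ReadAllText(path)). Because the header text ("Tracking Id Forms Required ...") never matches the pattern, it is simply skipped, and each returned array lines up with the [number][ID][ID2][Date][Notes] columns.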

I cheated a bit by correcting the text file by hand. The extraction goes a little wonky at page breaks and fails to start a new line there. Perhaps you can correct that with iTextSharp or with a hard-to-maintain regex.
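
If you would rather not fix the file by hand, one possible way to repair the fused lines is to insert a line break wherever a record start is glued to the preceding text. This is only a sketch: it assumes every record starts with a row number followed by an FMP-style ID, as in the sample rows, and LineRepair and SplitFusedRecords are hypothetical names.

```vb
Imports System.Text.RegularExpressions

Module LineRepair
    ' Zero-width position that is preceded by a character that is neither a digit nor
    ' whitespace, and followed by what looks like a record start (row number, FMP... ID).
    Private ReadOnly FusedRecordStart As New Regex(
        "(?<=[^\d\s])(?=\d+\s+FMP[A-Z]\d{10}\s)")

    ' Inserts a line break at every such position and returns the repaired text.
    Public Function SplitFusedRecords(text As String) As String
        Return FusedRecordStart.Replace(text, Environment.NewLine)
    End Function
End Module
```

With this, a line such as "...EXPRESS494 FMPP0532224318 ..." is split before 494, and "...Notes492 FMPP0532194482 ..." is split before 492, which leaves the header text on a line of its own that the first-character digit check then ignores.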

I made a class to hold the data. The property names become the column headers in the DataGridView.

I read all the lines in the text file into an array. I checked the first character of each line to see if it was a digit, then split the line into another array on the space character. Next I created a new Tracking object, filling in all its properties through the parameterized constructor.

Finally, I checked whether the line contained a comma and, if so, added that bit of text to the Notes property. The completed object is added to the list.

After the loop, the list (lst) is bound to the grid.

Public Class Tracking
    Public Property Number As Integer
    Public Property ID As String
    Public Property ID2 As String
    Public Property TrackDate As Date
    Public Property Notes As String
    Public Sub New(TNumber As Integer, TID As String, TID2 As String, TDate As DateTime, TNotes As String)
        Number = TNumber
        ID = TID
        ID2 = TID2
        TrackDate = TDate
        Notes = TNotes
    End Sub
End Class

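' Requires: Imports System.Globalization, System.IO
Private Sub OPCode()
    Dim lst As New List(Of Tracking)
    Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
    For Each line In lines
        ' Skip blank lines (line(0) would throw on them) and lines that do not start with a row number.
        If line.Length > 0 AndAlso Char.IsDigit(line(0)) Then
            Dim parts = line.Split(" "c)
            Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.CurrentCulture), parts(8))
            ' The date always contains a comma, so only look for one from the ninth field onward.
            Dim tail = String.Join(" ", parts, 8, parts.Length - 8).Trim()
            If tail.Contains(",") Then
                T.Notes = tail
            End If
            lst.Add(T)
        End If
    Next
    DataGridView1.DataSource = lst
End Sub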

EDIT
To pinpoint the error let's try...

Private Sub OPCode()
    Dim lst As New List(Of Tracking)
    Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
    For Each line In lines
        ' Skip blank lines (line(0) would throw on them) and lines that do not start with a row number.
        If line.Length > 0 AndAlso Char.IsDigit(line(0)) Then
            Dim parts = line.Split(" "c)
            If parts.Length < 9 Then
                ' Print the offending line to the Output window and stop.
                Debug.Print(line)
                MessageBox.Show("We have a line that does not include all fields.")
                Exit Sub
            End If
            Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.CurrentCulture), parts(8))
            ' The date always contains a comma, so only look for one from the ninth field onward.
            Dim tail = String.Join(" ", parts, 8, parts.Length - 8).Trim()
            If tail.Contains(",") Then
                T.Notes = tail
            End If
            lst.Add(T)
        End If
    Next
    DataGridView1.DataSource = lst
End Sub
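
If the error turns out to be in the date rather than in the field count, a non-throwing parse can help narrow it down. A small sketch, assuming the same nine-field layout as above; TryParseTrackDate is a hypothetical helper name. It uses CultureInfo.InvariantCulture because the month abbreviations in the file are English, and parsing "MMM" or "tt" with CurrentCulture can fail on a machine whose regional settings use different month names or AM/PM designators.

```vb
Imports System.Globalization

Module TrackingDateParser
    ' parts is the space-split line; fields 3 to 7 hold e.g. "Aug", "28,", "2020", "02:18", "PM".
    ' Returns False instead of throwing when the fields are missing or do not form a valid date.
    Public Function TryParseTrackDate(parts As String(), ByRef result As Date) As Boolean
        If parts Is Nothing OrElse parts.Length < 8 Then Return False
        Dim text As String = String.Join(" ", parts, 3, 5)
        Return Date.TryParseExact(text, "MMM d, yyyy hh:mm tt",
                                  CultureInfo.InvariantCulture, DateTimeStyles.None, result)
    End Function
End Module
```

Inside the loop you could call it before constructing the Tracking object and Debug.Print the line whenever it returns False, so a single bad row is reported instead of stopping the import with an exception.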
