First Things First: The macro is run from Excel VBA editor, but performs the biggest part of its job on the previously opened Word document, where it's goal is to find the full names of the people who are the contracting parties in the agreement being analized.
The issue I'm experiencing with the code is that it is variable number of words, that I need to pull from every consecutive paragraph. If the name is Will SMITH, then its two words I need to pull, when it's Carrie Ann MOSS, then it's three words, sometimes it can be Anna Nicole SMITH BURKE, than its four words but when it's Anna Nicole SMITH-BURKE, than its five words and so on.
The other idea to get this full name is, that it always ends with a coma, and this coma is always the first coma in this paragraph, where the full name appears.
ATTENTION !!! The Paragraphs we work with are not ListParagraphs
. They are the normal/ordinary ones albeit indented and numbered. I get these contracts from people who don't care to use numbered list :-(
So for the last time: The numbered list is not enabled on those paragraphs we work with.
This is how it looks like in Word and the selected words are the names and surnames that the macro is supposed to extract from the document - excluding the coma after the last surname.
Sub FindNamesCleanDraftWithLoop()
'Variables declaration
Dim WordApp As Word.Application
Dim WordDoc As Word.Document
Dim ExcelApp As Excel.Application
Dim MySheet As Excel.Worksheet
Dim Para As Word.Paragraph
Dim Rng As Word.Range
Dim RngStart As Word.Range
Dim RngEnd As Word.Range
Dim TextToFind1 As String
Dim TextToFind2 As String
Dim firstName As String
Dim startPos As Long
Dim endPos As Long
Application.ScreenUpdating = False
'Assigning object variables
Set WordApp = GetObject(, "Word.Application")
Set ExcelApp = GetObject(, "Excel.Application")
Set WordDoc = WordApp.ActiveDocument
Set MySheet = Application.ActiveWorkbook.ActiveSheet
'Set MySheet = ExcelApp.ActiveWorkbook.ActiveSheet
Set Rng = WordApp.ActiveDocument.Content
TextToFind1 = "REGON 364061169, NIP 951-24-09-783,"
TextToFind2 = "- ad."
'InStr function returns a Variant (Long) specifying the position of the first occurrence of one string within another.
startPos = InStr(1, Rng, TextToFind1) - 1 'here we get 1421, we're looking 4 "TextToFind1"
endPos = InStr(1, Rng, TextToFind2) - 1 'here we get 2246, we're looking 4 "- ad."
If startPos = 0 Or endPos = 0 Then Exit Sub
Rng.SetRange Start:=startPos, End:=endPos
Debug.Print Rng.Paragraphs.Count
If startPos = 0 Or endPos = 0 Then
MsgBox ("Client's names were not found!")
Else
'somewhere here I need your help to write some lines that will
'recognize how many words need to be pulled to extract the full
'name/names + surname/surnames and nothing else - we end on the first coma.
For Each Para In Rng.Paragraphs
firstName = Trim$(Para.Range.Words(3))
Debug.Print Para.Range.Words(1) & Para.Range.Words(2) & _
Para.Range.Words(3) & Para.Range.Words(4) & _
Para.Range.Words(5) & Para.Range.Words(6)
Next Para
End If
End Sub
There in the For Each Para ... Next Para
loop, I need your help to write some lines that will recognize how many words need to be pulled to extract the full name/names + surname/surnames and nothing else - we end on the first coma - that means excluding the coma after the last surname.
This is actually question on Natural Language Processing - NLP (as scientific field) - problem of Tokenization you already solved, but you are now facing with: Part of Sentence (POS) tagging (and construction of semantic Chunk-s). Issue of Person Identity recognition is one of first since very dawn of the discipline, so, there are many libraries for this, but I'm skeptic you'll find anything you can use from VBA that easy. Also, for comprehensive/robust solution you'll need proper lexicon with names and lastnames (suitable resource would be a Morpho-sintactic Lexicon/Dictionary, extracted only for PName LName tags).
To solve this, you'll have to do some research on keywords mentioned above. I have no experience with VBA, but asume C# and .NET would be the easiest to integrate. Therefore, Standford CoreNLP for .NET would be starting point I recon: https://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html
Highly relevant lexical source for English: https://wordnet.princeton.edu/
You can query WordNet lexicon using WordNet.NET . I've used both and its very good resource&library.
This might do what you want, if I have understood correctly.
If the Number
and Dot
are NOT part of the paragraph, then to extract the full name, you could use:
Debug.Print Left(Para, InStr(Para, ",") - 1)
If the Number
and Dot
ARE part of the paragraph, then:
Dim Start As Long, Length As Long
Start = InStr(Para, ".") + 1
Length = InStr(Para, ",") - Start
Debug.Print Trim(Mid(Para, Start, Length))
If you want to split the capitalized portion of the name from the rest, post several samples of the actual strings as TEXT as I am unable to copy/paste your screenshots into Excel.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.