简体   繁体   中英

pull full names and surname/surnames form the consecutive paragraphs in Excel / Word VBA

First Things First: The macro is run from Excel VBA editor, but performs the biggest part of its job on the previously opened Word document, where it's goal is to find the full names of the people who are the contracting parties in the agreement being analized.

The issue I'm experiencing with the code is that it is variable number of words, that I need to pull from every consecutive paragraph. If the name is Will SMITH, then its two words I need to pull, when it's Carrie Ann MOSS, then it's three words, sometimes it can be Anna Nicole SMITH BURKE, than its four words but when it's Anna Nicole SMITH-BURKE, than its five words and so on.

The other idea to get this full name is, that it always ends with a coma, and this coma is always the first coma in this paragraph, where the full name appears.

ATTENTION !!! The Paragraphs we work with are not ListParagraphs . They are the normal/ordinary ones albeit indented and numbered. I get these contracts from people who don't care to use numbered list :-(
So for the last time: The numbered list is not enabled on those paragraphs we work with.

This is how it looks like in Word and the selected words are the names and surnames that the macro is supposed to extract from the document - excluding the coma after the last surname.

在此处输入图片说明

Sub FindNamesCleanDraftWithLoop()
    'Variables declaration
    Dim WordApp As Word.Application
    Dim WordDoc As Word.Document
    Dim ExcelApp As Excel.Application
    Dim MySheet As Excel.Worksheet
    Dim Para As Word.Paragraph
    Dim Rng As Word.Range
    Dim RngStart As Word.Range
    Dim RngEnd As Word.Range

    Dim TextToFind1 As String
    Dim TextToFind2 As String
    Dim firstName As String
    Dim startPos As Long
    Dim endPos As Long

    Application.ScreenUpdating = False

    'Assigning object variables
    Set WordApp = GetObject(, "Word.Application")
    Set ExcelApp = GetObject(, "Excel.Application")
    Set WordDoc = WordApp.ActiveDocument
    Set MySheet = Application.ActiveWorkbook.ActiveSheet
    'Set MySheet = ExcelApp.ActiveWorkbook.ActiveSheet
    Set Rng = WordApp.ActiveDocument.Content
    TextToFind1 = "REGON 364061169, NIP 951-24-09-783,"
    TextToFind2 = "- ad."

    'InStr function returns a Variant (Long) specifying the position of the first occurrence of one string within another.
    startPos = InStr(1, Rng, TextToFind1) - 1    'here we get 1421, we're looking 4 "TextToFind1"
    endPos = InStr(1, Rng, TextToFind2) - 1      'here we get 2246, we're looking 4 "- ad."
    If startPos = 0 Or endPos = 0 Then Exit Sub
    Rng.SetRange Start:=startPos, End:=endPos
    Debug.Print Rng.Paragraphs.Count

    If startPos = 0 Or endPos = 0 Then
        MsgBox ("Client's names were not found!")
    Else
        'somewhere here I need your help to write some lines that will
        'recognize how many words need to be pulled to extract the full
        'name/names + surname/surnames and nothing else - we end on the first coma.
        For Each Para In Rng.Paragraphs
            firstName = Trim$(Para.Range.Words(3))
            Debug.Print Para.Range.Words(1) & Para.Range.Words(2) & _
                        Para.Range.Words(3) & Para.Range.Words(4) & _
                        Para.Range.Words(5) & Para.Range.Words(6)
        Next Para
    End If
End Sub

There in the For Each Para ... Next Para loop, I need your help to write some lines that will recognize how many words need to be pulled to extract the full name/names + surname/surnames and nothing else - we end on the first coma - that means excluding the coma after the last surname.

This is actually question on Natural Language Processing - NLP (as scientific field) - problem of Tokenization you already solved, but you are now facing with: Part of Sentence (POS) tagging (and construction of semantic Chunk-s). Issue of Person Identity recognition is one of first since very dawn of the discipline, so, there are many libraries for this, but I'm skeptic you'll find anything you can use from VBA that easy. Also, for comprehensive/robust solution you'll need proper lexicon with names and lastnames (suitable resource would be a Morpho-sintactic Lexicon/Dictionary, extracted only for PName LName tags).

To solve this, you'll have to do some research on keywords mentioned above. I have no experience with VBA, but asume C# and .NET would be the easiest to integrate. Therefore, Standford CoreNLP for .NET would be starting point I recon: https://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html

Highly relevant lexical source for English: https://wordnet.princeton.edu/

You can query WordNet lexicon using WordNet.NET . I've used both and its very good resource&library.

This might do what you want, if I have understood correctly.

If the Number and Dot are NOT part of the paragraph, then to extract the full name, you could use:

Debug.Print Left(Para, InStr(Para, ",") - 1)

If the Number and Dot ARE part of the paragraph, then:

   Dim Start As Long, Length As Long
Start = InStr(Para, ".") + 1
Length = InStr(Para, ",") - Start

Debug.Print Trim(Mid(Para, Start, Length))

If you want to split the capitalized portion of the name from the rest, post several samples of the actual strings as TEXT as I am unable to copy/paste your screenshots into Excel.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM