
Capture all capitalized words in a row in a capture group only if they are before the end of the string or if they are before a punctuation mark or \n

import re

def test_extraction_func(input_text):
    word = ""
    try_save_name = False #Not stop yet

    #Here I have tried to concatenate a substring at the end to try to establish it as a delimiter 
    # that allows me to identify whether or not it is the end of the sentence
    input_text_aux = input_text + "3hsdfhdg11sdf"

    name_capture_pattern_01 = r"([A-ZÁÉÍÓÚÜÑ][a-záéíóúüñ]+(?:\s*[A-ZÁÉÍÓÚÜÑ][a-záéíóúüñ]+)*)"

    regex_pattern_01 = r"(?i:no)\s*(?i:identifiques\s*como\s*(?:a\s*un|un|)\s*nombre\s*(?:a\s*el\s*nombre\s*(?:de|)|al\s*nombre|a)\s+)" + name_capture_pattern_01 + r"\s*(?:\.\s*\n|;|,|)" + r"\s*<3hsdfhdg11sdf"

    n1 = re.search(regex_pattern_01, input_text_aux)
    if n1 and word == "" and try_save_name == False:
        word, = n1.groups()
        if(word == None or word == "" or word == " "): 
            print("ERROR!!")
        else:
            try_save_name = True
            word = word.strip()
            print(repr(word))  # --> print the substring that I captured with the capturing group

    else: print(repr(word)) # --> NOT CAPTURED ""


input_text = "SAFJHDFH no identifiques como nombre a María del Carmen asjdjhs" #example 1 (NOT CAPTURE)
input_text = "no identifiques como nombre a María Carmen" #example 2 (CAPTURE)
input_text = "sagasghas no identifiques como a un nombre a María Carmen Azul, k9kfjfjfd" #example 3 (CAPTURE)
input_text = "sagasghas no identifiques como a un nombre a María Carmen Azul; Aun que no estoy realmente segura de ello" #example 4 (CAPTURE)
input_text = "no identifiques como nombre a María hghdshgsd" #example 5 (NOT CAPTURE)

test_extraction_func(input_text)

I want to extract a run of consecutive words that each begin with a capital letter, using ([A-ZÁÉÍÓÚÜÑ][a-záéíóúüñ]+(?:\s*[A-ZÁÉÍÓÚÜÑ][a-záéíóúüñ]+)*), but only when the whole pattern matches (that is, n1 is not None) and the capturing group is either at the end of the sentence or followed by sentence-boundary punctuation such as ".\n", ".", "," or ";". (I have made this punctuation optional, since it is often omitted even when the sentence does end.)

I've tried marking the end of the string by concatenating a generic delimiter, "3hsdfhdg11sdf", and storing the result in a helper string, input_text_aux; the delimiter is content that a user is unlikely to type in input_text. However, this did not work: it prevents every one of the examples from being detected.

Note that the pattern should not match in every example, so these are the expected console prints:

""                      #for the example 1
"María Carmen"          #for the example 2
"María Carmen Azul"     #for the example 3
"María Carmen Azul"     #for the example 4
""                      #for the example 5

The special delimiter string makes it impossible to match anything that is not at the very end of the input. Moreover, your regex prefixes it with "<", a character that never appears in the input, so the pattern can never match.

Instead of introducing this delimiter, use $, which matches at the end of the input string.
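As a small illustration of how $ behaves in Python's re module (the helper name ends_with_carmen is made up for this sketch and is not part of the question's code): without re.MULTILINE, $ matches at the very end of the string and also just before a single trailing newline, while \Z matches only at the very end.

```python
import re

def ends_with_carmen(text, anchor="$"):
    # Hypothetical helper: True if "Carmen" sits at the position the anchor accepts.
    return re.search("Carmen" + anchor, text) is not None

print(ends_with_carmen("María Carmen"))            # True
print(ends_with_carmen("María Carmen\n"))          # True: $ also matches before a final newline
print(ends_with_carmen("María Carmen Azul"))       # False
print(ends_with_carmen("María Carmen\n", r"\Z"))   # False: \Z is strict end of string
```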

The regex could have this as the part after name_capture_pattern_01 :

r"\s*(?:[.\n;,]|$)"

This will test for the end of a phrase. Add more characters (like ! or ? ) in the character class (between the square brackets) as desired.
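Putting this together, here is a minimal sketch of the function with the delimiter removed and the suggested ending appended. The name and prefix patterns are copied from the question; the function name extract_name and the constant names are assumptions made for this example, and the function returns the captured name instead of printing it.

```python
import re

# Name pattern and prefix taken unchanged from the question.
NAME = r"([A-ZÁÉÍÓÚÜÑ][a-záéíóúüñ]+(?:\s*[A-ZÁÉÍÓÚÜÑ][a-záéíóúüñ]+)*)"
PREFIX = (r"(?i:no)\s*(?i:identifiques\s*como\s*(?:a\s*un|un|)\s*nombre\s*"
          r"(?:a\s*el\s*nombre\s*(?:de|)|al\s*nombre|a)\s+)")
# Ending suggested above: a punctuation mark, a newline, or the end of the input.
END = r"\s*(?:[.\n;,]|$)"

PATTERN = re.compile(PREFIX + NAME + END)

def extract_name(input_text):
    """Return the captured name, or "" when the pattern does not match."""
    match = PATTERN.search(input_text)
    return match.group(1).strip() if match else ""

examples = [
    "SAFJHDFH no identifiques como nombre a María del Carmen asjdjhs",
    "no identifiques como nombre a María Carmen",
    "sagasghas no identifiques como a un nombre a María Carmen Azul, k9kfjfjfd",
    "sagasghas no identifiques como a un nombre a María Carmen Azul; Aun que no estoy realmente segura de ello",
    "no identifiques como nombre a María hghdshgsd",
]
for text in examples:
    print(repr(extract_name(text)))
```

Examples 1 and 5 give an empty string because the capitalized run is followed by a lowercase word rather than by punctuation or the end of the input.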

