
Extract parenthesized acronyms and abbreviations based on letter count and length

I realize this has already been addressed here (e.g., Retrieve definition for parenthesized abbreviation, based on letter count). Nevertheless, I hope this question is different.

I want to extract parenthesized acronyms and abbreviations from a given string.

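import re

def extract_acronyms_abbreviations(text):
    eaa = {}

    # Treat everything inside a pair of parentheses as an abbreviation.
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the abbreviation has characters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa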

But the above function treats both (such as Apple's Siri, Amazon's Alexa, or Google's Voice Assistant) and (MSWC) as acronyms, because it accepts any text inside parentheses.

In my case, I want to extract only abbreviations and acronyms that are capitalized and fewer than 8 characters long inside the parentheses. And if the definition contains the word "and", I need to include one more word.

Sample Text:

text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""

Output

extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
 'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}

Desired Output

{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}

You can change the regex to this: r"\(([A-Z]{1,7})\)". This matches only the capital letters A-Z and ensures that the acronym is 1 to 7 characters long.
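Here is a minimal sketch of how a stricter pattern (uppercase letters only, 1 to 7 characters) could be combined with the "and" rule from the question. The function name extract_acronyms, the CONNECTORS set, and the word-matching regex are assumptions made for illustration, not something taken from the original post. Words are collected with a regex rather than split(), because the sample text has no space in "Assistant).Resource".

```python
import re

# Assumption: words that do not contribute a letter to the acronym.
CONNECTORS = {"and"}


def extract_acronyms(text, max_len=7):
    """Map each parenthesized all-caps acronym to the words before it."""
    result = {}
    pattern = rf"\(([A-Z]{{1,{max_len}}})\)"
    for match in re.finditer(pattern, text):
        abbr = match.group(1)
        # Collect alphabetic words (allowing inner hyphens/apostrophes)
        # that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z]+(?:[-'’][A-Za-z]+)*", text[:match.start()])
        picked = []
        needed = len(abbr)
        # Walk backwards; connector words are kept but not counted.
        for word in reversed(words):
            if needed == 0:
                break
            picked.append(word)
            if word.lower() not in CONNECTORS:
                needed -= 1
        result[abbr] = " ".join(reversed(picked))
    return result


sample = (
    "the Multilingual Spoken Words Corpus (MSWC), smart devices "
    "(such as Siri).Resource and Information Management (RIM)"
)
print(extract_acronyms(sample))
```

This only handles acronyms where each letter corresponds to one word, which is the same assumption the original function makes.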
