Extract parenthesized acronyms and abbreviations based on letter count and length

I realize this has already been addressed here (e.g., Retrieve definition for parenthesized abbreviation, based on letter count). Nevertheless, I believe this question is different.

I want to extract parenthesized acronyms and abbreviations from a given string.

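import re

def extract_acronyms_abbreviations(text):
    eaa = {}

    # Treat every parenthesized span as a candidate abbreviation.
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the abbreviation has characters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa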

But the above function treats both (such as Apple's Siri, Amazon's Alexa, or Google's Voice Assistant) and (MSWC) as acronyms. It treats everything inside parentheses as an acronym.

In my case, I want to extract all the abbreviations and acronyms. Acronyms are capitalized, and the acronym inside the parentheses is fewer than 8 characters long. If the definition contains the word "and", I need to include one more word.

Sample text:

text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""

Output:

extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
 'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}

Desired output:

{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}

You can change the regex to this: r"\(([A-Z]{1,7})\)". This matches only the capital letters A-Z and ensures that the acronym is 1 to 7 characters long.
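
A minimal sketch of how a pattern like r"\(([A-Z]{1,7})\)" could be combined with the question's function. It rests on two assumptions that the question does not spell out: the "one more word" rule is read as "the connector 'and' does not count toward the acronym's letters", and words are tokenized with a small regex rather than split() so that a fragment such as "Assistant).Resource" yields "Resource" on its own. The names ACRONYM_RE, WORD_RE, and extract_acronyms are illustrative and not part of the original post.

```python
import re

# Up to 7 capital letters inside parentheses (pattern from the answer).
ACRONYM_RE = re.compile(r"\(([A-Z]{1,7})\)")
# Assumed tokenizer: runs of letters, digits, apostrophes, and hyphens.
WORD_RE = re.compile(r"[A-Za-z0-9'’-]+")


def extract_acronyms(text):
    """Map each parenthesized acronym to the words that precede it."""
    result = {}
    for match in ACRONYM_RE.finditer(text):
        abbr = match.group(1)
        words = WORD_RE.findall(text[:match.start()])
        needed = len(abbr)  # one definition word per acronym letter
        picked = []
        # Walk backwards from the parenthesis, collecting words.
        for word in reversed(words):
            if needed == 0:
                break
            picked.append(word)
            # Assumption: "and" is kept but does not use up a letter.
            if word.lower() != "and":
                needed -= 1
        result[abbr] = " ".join(reversed(picked))
    return result
```

With the sample text from the question, this sketch should return the two entries listed under the desired output, since the long parenthesized phrase no longer matches the pattern. Other connectors ("of", "for", and so on) are not handled and would need to be added to the skip check.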
