Extract parenthesized acronyms and abbreviations based on letter count and length
I do realize this has already been addressed here (e.g., Retrieve definition for parenthesized abbreviation, based on letter count). Nevertheless, I hope this question is different.
I want to extract parenthesized acronyms and abbreviations from a given string.
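import re

def extract_acronyms_abbreviations(text):
    eaa = {}
    # Every parenthesized span is treated as a candidate abbreviation.
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the abbreviation has characters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)
        eaa[abbr] = definition
    return eaa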
But the above function treats both (such as Apple's Siri, Amazon's Alexa, or Google's Voice Assistant) and (MSWC) as acronyms. It treats every string in parentheses as an acronym.
In my case, I want to extract only the abbreviations and acronyms: acronyms are capitalized, and the acronym inside the parentheses is fewer than 8 characters long. Also, if the definition contains the word "and", I need to include one more word.
Sample Text:
text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""
Output
extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}
Desired Output
{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}
You can change the regex to this: r"\(([A-Z]{1,7})\)". This matches only the capital letters A-Z and also makes sure that the acronym is 1 to 7 characters long.
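A stricter pattern alone still leaves 'RIM' mapped to 'and Information Management', because the word "and" uses up one of the three word slots. The following is a minimal sketch of one way to combine the pattern with the "one more word per and" rule from the question. The names extract_acronyms, ACRONYM and WORD are hypothetical, and tokenizing with a word regex instead of split() is an assumption made so that run-together text such as "Assistant).Resource" does not leak into the definition.

```python
import re

# Parenthesized run of 1 to 7 capital letters, e.g. "(MSWC)".
ACRONYM = re.compile(r"\(([A-Z]{1,7})\)")
# Assumption: a "word" is a run of letters, optionally joined by - or an apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:[-'’][A-Za-z]+)*")

def extract_acronyms(text):
    """Map each parenthesized acronym to the words that precede it."""
    result = {}
    for match in ACRONYM.finditer(text):
        abbr = match.group(1)
        words = WORD.findall(text[:match.start()])
        needed = len(abbr)  # one definition word per acronym letter
        picked = []
        # Walk backwards; "and" is kept but does not count toward the letters.
        for word in reversed(words):
            if needed == 0:
                break
            picked.append(word)
            if word.lower() != "and":
                needed -= 1
        result[abbr] = " ".join(reversed(picked))
    return result

sample = ("We announce the Multilingual Spoken Words Corpus (MSWC), used in "
          "smart devices (such as Siri or Alexa).Resource and Information "
          "Management (RIM)")
print(extract_acronyms(sample))
# {'MSWC': 'Multilingual Spoken Words Corpus', 'RIM': 'Resource and Information Management'}
```

Only "and" is skipped here; other connecting words such as "of" or "for" would need to be added to the same check if the acronyms in your data omit them too.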